Previous lecture

Alphabet: A finite set of symbols.
Language: A set of words/sentences, i.e., sequences of symbols from the alphabet.

Different ways to define languages:

by enumerating all elements, e.g.
```
  Teachers ≝ {"Lawrence","Johan"}
```

by using a predicate, e.g.

  Ident ≝ {x ∈ AsciiStrings
          | x contains only printable characters }

by using an inductive definition, e.g.
Let PAL be the smallest language such that:
- ε is in PAL
- a, b, c are in PAL,
- if P is in PAL, then aPa, bPb, and cPc are also in PAL.

This lecture: inductive definitions, with less verbosity!

Palindromes, concisely

Let PAL be the smallest language such that:

ε is in PAL

a, b, c are in PAL,

if P is in PAL, then aPa, bPb, and cPc are also in PAL.

P → ε
P → a 
P → b 
P → c
P → aPa
P → bPb 
P → cPc

The new notation is called a (formal) grammar.

The grammar contains:

Rewrite rules
Symbols from the alphabet (e.g. a,b,c), aka ‘terminals’
Extra symbols not in the alphabet (e.g. P), aka ‘non-terminals’

Derivation

P → ε
P → a 
P → b 
P → c
P → aPa
P → bPb 
P → cPc

We can ‘play’ the context-free grammar to generate strings in the language.

      P
⇒    aPa
⇒   acPca
⇒  accPcca
⇒  accbcca

We call such a sequence a derivation.

We say “P generates the string accbcca”.

We say “P generates the language {ε,a,b,c,aa,bb,acca,...}”.

Multiple nonterminals

S → A 
S → B 
A → a 
A → AA 
B → b
B → BB

Grammars can have multiple nonterminals.

Question: What are the non-terminals in this grammar?

Question: What is the language generated by S?

Language of (single) digits

Dig → 0
Dig → 1
Dig → 2
Dig → 3
Dig → 4
Dig → 5
Dig → 6
Dig → 7
Dig → 8
Dig → 9

Abbreviation: write multiple productions on one line with ‘|’

Dig → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

(We still count ten productions!)

Sequences of digits

Digs → ε | Dig Digs

This grammar allows sequences with leading zeros:

    Digs
⇒   Dig Digs
⇒   Dig Dig Digs
⇒   Dig Dig Dig Digs 
⇒   Dig Dig Dig ε
⇒*  007

The symbol ⇒* means that we make multiple (zero or more, but finitely many) derivation steps at once.

We also allow the star notation on the right hand side of a grammar to abbreviate zero or more occurrences of symbols:

Digs → Dig*

Natural numbers

To disallow leading zeros we introduce another nonterminal:

PDig → 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Nat → 0 | PDig Digs

Integers

Sign → + | -
Int → Sign Nat | Nat

The sign is optional.

There is an abbreviation for optional symbols as well:

Int → Sign? Nat

Letters

Letters are like digits.

SLetter → a | b | ... | z 
CLetter → A | B | ... | Z

(52 productions in total.)

Letter → SLetter | CLetter

Identifiers

In many languages, identifiers must not start with a number, but can have numbers following an initial letter.

Identifier → SLetter AlphaNum*
AlphaNum → Letter | Dig

Variations are easy to define (e.g. allowing ‘_’).

A fragment of C

Stat → Var = Expr ;
     | if ( Expr ) Stat else Stat
     | while ( Expr ) Stat

Expr → Integer
     | Var
     | Expr Op Expr

Var → Identifier

Op → + | - | *

Start symbol

One nonterminal in the grammar is called the start symbol.

This lets us be lazy and say

the language generated by the grammar

instead of

the lanauge generated by the start symbol

We usually assume the first symbol is the start symbol. The start symbol is often called S.

Context-free grammars

The grammars we consider are restricted:

the left hand side of a production always consists of a single nonterminal

Grammars with this restriction are called context-free.

Some facts about context-free grammars:

Languages that can be described by a context-free grammar are called context-free languages.
Context-free languages are relatively algorithm-friendly
Many languages cannot be described by a context-free grammar.
Many programming languages are (almost but not quite) context-free

Multiple grammars for one language

Multiple grammars may describe the same language.

S → aS 
S → a

S → Sa 
S → a

S → SS 
S → a

S → AS 
S → A 
A → a

We call grammars that generate the same language “weakly equivalent”.

Multiple derivations for one sentence

Consider the grammar:

S → SS 
S → a

These are three derivations of aaa:

S ⇒ SS ⇒ aS ⇒ aSS ⇒ aSa ⇒ aaa
S ⇒ SS ⇒ aS ⇒ aSS ⇒ aaS ⇒ aaa
S ⇒ SS ⇒ Sa ⇒ SSa ⇒ aSa ⇒ aaa

Question: How is the third different from the others?

Parse trees

We can visualize a derivation as a parse tree:

(SEE BOARD)

Ambiguity

A grammar where every sentence corresponds to a unique parse tree is called unambiguous.

If this is not the case, the grammar is called ambiguous.

The grammar

S → SS 
S → a

is thus ambiguous.

Question: Why are ambiguous grammars bad?

Ambiguity and parse trees

Given the (ambiguous) grammar:

S → S-S 
S → 1

The sentence 1-1-1 corresponds to two parse trees:

Ambiguity and semantics

Evaluating these expressions (where ‘-’ denotes subtraction):

the left tree corresponds to the value 1,
the right tree corresponds to the value -1.

Hence, ambiguous grammars lead to ambiguous semantics.

Later, we will also see that ambiguous grammars can cause inefficiency.

Beware: syntax vs. semantics

Who who said ‘-’ performs subtraction?

1-1-1 could mean the value 1 or -1,
1-1-1 could mean the 1st of January in year 1,
1-1-1 could mean that the first item in a table should be copied three times,
…

A language defines which sentences are syntactically correct. Processing these sentences (‘semantics’) is a separate step.

Famous Example: Dangling else

S → if b then S else S
  | if b then S
  | a

Question: How do you parse

if b then if b then a else a?

Exercise 2.17

Ambiguity is a property of grammars

All of these grammars describe the same language:

S → aS 
S → a

S → Sa 
S → a

S → SS 
S → a

S → AS 
S → A 
A → a

Which are ambiguous?

Note: some CF languages have only ambiguous grammars.

Grammar transformations

A grammar transformation is a mapping from one grammar to another, such that the generated language remains the same.

Formally:

A grammar transformation maps a grammar G to another grammar G' such that
LangOf(G) = LangOf(G')

Grammar transformations can help us to transform grammars with undesirable properties (such as ambiguity) into grammars with other (hopefully better) properties.

Most grammar transformations are motivated by facilitating parsing.

Parsing, concrete vs abstract syntax

Formally

Given a grammar G and a string s:

The recognition problem is to decide whether or not s ∈ LangOf(G)
The parsing problem is to decide whether or not s ∈ LangOf(G), and provide a parse tree if it is.

Parse trees in Haskell

Consider this grammar (What is the language? Is it ambiguous?)

S → S-D | D 
D → 0 | 1

The string 1-0-1 parses to

Parse trees in Haskell – contd.

Idea: Let us represent nonterminals as datatypes:

In every node of the parse tree, we have a choice between one of the productions for the nonterminal in question.
If we want to build a value of a Haskell datatype, we have a choice between any of that datatype’s constructors.

Hence, productions become constructors.

Concrete and abstract syntax

Concrete

S → S-D | D 
D → 0 | 1

Abstract

data S = Minus S D | SingleDigit D 
data D = Zero | One

Both descibe the language.

The string 1-0-1 corresponds to the abstract syntax

Minus (Minus (SingleDigit One) Zero) One

Semantic functions

Concrete

S → S-D | D 
D → 0 | 1

Abstract

data S = Minus S D | SingleDigit D 
data D = Zero | One

A semantic function consumes an abstract syntax. These are very useful for building compilers!

e.g. pretty-printing

printS :: S → String 
printS (Minus s d) = printS s ++ "-" ++ printS s
printS (SingleDigit d) = printD d 

printD :: D → String 
printD Zero = "0" 
printD One = "1"

printS (Minus (Minus (SingleDigit One) Zero) One )

evaluates to 1-0-1

Semantic functions

Concrete

S → S-D | D 
D → 0 | 1

Abstract

data S = Minus S D | SingleDigit D 
data D = Zero | One

A semantic function consumes an abstract syntax. These are very useful for building compilers!

e.g. evaluation

evalS :: S → Int 
evalS (Minus s d) = evalS s - evalD d 
evalS (SingleDigit d) = evalD d 

evalD :: D → Int 
evalD Zero = 0
evalD One = 1

evalS (Minus (Minus (SingleDigit One) Zero) One)

evaluates to 11

Summary

Grammar A way to describe a language inductively.

Production A rewrite rule in a grammar.

Weakly equivalent Grammars that generate the same language.

Context-free The class of grammars/languages we consider.

Nonterminal Auxiliary symbols in a grammar.

Terminal Alphabet symbols in a grammar.

Derivation Successively rewriting from a grammar until we reach a sentence.

Parse Tree Tree representation of a derivation.

Ambiguity Multiple parse trees for the same sentence.

Abstract Syntax (Haskell) Datatype corresponding to a grammar.

Semantics Function defined on the abstract syntax.

Beware: object & meta language

We use language to talk about other languages.

We may use English words in a formal language; they’re different from the English words we use to talk about them.
We may use Haskell code to implement a language like Haskell, or C#code to implement a language like C#.
Don’t confuse the two levels.

Summary

Alphabet: A finite set of symbols.
Language: A set of words/sentences, i.e., sequences of symbols from the alphabet.
Grammar: A way to define a language by means of rewrite rules.
Parse tree: A proof that a word is in the language described by a grammar.

Grammars and Parsing

Previous lecture

Palindromes, concisely

Derivation

Multiple nonterminals

Examples

Language of (single) digits

Sequences of digits

Natural numbers

Integers

Letters

Identifiers

A fragment of C

Grammar theory

Start symbol

Context-free grammars

Multiple grammars for one language

Ambiguity, Parsing

Multiple derivations for one sentence

Parse trees

Ambiguity

Ambiguity and parse trees

Ambiguity and semantics

Beware: syntax vs. semantics

Famous Example: Dangling else

Ambiguity is a property of grammars

Grammar transformations

Parsing, concrete vs abstract syntax

Formally

Parse trees in Haskell

Parse trees in Haskell – contd.

Concrete and abstract syntax

Concrete

Abstract

Semantic functions

Concrete

Abstract

Semantic functions

Concrete

Abstract

Summary

Beware: object & meta language

Summary