Alphabet A finite set of symbols.
Language A set of words/sentences, i.e., sequences of symbols from the alphabet.
Different ways to define languages:
by enumerating all elements, e.g.
Teachers ≝ {"Lawrence","Johan"}
by using a predicate, e.g.
Ident ≝ {x ∈ AsciiStrings | x contains only printable characters }
by using an inductive definition, e.g.
Let PAL be the smallest language such that:
- ε is in PAL,
- a, b, c are in PAL,
- if P is in PAL, then aPa, bPb, and cPc are also in PAL.
This lecture: inductive definitions, with less verbosity!
Let PAL be the smallest language such that:
- ε is in PAL,
- a, b, c are in PAL,
- if P is in PAL, then aPa, bPb, and cPc are also in PAL.
In the new notation:
P → ε
P → a
P → b
P → c
P → aPa
P → bPb
P → cPc
The new notation is called a (formal) grammar.
The grammar contains:
- symbols of the alphabet (a, b, c), aka ‘terminals’
- auxiliary symbols (P), aka ‘non-terminals’
- rewrite rules (‘productions’):
P → ε
P → a
P → b
P → c
P → aPa
P → bPb
P → cPc
We can ‘play’ the context-free grammar to generate strings in the language.
P
⇒ aPa
⇒ acPca
⇒ accPcca
⇒ accbcca
We call such a sequence a derivation.
We say “P generates the string accbcca”.
We say “P generates the language {ε, a, b, c, aa, bb, acca, ...}”.
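Looking ahead to the abstract-syntax idea later in this lecture, here is a small Haskell sketch of the PAL grammar (the names Pal, Eps, WrapA, toStr, etc. are our own choices, not part of the grammar): each production becomes a constructor, and a simple function recovers the generated string.
data Pal = Eps | A | B | C | WrapA Pal | WrapB Pal | WrapC Pal

-- recover the string that a Pal value derives
toStr :: Pal → String
toStr Eps       = ""
toStr A         = "a"
toStr B         = "b"
toStr C         = "c"
toStr (WrapA p) = "a" ++ toStr p ++ "a"
toStr (WrapB p) = "b" ++ toStr p ++ "b"
toStr (WrapC p) = "c" ++ toStr p ++ "c"
For example, toStr (WrapA (WrapC (WrapC B))) evaluates to "accbcca", mirroring the derivation above.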
S → A
S → B
A → a
A → AA
B → b
B → BB
Grammars can have multiple nonterminals.
Question: What are the non-terminals in this grammar?
Question: What is the language generated by S?
Dig → 0
Dig → 1
Dig → 2
Dig → 3
Dig → 4
Dig → 5
Dig → 6
Dig → 7
Dig → 8
Dig → 9
Abbreviation: write multiple productions on one line with ‘|’.
Dig → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
(We still count ten productions!)
Digs → ε | Dig Digs
This grammar allows sequences with leading zeros:
Digs
⇒ Dig Digs
⇒ Dig Dig Digs
⇒ Dig Dig Dig Digs
⇒ Dig Dig Dig ε
⇒* 007
The symbol ⇒* means that we make multiple (zero or more,
but finitely many) derivation steps at once.
We also allow the star notation on the right hand side of a grammar to abbreviate zero or more occurrences of symbols:
Digs → Dig*
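In Haskell terms (just a sketch, with invented constructor names), the star is naturally modelled by a list: a sequence of digits is a value of type [Dig].
data Dig = D0 | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 | D9   -- Dig → 0 | 1 | ... | 9

type Digs = [Dig]   -- Digs → Dig* : zero or more digits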
To disallow leading zeros we introduce another nonterminal:
PDig → 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Nat → 0 | PDig Digs
Sign → + | -
Int → Sign Nat | Nat
The sign is optional.
There is an abbreviation for optional symbols as well:
Int → Sign? Nat
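Continuing the sketch above (again with invented names), the optional symbol corresponds to Maybe in Haskell:
data PDig = P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9   -- PDig → 1 | ... | 9
data Nat  = NZero | NPos PDig Digs                       -- Nat → 0 | PDig Digs
data Sign = SPlus | SMinus                               -- Sign → + | -
data Int' = Int' (Maybe Sign) Nat                        -- Int → Sign? Nat : the sign may be absent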
Letters are like digits.
SLetter → a | b | ... | z
CLetter → A | B | ... | Z
(52 productions in total.)
Letter → SLetter | CLetter
In many languages, identifiers must not start with a number, but can have numbers following an initial letter.
Identifier → SLetter AlphaNum*
AlphaNum → Letter | Dig
Variations are easy to define (e.g. allowing ‘_’).
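As a small sketch (the function name isIdentifier is our own), the Identifier grammar can be checked directly in Haskell:
-- Identifier → SLetter AlphaNum*  over the ASCII alphabet
isIdentifier :: String → Bool
isIdentifier []     = False
isIdentifier (c:cs) = c `elem` ['a'..'z'] && all isAlphaNumChar cs
  where
    isAlphaNumChar x = x `elem` (['a'..'z'] ++ ['A'..'Z'] ++ ['0'..'9'])
For example, isIdentifier "x42" is True, while isIdentifier "4x" is False.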
Stat → Var = Expr ;
| if ( Expr ) Stat else Stat
| while ( Expr ) Stat
Expr → Integer
| Var
| Expr Op Expr
Var → Identifier
Op → + | - | *
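Anticipating the abstract-syntax idea below, each production of this grammar could become a constructor of a Haskell datatype (a sketch; the constructor names, and the use of Haskell's Integer and String for the Integer and Identifier tokens, are our own choices):
data Stat = Assign Var Expr     -- Stat → Var = Expr ;
          | If Expr Stat Stat   --      | if ( Expr ) Stat else Stat
          | While Expr Stat     --      | while ( Expr ) Stat
data Expr = Lit Integer         -- Expr → Integer
          | EVar Var            --      | Var
          | BinOp Expr Op Expr  --      | Expr Op Expr
data Op   = Add | Sub | Mul     -- Op → + | - | *
type Var  = String              -- Var → Identifier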
One nonterminal in the grammar is called the start symbol.
This lets us be lazy and say
the language generated by the grammar
instead of
the language generated by the start symbol
We usually assume the first symbol is the start symbol. The start
symbol is often called S.
The grammars we consider are restricted: the left-hand side of every production is a single nonterminal.
Grammars with this restriction are called context-free.
Some facts about context-free grammars:
Languages that can be described by a context-free grammar are called context-free languages.
Context-free languages are relatively algorithm-friendly
Many languages cannot be described by a context-free grammar.
Many programming languages are (almost but not quite) context-free
Multiple grammars may describe the same language.
S → aS
S → a
S → Sa
S → a
S → SS
S → a
S → AS
S → A
A → a
We call grammars that generate the same language “weakly equivalent”.
Consider the grammar:
S → SS
S → a
These are three derivations of aaa:
S ⇒ SS ⇒ aS ⇒ aSS ⇒ aSa ⇒ aaa
S ⇒ SS ⇒ aS ⇒ aSS ⇒ aaS ⇒ aaa
S ⇒ SS ⇒ Sa ⇒ SSa ⇒ aSa ⇒ aaa
Question: How is the third different from the others?
We can visualize a derivation as a parse tree:
(SEE BOARD)
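A rough sketch of what appears on the board: the first two derivations of aaa correspond to one and the same parse tree, while the third derivation yields a different tree.
        S                    S
       / \                  / \
      S   S                S   S
      |  / \              / \  |
      a S   S            S   S a
        |   |            |   |
        a   a            a   a
(first two derivations)  (third derivation)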
A grammar where every sentence corresponds to a unique parse tree is called unambiguous.
If this is not the case, the grammar is called ambiguous.
The grammar
S → SS
S → a
is thus ambiguous.
Question: Why are ambiguous grammars bad?
Given the (ambiguous) grammar:
S → S-S
S → 1
The sentence 1-1-1 corresponds to two parse trees:
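The trees themselves are not reproduced here; as a sketch (assuming the left tree groups the expression as 1-(1-1) and the right tree as (1-1)-1, consistent with the values below):
         S                         S
      /  |  \                   /  |  \
     S   -   S                 S   -   S
     |     / | \             / | \     |
     1    S  -  S           S  -  S    1
          |     |           |     |
          1     1           1     1
    left tree: 1-(1-1)   right tree: (1-1)-1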
Evaluating these expressions (where ‘-’ denotes
subtraction):
the left tree corresponds to the value 1,
the right tree corresponds to the value -1.
Hence, ambiguous grammars lead to ambiguous semantics.
Later, we will also see that ambiguous grammars can cause inefficiency.
Who said ‘-’ performs subtraction?
1-1-1 could mean the value 1 or -1,
1-1-1 could mean the 1st of January in year 1,
1-1-1 could mean that the first item in a table should be copied three times,
…
A language defines which sentences are syntactically correct. Processing these sentences (‘semantics’) is a separate step.
S → if b then S else S
| if b then S
| a
Question: How do you parse if b then if b then a else a?
Exercise 2.17
All of these grammars describe the same language:
S → aS
S → a
S → Sa
S → a
S → SS
S → a
S → AS
S → A
A → a
Which are ambiguous?
Note: some CF languages have only ambiguous grammars.
A grammar transformation is a mapping from one grammar to another, such that the generated language remains the same.
Formally:
A grammar transformation maps a grammar G to another grammar G' such that LangOf(G) = LangOf(G').
Grammar transformations can help us to transform grammars with undesirable properties (such as ambiguity) into grammars with other (hopefully better) properties.
Most grammar transformations are motivated by facilitating parsing.
Given a grammar G and a string s:
The recognition problem is to decide whether or not
s ∈ LangOf(G)
The parsing problem is to decide whether or not
s ∈ LangOf(G), and provide a parse tree if it is.
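For a concrete (if naive) illustration, here is a Haskell sketch of both problems for the PAL grammar from earlier, reusing the Pal datatype from the sketch above (the function names are our own):
-- recognition: is the string in LangOf(PAL)?
recognizePAL :: String → Bool
recognizePAL s = all (`elem` "abc") s && s == reverse s

-- parsing: additionally produce a parse tree (here: a Pal value) when one exists
parsePAL :: String → Maybe Pal
parsePAL ""  = Just Eps
parsePAL "a" = Just A
parsePAL "b" = Just B
parsePAL "c" = Just C
parsePAL (x:xs)
  | not (null xs) && x `elem` "abc" && last xs == x =
      fmap (wrap x) (parsePAL (init xs))
  | otherwise = Nothing
  where
    wrap 'a' = WrapA
    wrap 'b' = WrapB
    wrap _   = WrapC
For example, parsePAL "accbcca" yields Just (WrapA (WrapC (WrapC B))), while parsePAL "ab" yields Nothing.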
Consider this grammar (What is the language? Is it ambiguous?)
S → S-D | D
D → 0 | 1
The string 1-0-1 parses to a tree that groups the expression as (1-0)-1.
Idea: Let us represent nonterminals as datatypes:
In every node of the parse tree, we have a choice between one of the productions for the nonterminal in question.
If we want to build a value of a Haskell datatype, we have a choice between any of that datatype’s constructors.
Hence, productions become constructors.
S → S-D | D
D → 0 | 1
data S = Minus S D | SingleDigit D
data D = Zero | One
Both describe the language.
The string 1-0-1 corresponds to the abstract syntax
Minus (Minus (SingleDigit One) Zero) One
S → S-D | D
D → 0 | 1
data S = Minus S D | SingleDigit D
data D = Zero | One
A semantic function consumes an abstract syntax. These are very useful for building compilers!
e.g. pretty-printing
printS :: S → String
printS (Minus s d) = printS s ++ "-" ++ printD d
printS (SingleDigit d) = printD d
printD :: D → String
printD Zero = "0"
printD One = "1"
printS (Minus (Minus (SingleDigit One) Zero) One )
evaluates to "1-0-1"
S → S-D | D
D → 0 | 1
data S = Minus S D | SingleDigit D
data D = Zero | One
A semantic function consumes an abstract syntax. These are very useful for building compilers!
e.g. evaluation
evalS :: S → Int
evalS (Minus s d) = evalS s - evalD d
evalS (SingleDigit d) = evalD d
evalD :: D → Int
evalD Zero = 0
evalD One = 1
evalS (Minus (Minus (SingleDigit One) Zero) One)
evaluates to 0
Grammar A way to describe a language inductively.
Production A rewrite rule in a grammar.
Weakly equivalent Grammars that generate the same language.
Context-free The class of grammars/languages we consider.
Nonterminal Auxiliary symbols in a grammar.
Terminal Alphabet symbols in a grammar.
Derivation Successively rewriting from a grammar until we reach a sentence.
Parse Tree Tree representation of a derivation.
Ambiguity Multiple parse trees for the same sentence.
Abstract Syntax (Haskell) Datatype corresponding to a grammar.
Semantics Function defined on the abstract syntax.
We use language to talk about other languages.
We may use English words in a formal language; they’re different from the English words we use to talk about them.
We may use Haskell code to implement a language like Haskell, or C# code to implement a language like C#.
Don’t confuse the two levels.
Alphabet A finite set of symbols.
Language A set of words/sentences, i.e., sequences of symbols from the alphabet.
Grammar A way to define a language by means of rewrite rules.
Derivation A proof that a word is in the language described by a grammar.