Grammars and Parsing

Previous lecture

Alphabet

A finite set of symbols.

Language

A set of words/sentences, i.e., sequences of symbols from the alphabet.

Different ways to define languages:

This lecture: inductive definitions, with less verbosity!

Palindromes, concisely

Let PAL be the smallest language such that:

  • ε is in PAL
  • a, b, c are in PAL,
  • if P is in PAL, then aPa, bPb, and cPc are also in PAL.
P → ε
P → a 
P → b 
P → c
P → aPa
P → bPb 
P → cPc

The new notation is called a (formal) grammar.

The grammar contains:

Derivation

P → ε
P → a 
P → b 
P → c
P → aPa
P → bPb 
P → cPc

We can ‘play’ the context-free grammar to generate strings in the language.

      P
⇒    aPa
⇒   acPca
⇒  accPcca
⇒  accbcca

We call such a sequence a derivation.

We say “P generates the string accbcca”.

We say “P generates the language {ε,a,b,c,aa,bb,acca,...}”.

Multiple nonterminals

S → A 
S → B 
A → a 
A → AA 
B → b
B → BB

Grammars can have multiple nonterminals.

Question: What are the non-terminals in this grammar?

Question: What is the language generated by S?

Examples

Language of (single) digits

Dig → 0
Dig → 1
Dig → 2
Dig → 3
Dig → 4
Dig → 5
Dig → 6
Dig → 7
Dig → 8
Dig → 9

Abbreviation: write multiple productions on one line with ‘|

Dig → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

(We still count ten productions!)

Sequences of digits

Digs → ε | Dig Digs

This grammar allows sequences with leading zeros:

    Digs
⇒   Dig Digs
⇒   Dig Dig Digs
⇒   Dig Dig Dig Digs 
⇒   Dig Dig Dig ε
⇒*  007

The symbol ⇒* means that we make multiple (zero or more, but finitely many) derivation steps at once.

We also allow the star notation on the right hand side of a grammar to abbreviate zero or more occurrences of symbols:

Digs → Dig*

Natural numbers

To disallow leading zeros we introduce another nonterminal:

PDig → 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Nat → 0 | PDig Digs

Integers

Sign → + | -
Int → Sign Nat | Nat

The sign is optional.

There is an abbreviation for optional symbols as well:

Int → Sign? Nat

Letters

Letters are like digits.

SLetter → a | b | ... | z 
CLetter → A | B | ... | Z

(52 productions in total.)

Letter → SLetter | CLetter

Identifiers

In many languages, identifiers must not start with a number, but can have numbers following an initial letter.

Identifier → SLetter AlphaNum*
AlphaNum → Letter | Dig

Variations are easy to define (e.g. allowing ‘_’).

A fragment of C

Stat → Var = Expr ;
     | if ( Expr ) Stat else Stat
     | while ( Expr ) Stat

Expr → Integer
     | Var
     | Expr Op Expr

Var → Identifier

Op → + | - | *

Grammar theory

Start symbol

One nonterminal in the grammar is called the start symbol.

This lets us be lazy and say

the language generated by the grammar

instead of

the lanauge generated by the start symbol

We usually assume the first symbol is the start symbol. The start symbol is often called S.

Context-free grammars

The grammars we consider are restricted:

Grammars with this restriction are called context-free.

Some facts about context-free grammars:

Multiple grammars for one language

Multiple grammars may describe the same language.

S → aS 
S → a
S → Sa 
S → a
S → SS 
S → a
S → AS 
S → A 
A → a

We call grammars that generate the same language “weakly equivalent”.

Ambiguity, Parsing

Multiple derivations for one sentence

Consider the grammar:

S → SS 
S → a

These are three derivations of aaa:

S ⇒ SS ⇒ aS ⇒ aSS ⇒ aSa ⇒ aaa
S ⇒ SS ⇒ aS ⇒ aSS ⇒ aaS ⇒ aaa
S ⇒ SS ⇒ Sa ⇒ SSa ⇒ aSa ⇒ aaa

Question: How is the third different from the others?

Parse trees

We can visualize a derivation as a parse tree:

(SEE BOARD)

Ambiguity

A grammar where every sentence corresponds to a unique parse tree is called unambiguous.

If this is not the case, the grammar is called ambiguous.

The grammar

S → SS 
S → a

is thus ambiguous.

Question: Why are ambiguous grammars bad?

Ambiguity and parse trees

Given the (ambiguous) grammar:

S → S-S 
S → 1

The sentence 1-1-1 corresponds to two parse trees:

Ambiguity and semantics

Evaluating these expressions (where ‘-’ denotes subtraction):

Hence, ambiguous grammars lead to ambiguous semantics.

Later, we will also see that ambiguous grammars can cause inefficiency.

Beware: syntax vs. semantics

Who who said ‘-’ performs subtraction?

A language defines which sentences are syntactically correct. Processing these sentences (‘semantics’) is a separate step.

Famous Example: Dangling else

S → if b then S else S
  | if b then S
  | a

Question: How do you parse

if b then if b then a else a?

Exercise 2.17

Ambiguity is a property of grammars

All of these grammars describe the same language:

S → aS 
S → a
S → Sa 
S → a
S → SS 
S → a
S → AS 
S → A 
A → a

Which are ambiguous?

Note: some CF languages have only ambiguous grammars.

Grammar transformations

A grammar transformation is a mapping from one grammar to another, such that the generated language remains the same.

Formally:

A grammar transformation maps a grammar G to another grammar G' such that

LangOf(G) = LangOf(G')

Grammar transformations can help us to transform grammars with undesirable properties (such as ambiguity) into grammars with other (hopefully better) properties.

Most grammar transformations are motivated by facilitating parsing.

Parsing, concrete vs abstract syntax

Formally

Given a grammar G and a string s:

Parse trees in Haskell

Consider this grammar (What is the language? Is it ambiguous?)

S → S-D | D 
D → 0 | 1

The string 1-0-1 parses to

Parse trees in Haskell – contd.

Idea: Let us represent nonterminals as datatypes:

Hence, productions become constructors.

Concrete and abstract syntax

Concrete

S → S-D | D 
D → 0 | 1

Abstract

data S = Minus S D | SingleDigit D 
data D = Zero | One

Both descibe the language.

The string 1-0-1 corresponds to the abstract syntax

Minus (Minus (SingleDigit One) Zero) One

Semantic functions

Concrete

S → S-D | D 
D → 0 | 1

Abstract

data S = Minus S D | SingleDigit D 
data D = Zero | One

A semantic function consumes an abstract syntax. These are very useful for building compilers!

e.g. pretty-printing

printS :: S → String 
printS (Minus s d) = printS s ++ "-" ++ printS s
printS (SingleDigit d) = printD d 

printD :: D → String 
printD Zero = "0" 
printD One = "1"
printS (Minus (Minus (SingleDigit One) Zero) One )

evaluates to 1-0-1

Semantic functions

Concrete

S → S-D | D 
D → 0 | 1

Abstract

data S = Minus S D | SingleDigit D 
data D = Zero | One

A semantic function consumes an abstract syntax. These are very useful for building compilers!

e.g. evaluation

evalS :: S → Int 
evalS (Minus s d) = evalS s - evalD d 
evalS (SingleDigit d) = evalD d 

evalD :: D → Int 
evalD Zero = 0
evalD One = 1
evalS (Minus (Minus (SingleDigit One) Zero) One)

evaluates to 11

Summary

Grammar A way to describe a language inductively.

Production A rewrite rule in a grammar.

Weakly equivalent Grammars that generate the same language.

Context-free The class of grammars/languages we consider.

Nonterminal Auxiliary symbols in a grammar.

Terminal Alphabet symbols in a grammar.

Derivation Successively rewriting from a grammar until we reach a sentence.

Parse Tree Tree representation of a derivation.

Ambiguity Multiple parse trees for the same sentence.

Abstract Syntax (Haskell) Datatype corresponding to a grammar.

Semantics Function defined on the abstract syntax.

Beware: objectmeta language

We use language to talk about other languages.

Summary

Alphabet

A finite set of symbols.

Language

A set of words/sentences, i.e., sequences of symbols from the alphabet.

Grammar

A way to define a language by means of rewrite rules.

Parse tree

A proof that a word is in the language described by a grammar.