Lecture 19: Context-Free Grammars

Preface

We wil now introduce the concept of Context-Free Grammars and Context-Free Languages (throughout this document, we may use the abbreviation "CFG" and "CFL" respectively).

So far in the course and in our discussions on formal language theory, we have discussed (speaking broadly) finite state machines and regular languages and the assiociation between the two (that is, FSMs represent some regular language and vise versa). As we venture into the next part of the course, we will be talking about a new type of machine called the Turing Machine. The central difference between the typical FSMs we have been dealing with and Turing Machines is the notion of memory. We won't be talking about that today, but rather, we'll introduce a new form of language (Context-Free Languages which are generated via Context-Free Grammars) and a new form of automata to model it (Pushdown Automata). So in some sense, a CFL is to a Pushdown Automata as a Regular Language is to a Finite State Machine. Interestingly, we'll see that a Pushdown Automata, from the perspective of memory, lies somewhere in between the Finite State Machine and the Turing Machine (and the same comparison can be made from the perspective of the respective languages).


Motivating example

Consider the following expression:

(26+42)2+1 (26+42) * 2+1

Imagine you enter this expression in a calculator on your computer or, say, in Python. How does the computer and Python operator know whether such an expression is syntactically valid?

We might solve this by creating pre-defined templates with fill in the blanks (placeholders) as the ones seen in the popular activity "Mad Libs" (the game where you fill in words for blank spaces representing different parts of speeches).

Like so (with the above example):

\downarrow


Ok... what about arithmetic expressions that don't follow this pattern?

For that, we harness the powers of recursion.

Recursive Mad Libs

Essentially, the main difference here, is whenever we see some placeholders, we can replace it with more placeholders.

Let us introduce the definition of what exactly an arithmetic expression is composed of:

Arithmetic Expressions:
An Arithmetic Expression can be composed of the following elements:

In sum, arithmetic expressions are composed of other expressions connected by operators constricted by a set of rules.

We can generate a CFG for arithmetic expressions as follows:

ExprintExprExpr Op Expr Expr(Expr)Op+OpOp×Op/ \begin{array}{l} \operatorname{Expr} \rightarrow \operatorname{int} \\ \operatorname{Expr} \rightarrow \operatorname{Expr} \text { Op Expr } \\ \operatorname{Expr} \rightarrow(\operatorname{Expr)} \\ O p \rightarrow+ \\ O p \rightarrow- \\ O p \rightarrow \times \\ O p \rightarrow / \\ \end{array}

So recursively speaking, when we see a singular placeholder for an expression, we can replace it with a non-singular (element-wise) expression (this is formally defined as a variant of the "production rule" which is shown below).


A brief note on on "grammars"; grammar vs. language

Before we begin our discussion on CFGs, I'd like to share a quick note. I want to talk about what exactly is a "grammar" and how it relates to a language. We have formally defined languages back in lecture 14, but we never really got around to defining what a "grammar" is which is defined here:

Definition of a Grammar:
A grammar describes how to form strings from a language's alphabet that are valid according to the language's syntax [1][1].

So basically, a grammar is like a manual on how one might form strings that belong to a certain language.

You can think of "grammars" as "language" generators which are modelled using "machines".

Ok... now let us discuss CFGs!

Context-Free Grammars

With the intuition we have just gained (recursive Mad Libs and the notion of grammars), we now introduce context-free grammars.

Context-Free Grammar:
A context-free grammar is a recursive set of rules that define a language.

There are a good amount of requirements and notation we must learn to start building those rules so that is what we will discuss now.

Context-Free Grammar Terminology:

All CFGs are built-up from these four notions. We actually can formally define this mathematically, now that we have the intuition behind us:

Formal Definition of a Context-Free Grammar:
A context-free grammar is defined by 44 tuples as G={V,Σ,S,P}G=\{V, \Sigma, S, P\} where

It is almost analagous to a mathematical expression composed of variables (nonterminals) and all other things (terminals: constants, parantheses, operators, etc.).

A notational shorthand

Traditionally, say we have the following CFG:


we'd represent this mathematically in this format which is much simpler to write out (vertical dividers represent line breaks in the version above):

\downarrow


Some terminology

Derivation:
A sequence of zero or more steps where nonterminals are replaced by the right-hand side of a production is called a derivation.

These are particularly useful when you need to show that a string is in a grammar; simply derive from your start point until you get there.

The Language of a Grammar:
If GG is a CFG with alphabet Σ\Sigma and start symbol S\mathbf{S}, then the language of G\boldsymbol{G} is the set
L(G)={ωΣSω} \mathscr{L}(G)=\left\{\omega \in \Sigma^{*} \mid \mathbf{S} \Rightarrow^{*} \omega\right\}
That is, L(G)\mathscr{L}(G) is the set of strings of terminals derivable from the start symbol.

I think the above definition intuitively makes sense. Be warned though: languages can only consist of strings consisting of only terminals (no non-terminals allowed).

CFGs and regular languages

How are they related?

Every regular language is context-free.

Ok... does this mean RLs == CFLs? Well, we cannot draw that conclusion yet since we have only proved it one way (are all context-free languages also regular languages?).

Actually, in fact, there exist CFGs that generate CFLs that aren't regular languages. As such, we have this heirarchy (Keith calls it the Lava Diagram):


So in some sense, CFGs are more powerful than RegEx's. Why so?

The underlying intuition has to do with memory. Some derivations of strings have "unbounded" memory. Consider the following CFG:

SaSbε \mathrm{S} \rightarrow \mathrm{aSb} \mid \boldsymbol{\varepsilon}

Also see the recursion?!

\downarrow

aaaabbbb \begin{array}{|c|c|c|c|c|c|c|c|} \hline \mathbf{a} & \mathbf{a} & \mathbf{a} & \mathbf{a} & \mathbf{b} & \mathbf{b} & \mathbf{b} & \mathbf{b} \\ \hline \end{array}

\downarrow

L(G)={anbnnN} \mathscr{L}(G)=\left\{\mathbf{a}^{n} \mathbf{b}^{n} \mid n \in \mathbb{N}\right\}

That is our poster child nonregular language!

Designing CFGs

Keith has some words of advice:

I think it is best to just watch/read examples worked out for this. I may jot down some thoughts below (or somewhere else), but the previous sentence still stands.


The bigger picture on grammars, languages, and machines

I like to think of the introduction to the theory of computation as two lines running in parallel to one another. On one line, we have formal language theory. On another, we have mathematical models of machines. More importantly, is that, unlike parallel lines, these two are directly related and help describe one another. A regular language is to a finite state machine, a context-free language is to a pushdown automaton, and so on.

In this lecture in particular, not only did we talk about CFGs and CFLs, we also introduced the concept of "grammars". We have already learned about "regular languages" which, although we didn't denote it at the time, have a grammar cousin called "regular grammars". Overall, a man by the name of Noam Chomsky defined a "hierachy" that relates all of these grammars to machines which is detailed in this table:

 Grammar Accepted  Language Accepted  Automaton  Unrestricted Grammar  Recursively Enumerable Language  Turing Machine  Context Sensitive  Grammar  Context Sensitive Language  Linear Bounded  Automaton  Context Free  Grammar  Context Free Language  Pushdown  Automata  Regular Grammar  Regular Language  Finite State  Automaton  \begin{array}{l|l|l} \text { \textbf{Grammar Accepted} } & \text { \textbf{Language Accepted} } & \text { \textbf{Automaton} } \\ \hline \text { Unrestricted Grammar } & \text { Recursively Enumerable Language } & \text { Turing Machine } \\ \hline \begin{array}{l} \text { Context Sensitive } \\ \text { Grammar } \end{array} & \text { Context Sensitive Language } & \begin{array}{l} \text { Linear Bounded } \\ \text { Automaton } \end{array} \\ \hline \begin{array}{l} \text { Context Free } \\ \text { Grammar } \end{array} & \text { Context Free Language } & \begin{array}{l} \text { Pushdown } \\ \text { Automata } \end{array} \\ \hline \text { Regular Grammar } & \text { Regular Language } & \begin{array}{l} \text { Finite State } \\ \text { Automaton } \end{array} \\ \hline \end{array}


References: