Lecture 19: Context-Free Grammars

Preface

We wil now introduce the concept of Context-Free Grammars and Context-Free Languages (throughout this document, we may use the abbreviation "CFG" and "CFL" respectively).

So far in the course and in our discussions on formal language theory, we have discussed (speaking broadly) finite state machines and regular languages and the assiociation between the two (that is, FSMs represent some regular language and vise versa). As we venture into the next part of the course, we will be talking about a new type of machine called the Turing Machine. The central difference between the typical FSMs we have been dealing with and Turing Machines is the notion of memory. We won't be talking about that today, but rather, we'll introduce a new form of language (Context-Free Languages which are generated via Context-Free Grammars) and a new form of automata to model it (Pushdown Automata). So in some sense, a CFL is to a Pushdown Automata as a Regular Language is to a Finite State Machine. Interestingly, we'll see that a Pushdown Automata, from the perspective of memory, lies somewhere in between the Finite State Machine and the Turing Machine (and the same comparison can be made from the perspective of the respective languages).

Motivating example

Consider the following expression:

$(26+42) * 2+1$

Imagine you enter this expression in a calculator on your computer or, say, in Python. How does the computer and Python operator know whether such an expression is syntactically valid?

We might solve this by creating pre-defined templates with fill in the blanks (placeholders) as the ones seen in the popular activity "Mad Libs" (the game where you fill in words for blank spaces representing different parts of speeches).

Like so (with the above example):

$\downarrow$

Ok... what about arithmetic expressions that don't follow this pattern?

For that, we harness the powers of recursion.

Recursive Mad Libs

Essentially, the main difference here, is whenever we see some placeholders, we can replace it with more placeholders.

Let us introduce the definition of what exactly an arithmetic expression is composed of:

Arithmetic Expressions:
An Arithmetic Expression can be composed of the following elements:
int - A single number
Expression Operator Expression - Two expressions joined by an operator.
(Expression) - An expression wrapped in parantheses.
In sum, arithmetic expressions are composed of other expressions connected by operators constricted by a set of rules.
We can generate a CFG for arithmetic expressions as follows:
$\begin{array}{l} \operatorname{Expr} \rightarrow \operatorname{int} \\ \operatorname{Expr} \rightarrow \operatorname{Expr} \text { Op Expr } \\ \operatorname{Expr} \rightarrow(\operatorname{Expr)} \\ O p \rightarrow+ \\ O p \rightarrow- \\ O p \rightarrow \times \\ O p \rightarrow / \\ \end{array}$

So recursively speaking, when we see a singular placeholder for an expression, we can replace it with a non-singular (element-wise) expression (this is formally defined as a variant of the "production rule" which is shown below).

A brief note on on "grammars"; grammar vs. language

Before we begin our discussion on CFGs, I'd like to share a quick note. I want to talk about what exactly is a "grammar" and how it relates to a language. We have formally defined languages back in lecture 14, but we never really got around to defining what a "grammar" is which is defined here:

Definition of a Grammar:
A grammar describes how to form strings from a language's alphabet that are valid according to the language's syntax $[1]$ .

So basically, a grammar is like a manual on how one might form strings that belong to a certain language.

You can think of "grammars" as "language" generators which are modelled using "machines".

Ok... now let us discuss CFGs!

Context-Free Grammars

With the intuition we have just gained (recursive Mad Libs and the notion of grammars), we now introduce context-free grammars.

Context-Free Grammar:
A context-free grammar is a recursive set of rules that define a language.

There are a good amount of requirements and notation we must learn to start building those rules so that is what we will discuss now.

Context-Free Grammar Terminology:
Production Rule - example is like the one aboe: "if you see Expr, you can replace it with Expr Op Expr". More generally, a singular production rule is a rule saying how each nonterminal can be replaced by a string of terminals and nonterminals for the language at hand (i.e. $\texttt{Expr} \rightarrow \texttt{Expr Op Expr}$ )... so the rule states you can replace everything on the left hand side with the things on right hand side of the arrow.
Operator - connects expressions; arithmetical examples are +, -, *, /.
Nonterminals - these are parts of the expression which are dynamic placeholders (variables) meaning they are not part of any final input string, but serve to define the structure of the "template".
Terminals - on the other hand, we have terminals which are the final characters used in the string and do not get replaced at some point in the future like nonterminals do. The terminal symbols make up the alphabet of the CFG.

All CFGs are built-up from these four notions. We actually can formally define this mathematically, now that we have the intuition behind us:

Formal Definition of a Context-Free Grammar:
A context-free grammar is defined by $4$ tuples as $G=\{V, \Sigma, S, P\}$ where
$V=$ Set of Non-Terminals (also referred to as variables)
$\Sigma=$ Set of Terminal Symbols
$S=$ Start Symbol
$\mathrm{P}=$ Set of Production Rules

It is almost analagous to a mathematical expression composed of variables (nonterminals) and all other things (terminals: constants, parantheses, operators, etc.).

The point that we should stress here is that we should think of CFGs as "generators" which generate the strings meeting the requirements of a specific CFL.
- We define a "recursive" Mad-Lib-like "template" with a set of production rules (we generally define our production rules given the context of the CFG we want to make).
- Based on these rules and the nonterminals, we can generate the strings for the CFL by filling in the nonterminals with terminals.

A notational shorthand

Traditionally, say we have the following CFG:

we'd represent this mathematically in this format which is much simpler to write out (vertical dividers represent line breaks in the version above):

$\downarrow$

Some terminology

Derivation:
A sequence of zero or more steps where nonterminals are replaced by the right-hand side of a production is called a derivation.
Form - for a $\Rightarrow^*$ b, we say a derives b.
Example - Expr $\Rightarrow^*$ int x (int + int)

These are particularly useful when you need to show that a string is in a grammar; simply derive from your start point until you get there.

The Language of a Grammar:
If $G$ is a CFG with alphabet $\Sigma$ and start symbol $\mathbf{S}$ , then the language of $\boldsymbol{G}$ is the set
$\mathscr{L}(G)=\left\{\omega \in \Sigma^{*} \mid \mathbf{S} \Rightarrow^{*} \omega\right\}$
That is, $\mathscr{L}(G)$ is the set of strings of terminals derivable from the start symbol.

I think the above definition intuitively makes sense. Be warned though: languages can only consist of strings consisting of only terminals (no non-terminals allowed).

CFGs and regular languages

How are they related?

In an intuitive sense, you can represent a regular language as a CFL. Think of RegEx's. You have your atomic base operators ( $*$ $*$ , $\cup$ $\cup$ ) which represent some functionality. Although we don't have those exact operators in CFGs and by extension, CFLs, we can represent the functionality in the context of CFG terminologies.
- As such, we have the theorem:

Every regular language is context-free.

Ok... does this mean RLs $=$ CFLs? Well, we cannot draw that conclusion yet since we have only proved it one way (are all context-free languages also regular languages?).

Actually, in fact, there exist CFGs that generate CFLs that aren't regular languages. As such, we have this heirarchy (Keith calls it the Lava Diagram):

So in some sense, CFGs are more powerful than RegEx's. Why so?

The underlying intuition has to do with memory. Some derivations of strings have "unbounded" memory. Consider the following CFG:

$\mathrm{S} \rightarrow \mathrm{aSb} \mid \boldsymbol{\varepsilon}$

Also see the recursion?!

$\downarrow$

$\begin{array}{|c|c|c|c|c|c|c|c|} \hline \mathbf{a} & \mathbf{a} & \mathbf{a} & \mathbf{a} & \mathbf{b} & \mathbf{b} & \mathbf{b} & \mathbf{b} \\ \hline \end{array}$

$\downarrow$

$\mathscr{L}(G)=\left\{\mathbf{a}^{n} \mathbf{b}^{n} \mid n \in \mathbb{N}\right\}$

That is our poster child nonregular language!

Designing CFGs

Keith has some words of advice:

Think recursively (inductively): Build up bigger structures from smaller ones.
Have a construction plan: Know in what order you will build up the string.
Store information in nonterminals: Have each nonterminal correspond to some useful piece of information.

I think it is best to just watch/read examples worked out for this. I may jot down some thoughts below (or somewhere else), but the previous sentence still stands.

The bigger picture on grammars, languages, and machines

I like to think of the introduction to the theory of computation as two lines running in parallel to one another. On one line, we have formal language theory. On another, we have mathematical models of machines. More importantly, is that, unlike parallel lines, these two are directly related and help describe one another. A regular language is to a finite state machine, a context-free language is to a pushdown automaton, and so on.

In this lecture in particular, not only did we talk about CFGs and CFLs, we also introduced the concept of "grammars". We have already learned about "regular languages" which, although we didn't denote it at the time, have a grammar cousin called "regular grammars". Overall, a man by the name of Noam Chomsky defined a "hierachy" that relates all of these grammars to machines which is detailed in this table:

$\begin{array}{l|l|l} \text { \textbf{Grammar Accepted} } & \text { \textbf{Language Accepted} } & \text { \textbf{Automaton} } \\ \hline \text { Unrestricted Grammar } & \text { Recursively Enumerable Language } & \text { Turing Machine } \\ \hline \begin{array}{l} \text { Context Sensitive } \\ \text { Grammar } \end{array} & \text { Context Sensitive Language } & \begin{array}{l} \text { Linear Bounded } \\ \text { Automaton } \end{array} \\ \hline \begin{array}{l} \text { Context Free } \\ \text { Grammar } \end{array} & \text { Context Free Language } & \begin{array}{l} \text { Pushdown } \\ \text { Automata } \end{array} \\ \hline \text { Regular Grammar } & \text { Regular Language } & \begin{array}{l} \text { Finite State } \\ \text { Automaton } \end{array} \\ \hline \end{array}$

References:

$[1]$ "Formal grammar." Wikipedia.