Constructing the Firsts Set of a Context Free Grammar
When constructing a recognizer or parser from a context free grammar there are some properties of the language which must be calculated from the grammar irregardless of the type of parser being developed. Two such properties which make the automatic generation of a parser from a grammar possible are the 'First' and 'Follow' sets.
First and Follow sets of a small expression grammar
If you've read my previous post on direct construction of a DFA from a regular expression then those sets should sound familiar. We are (basically) performing the same process, except where we had previously been working with the AST representation of an augmented regular language, we are now operating on the BNF form of a context free language. Speaking of familiar, the rest of this post makes heavy use of the data structures introduced in this post on context free grammars, so if you haven't read that yet now would be a good time. In todays post I'm going to talk about my implementation of an algorithm for finding the 'Firsts' set of a CFG.
What Are Firsts Anyway?
A symbol, 'q', is considered a First if for some production 'p', there is a derivation which starts with the symbol 'q'. Put another way, if 'y' is a string of symbols then First('y') is the set of symbols which 'y' can begin with. One thing is for certain: you can't say the name is misleading.
Firsts sets are used in the construction of both top-down and bottom-up parsers, including for languages which are LL(1), SLR(1) and LALR(1) parse tables. Thanks to there prolific use, there has been a large body of research into algorithms for computing them.
Calculating 'Firsts'
We begin by initializing a new set for each terminal symbol of our grammar. Each terminal symbol is added to it's own first set. Next, we create a new empty set for each of our grammars non-terminal symbols. If any of our non-terminal symbols are nullable, that is, they derive epsilon, then we add epsilon to it's set firsts set.
void ComputeFirstSets::initFirsts(Grammar& G) {
//For all terminal symbols, first(t) -> {t}
for (Symbol t : G.terminals) {
G.firsts[t] = {t};
}
//create a first set for each non terminal,
//if the non terminal can derive epsilon, add epsilon to its set.
for (Symbol nt : G.nonterminals) {
G.firsts[nt] = set<Symbol>();
if (G.isNullable(nt)) {
G.firsts[nt].insert(EPS);
}
}
}
The algorithm itsself is rather straight forward. It repeatedly loops over the grammar, building up the sets iteratively over each pass until no more changes are made at which point we are done. I will concede that If you've never encountered an algorithm with free-point iteration ("loop until no more changes occur") it can be a little unsettling at first, though it is a common design pattern in compiler construction.
void ComputeFirstSets::compute(Grammar& G) {
// Initialize FIRST sets for terminals and non-terminals
initFirsts(G);
// Iterative algorithm to compute FIRST sets until no changes occur
bool didchange = false;
do {
for (auto prod : G.productions) {
Symbol X = prod.first;
for (Production production : prod.second) {
if (propagate(G, X, production.rhs))
didchange = true;
}
}
} while (didchange);
}
In general, there are two special cases which require extra attention: If a symbol can derive epsilon, and if a given symbol is an action symbol. We wont concern ourselves with action symbols at the moment, just keep in mind that if one is encountered, it is simply ignored by the first and follows calculations.
Epsilon is a different story. If the first symbol of the right hand side of given productuion is epsilon, we add epsilon to the firsts set of the current productions non-terminal. When we call propagate() on X and it is a valid symbol (not epsilon or an action symbol) We do one of two things based on if the symbol is a terminal or non-terminal symbol.
bool ComputeFirstSets::propagate(Grammar& G, Symbol X, SymbolString& production) {
bool didchange = false;
// Check if the first symbol of this production is epsilon
Symbol firstSymbol = production.front();
if (firstSymbol == EPS) {
G.firsts[X].insert(EPS);
} else {
// If f is a non-terminal
if (G.firsts.find(firstSymbol) != G.firsts.end()) {
// Add all symbols from FIRST(firstSymbol) to FIRST(X), except epsilon if f is nullable (FIRST(f) - 'E')
didchange = updateNonTerminal(G, X, firstSymbol);
} else if (G.firsts[X].find(firstSymbol) == G.firsts[X].end()) {
G.firsts[X].insert(firstSymbol);
didchange = true; // Mark that a change occurred
}
}
return didchange;
}
If we encounter a terminal symbol we place the first symbol from the right hand side of the production into X's first set.
If the symbol encountered is a non-terminal than we add every symbol of the productions right hand side, with the exception of epsilon if present, into X's set. This is handled by the updateNonTerminal() method:
bool ComputeFirstSets::updateNonTerminal(Grammar& G, Symbol X, Symbol f) {
bool didchange = false;
for (auto k : G.firsts[f]) {
if (G.firsts[X].find(k) == G.firsts[X].end() && k != EPS) {
G.firsts[X].insert(k);
didchange = true; // Mark that a change occurred
}
}
return didchange;
}
-
Constructing the Firsts Set of a Context Free Grammar
-
Inchworm Evaluation, Or Evaluating Prefix-Expressions with a Queue
-
Data Structures For Representing Context Free Grammar
-
A B Tree of Binary Search Trees
-
Implementing enhanced for loops in Bytecode
-
Top-Down Deletion for Red/Black Trees
-
Function Closures For Bytecode VMs: Heap Allocated Activation Records & Access Links
-
Pascal & Bernoulli & Floyd: Triangles
-
A Quick tour of MGCLex
-
Compiling Regular Expressions for "The VM Approach"
Leave A Comment