Constructing the Firsts Set of a Context Free Grammar

mgcadmin 03-01-2026

When constructing a recognizer or parser from a context free grammar there are some properties of the language which must be calculated from the grammar irregardless of the type of parser being developed. Two such properties which make the automatic generation of a parser from a grammar possible are the 'First' and 'Follow' sets.

First and Follow sets of a small expression grammar

If you've read my previous post on direct construction of a DFA from a regular expression then those sets should sound familiar. We are (basically) performing the same process, except where we had previously been working with the AST representation of an augmented regular language, we are now operating on the BNF form of a context free language. Speaking of familiar, the rest of this post makes heavy use of the data structures introduced in this post on context free grammars, so if you haven't read that yet now would be a good time. In todays post I'm going to talk about my implementation of an algorithm for finding the 'Firsts' set of a CFG.

What Are Firsts Anyway?

A symbol, 'q', is considered a First if for some production 'p', there is a derivation which starts with the symbol 'q'. Put another way, if 'y' is a string of symbols then First('y') is the set of symbols which 'y' can begin with. One thing is for certain: you can't say the name is misleading.

Firsts sets (along with their follow sets) are used in the construction of both top-down and bottom-up parsers, including for languages which are LL(1), SLR(1) and LALR(1) parse tables. Thanks to this utilization use, there has been a large body of research into algorithms for computing them. Any book on compilers, and many, many books on programming language theory include algorithms for their calculation, which we will look to next.

Calculating 'Firsts'

The algorithm presented here is an implementation of that described in "Compilers" By Aho, Sethi, and Ullman[1]. We begin by initializing a new set for each terminal symbol of our grammar. Each terminal symbol is added to it's own first set. Next, we create a new empty set for each of our grammars non-terminal symbols. If any of our non-terminal symbols are nullable, that is, they derive epsilon, then we add epsilon to it's set firsts set.

void  ComputeFirstSets::initFirsts(Grammar& G) {
    //For all terminal symbols, first(t) -> {t}
    for (Symbol t : G.terminals) {
        G.firsts[t] = {t};
    }
    //create a first set for each non terminal, 
    //if the non terminal can derive epsilon, add epsilon to its set.
    for (Symbol nt : G.nonterminals) {
        G.firsts[nt] = set<Symbol>();
        if (G.isNullable(nt)) {
            G.firsts[nt].insert(EPS);
        }
    }
}

The algorithm itsself is rather straight forward. It repeatedly loops over the grammar, building up the sets iteratively over each pass until no more changes are made at which point we are done. I will concede that If you've never encountered an algorithm with free-point iteration ("loop until no more changes occur") it can be a little unsettling at first, though it is a common design pattern in compiler construction.

void ComputeFirstSets::compute(Grammar& G) {
    // Initialize FIRST sets for terminals and non-terminals
    initFirsts(G);
    
    // Iterative algorithm to compute FIRST sets until no changes occur
    bool didchange = false;
    do {
          for (auto prod : G.productions) {
                Symbol X = prod.first; 
                for (Production production : prod.second) {
                      if (propagate(G, X, production.rhs))
                           didchange = true;
                }
          }   
    } while (didchange); 
}

In general, there are two special cases which require extra attention: If a symbol can derive epsilon, and if a given symbol is an action symbol. We wont concern ourselves with action symbols at the moment, just keep in mind that if one is encountered, it is simply ignored by the first and follows calculations.

Epsilon is a different story. If the first symbol of the right hand side of given productuion is epsilon, we add epsilon to the firsts set of the current productions non-terminal. When we call propagate() on X and it is a valid symbol (not epsilon or an action symbol) We do one of two things based on if the symbol is a terminal or non-terminal symbol.

bool ComputeFirstSets::propagate(Grammar& G, Symbol X, SymbolString& production) {
    bool didchange = false;
     // Check if the first symbol of this production is epsilon
    Symbol firstSymbol = production.front();
     if (firstSymbol == EPS) {
        G.firsts[X].insert(EPS);
    } else {
        // If f is a non-terminal
        if (G.firsts.find(firstSymbol) != G.firsts.end()) {
            // Add all symbols from FIRST(firstSymbol) to FIRST(X), except epsilon if f is nullable (FIRST(f) - 'E')
            didchange = updateNonTerminal(G, X, firstSymbol);
        } else if (G.firsts[X].find(firstSymbol) == G.firsts[X].end()) {
            G.firsts[X].insert(firstSymbol);
            didchange = true;  // Mark that a change occurred
        }
    }
    return didchange;
}

If we encounter a terminal symbol we place the first symbol from the right hand side of the production into X's first set.

If the symbol encountered is a non-terminal than we add every symbol of the productions right hand side, with the exception of epsilon if present, into X's set. This is handled by the updateNonTerminal() method:

bool ComputeFirstSets::updateNonTerminal(Grammar& G, Symbol X, Symbol f) {
    bool didchange = false;
    for (auto k : G.firsts[f]) {
        if (G.firsts[X].find(k) == G.firsts[X].end() && k != EPS) {
            G.firsts[X].insert(k);
            didchange = true;  // Mark that a change occurred
        }
    }
    return didchange;
}

A Run Through a Simple Expression Grammar

The following grammar is an LL(1) expression grammar suitable for being consumed by MGCPGen. It has the 4 primary arithmateic operators, along with unary minus, assignment, and less/greater than operators. Using this grammar we can trace the execution of the above algorithms as the firsts sets are computed.

stmt ::= expr
expr ::= assign-expr exprT
exprT ::= TK_ASSIGN assign-expr @@ exprT
exprT ::= #
assign-expr ::= rel-expr assign-exprT
assign-exprT ::= TK_LT rel-expr @@ assign-exprT
assign-exprT ::= TK_GT rel-expr @@ assign-exprT
assign-exprT ::= #
rel-expr ::= term rel-exprT
rel-exprT ::= TK_PLUS term @@ rel-exprT
rel-exprT ::= TK_MINUS term @@ rel-exprT
rel-exprT ::= #
term ::= factor termT
termT ::= TK_MUL factor @@ termT
termT ::= TK_DIV factor @@ termT
termT ::= #
factor ::= TK_MINUS factor @@
factor ::= primary
primary ::= TK_ID @@
primary ::= TK_NUM @@
primary ::= TK_LPAREN expr TK_RPAREN @@

The supplied grammar requires four passes to propagate the changes to all of the sets. I have separated out what each set looks like after each iteration so we can see how they change:

Iteration 1:

FIRST(#): { }
FIRST(assign-expr): { }
FIRST(assign-exprT): { # TK_GT TK_LT }
FIRST(expr): { }
FIRST(exprT): { # TK_ASSIGN }
FIRST(factor): { TK_MINUS }
FIRST(primary): { TK_ID TK_LPAREN TK_NUM }
FIRST(rel-expr): { }
FIRST(rel-exprT): { # TK_MINUS TK_PLUS }
FIRST(stmt): { }
FIRST(term): { TK_MINUS }
FIRST(termT): { # TK_DIV TK_MUL }

Iteration 2:

FIRST(#): { }
FIRST(assign-expr): { }
FIRST(assign-exprT): { # TK_GT TK_LT }
FIRST(expr): { }
FIRST(exprT): { # TK_ASSIGN }
FIRST(factor): { TK_ID TK_LPAREN TK_MINUS TK_NUM }
FIRST(primary): { TK_ID TK_LPAREN TK_NUM }
FIRST(rel-expr): { TK_MINUS }
FIRST(rel-exprT): { # TK_MINUS TK_PLUS }
FIRST(stmt): { }
FIRST(term): { TK_ID TK_LPAREN TK_MINUS TK_NUM }
FIRST(termT): { # TK_DIV TK_MUL }

Iteration 3:

FIRST(#): { }
FIRST(assign-expr): { TK_MINUS }
FIRST(assign-exprT): { # TK_GT TK_LT }
FIRST(expr): { TK_MINUS }
FIRST(exprT): { # TK_ASSIGN }
FIRST(factor): { TK_ID TK_LPAREN TK_MINUS TK_NUM }
FIRST(primary): { TK_ID TK_LPAREN TK_NUM }
FIRST(rel-expr): { TK_ID TK_LPAREN TK_MINUS TK_NUM }
FIRST(rel-exprT): { # TK_MINUS TK_PLUS }
FIRST(stmt): { TK_MINUS }
FIRST(term): { TK_ID TK_LPAREN TK_MINUS TK_NUM }
FIRST(termT): { # TK_DIV TK_MUL }

Iteration 4:

FIRST(#): { }
FIRST(assign-expr): { TK_ID TK_LPAREN TK_MINUS TK_NUM }
FIRST(assign-exprT): { # TK_GT TK_LT }
FIRST(expr): { TK_ID TK_LPAREN TK_MINUS TK_NUM }
FIRST(exprT): { # TK_ASSIGN }
FIRST(factor): { TK_ID TK_LPAREN TK_MINUS TK_NUM }
FIRST(primary): { TK_ID TK_LPAREN TK_NUM }
FIRST(rel-expr): { TK_ID TK_LPAREN TK_MINUS TK_NUM }
FIRST(rel-exprT): { # TK_MINUS TK_PLUS }
FIRST(stmt): { TK_ID TK_LPAREN TK_MINUS TK_NUM }
FIRST(term): { TK_ID TK_LPAREN TK_MINUS TK_NUM }
FIRST(termT): { # TK_DIV TK_MUL }

So there you have it, the calculation of first sets from a context free grammar. Until next time, Happy Hacking!

MaxGCoding.com

For the love of programming

Constructing the Firsts Set of a Context Free Grammar

What Are Firsts Anyway?

Calculating 'Firsts'

A Run Through a Simple Expression Grammar

Further Reading

Leave A Comment

Resolving Shift/Reduce Conflicts With Operator Precedence

Squeezing DFAs with Pair Compression

Designing Abstract Syntax Trees: Homogenous vs. Heterogenous Node Structures

From LR Items to LR States

Calculating Follow Sets of a Context Free Grammar

Streaming Images from ESP32-CAM for viewing on a CYD-esp32

Constructing the Firsts Set of a Context Free Grammar

Inchworm Evaluation, Or Evaluating Prefix-Expressions with a Queue

Data Structures For Representing Context Free Grammar

A B Tree of Binary Search Trees