Data Structures For Representing Context Free Grammar

A context free grammar is defined as a 4-tuple made up of a start symbol, sets, mainly the set of Terminal symbols, the set of Non-Terminal symbols, and the set of productions for deriving the language specified by the grammar. From this 4-tuple additional functions/sets can be computed from the grammar which is used for generating recognizers/parsers for said grammar.

Depending on what type/what it is you want to do with your grammar, some useful properties computed from the context free grammar include the "First" Set and the "Follow" set for the language in question. Computing these are needed for creating both LL and LR based parsers. 

struct Grammar {
    Symbol start;
    set<Symbol> terminals;
    set<Symbol> nonterminals;
    map<Symbol, ProductionSet> productions;
    map<Symbol, set<Symbol>> firsts;
    map<Symbol, set<Symbol>> follow;
    Grammar() {
        
    }
    bool isNonTerminal(Symbol s) {
        return nonterminals.find(s) != nonterminals.end();
    }
    bool isTerminal(Symbol s) {
        return terminals.find(s) != terminals.end();
    }
    bool isNullable(Symbol nt) {
        for (auto prod : productions[nt]) {
            if (prod.rhs[0] == EPS)
                return true;
        }
        return false;
    }
};

Todays post is inspired by the development of mgcpgen, a parser generator designed to work with mgclex, my lexer generator. In it I'm going to discuss some the data structures used by mgcpgen (and CFG's in general) in order to go from an LL(1) grammar specification, to a working parser outputing abstract syntax trees. Lets do it.

Symbolic Computation Requires Symbols

I'm not trying to state the obvious, but to perform symbolic computation, we will in fact need a way to represent the symbols. Common choices include enumerated types, integers, characters, and of course character strings.

using Symbol = string;

Using a typedef allows us to easily change our representation should we decide to switch. 

Representing Strings of Symbols

Just as character strings ase comprised of lists of characters, so SymbolStrings are made up of lists of symbols. SymbolStrings contain the list of terminal and non terminal symbols which make up a multi-symbol production.

struct SymbolString : vector<Symbol> {
    SymbolString(vector<Symbol> ss) {
        for (auto m : ss) {
            this->push_back(m);
        }
    }
    SymbolString() {

    }
    SymbolString(const SymbolString& ss) {
        for (auto m : ss) {
            this->push_back(m);
        }
    }
    SymbolString subString(int startIndex) {
        SymbolString result;
        for (int i = startIndex; i < this->size(); i++)
            result.push_back((*this)[i]);
        return result;
    }
    SymbolString& operator=(const SymbolString& ss) {
        if (this != &ss) {
            for (auto m : ss) {
                this->push_back(m);
            }
        }
        return *this;
    }
    string toString() {
        string str = "";
        for (Symbol s : *this) {
            str += s + " ";
        }
        return str;
    }
    bool operator==(const SymbolString& ss) {
        auto oit = ss.begin();
        auto it = this->begin();
        while (oit != ss.end() && it != this->end()) {
            if (*oit != *it)
                return false;
        }
        return (oit == ss.end() && it == this->end());
    }
    bool operator!=(const SymbolString& ss) {
        return !(*this == ss);
    }
};

Productions and Sets of Productions

Productions are the "rules" which get applied to the associated non-terminal. 

struct Production {
    int pid; //for executing Actions
    Symbol lhs;
    SymbolString rhs;
    Production(int id, Symbol l, SymbolString r) : pid(id), lhs(l), rhs(r) { }
    Production() { }
    string toString() {
        return lhs + " ::= " + rhs.toString(); 
    }
    bool operator==(const Production& op) {
        return lhs == op.lhs && rhs == op.rhs;
    }
    bool operator!=(const Production& op) {
        return !(*this==op);
    }
    bool operator<(const Production& op) const {
        return this->lhs < op.lhs;
    }
};

struct ProductionSet : vector<Production> {
    ProductionSet(vector<Production> rhs) {
        for (Production ss : rhs) {
            push_back(ss);
        }
    }
    ProductionSet() {

    }
};

A Grammar With Expressions & Statements

As an example of pulling everything together into a full grammar, the follow demonstrates how the above data structures are combined to form our grammar. As one can imagine, automating the construction of the Grammar data structure from a specification file is highly encouraged.

    Grammar G;
    G.nonterminals = { "stmt-seq", "stmt-seqT", "stmt", "print-stmt", "print-stmtT", "expr-stmt", "expr", "exprT", "term", "termT", "factor", "primary"};
    G.terminals = { "TK_ID", "TK_NUM", "TK_LPAREN", "TK_RPAREN", "TK_PRINT", "TK_PLUS", "TK_MUL","TK_MINUS", "TK_DIV", "TK_SEMI", "$$", EPS };
    G.productions["stmt-seq"] =     ProductionSet({Production(1,"stmt-seq",SymbolString({"stmt", "stmt-seqT"}))});

    G.productions["stmt-seqT"] =    ProductionSet({Production(2,"stmt-seqT",SymbolString({ "TK_SEMI", "stmt-seq"})), 
                                                   Production(3,"stmt-seqT",SymbolString({EPS}))});

    G.productions["stmt"] =         ProductionSet({Production(4,"stmt",SymbolString({"expr-stmt"})), 
                                                   Production(5,"stmt",SymbolString({"print-stmt"})),
                                                Production(22, "stmt", SymbolString({EPS}))});

    G.productions["print-stmt"] =   ProductionSet({Production(6,"print-stmt",SymbolString({"TK_PRINT", "print-stmtT"}))});
    G.productions["print-stmtT"] = ProductionSet({Production(7,"print-stmtT",SymbolString({"expr"}))});
    G.productions["expr-stmt"] =    ProductionSet({Production(8,"expr-stmt",SymbolString({"expr"}))});
    
    G.productions["expr"] =         ProductionSet({Production(9,"expr",SymbolString({"term", "exprT"}))});
    
    G.productions["exprT"] =        ProductionSet({Production(10,"exprT",SymbolString({"TK_PLUS", "term", "exprT"})), 
                                                   Production(11,"exprT",SymbolString({"TK_MINUS", "term", "exprT"})), 
                                                   Production(12,"exprT",SymbolString({EPS}))});

    G.productions["term"] =         ProductionSet({Production(13,"term",SymbolString({"factor","termT"}))});
    
    G.productions["termT"] =        ProductionSet({Production(14,"termT",SymbolString({"TK_MUL", "factor", "termT"})), 
                                                   Production(15,"termT",SymbolString({"TK_DIV", "factor", "termT"})), 
                                                   Production(16,"termT",SymbolString({EPS}))});
    G.productions["factor"] =       ProductionSet({Production(17, "factor", SymbolString({"TK_MINUS", "factor"})),
                                                   Production(18, "factor", SymbolString({"primary"}))});
    G.productions["primary"] =       ProductionSet({Production(19,"primary",SymbolString({"TK_NUM"})), 
                                                   Production(20,"primary",SymbolString({"TK_ID"})), 
                                                   Production(21,"primary",SymbolString({"TK_LPAREN", "expr", "TK_RPAREN"}))});


Leave A Comment