Parsing Regular Expressions Top-Down

In project past I have always implemented AST building parsers for regular expressions bottom-up by using the shuntyard algorithm, as outlined in my post on Thompsons Construction. As the suite of operators one wishes to support expands the implementation of the explicit concatentation operation which that approach requires becomes increasingly complex and thus error prone, not to mention difficult to understand.

In todays post I'm going to be exploring creating an AST for regular expressions using recursive descent to parse the expression. 

Parsing Regular Expressions

Strictly speaking, creating an Abstract Syntax Tree is not required to build an NFA from a regular expression. It does however make the process a whole lot easier. Once we have our AST, thompsons construction can be implemented as a straightforward recursive procedure:

NFA compile(astnode* node) {
       NFA nfa;
       if (node == nullptr) {
            nfa = sentinel;
        } else if (node->type == LITERAL) {
            nfa = makeAtomic(node->c, false);
        } else {
            if (node->c == '@') {
                NFA lhs = compile(node->left);
                NFA rhs = compile(node->right);
                NFA nfa = makeConcat(lhs, rhs);
            } else if (node->c == '|') {
                NFA lhs = compile(node->left);
                NFA rhs = compile(node->right);
                NFA nfa = makeUnion(lhs, rhs);
            } else if (node->c == '*') {
                NFA lhs = compile(node->left);
                NFA nfa = makeClosure(lhs);
            }
        }
        return nfa;
}

Having a concrete AST also expands the possibilities of which type of construction algorithm can be used to build the finite automaton (FA). Aside from the above Thompsons Construction example, Russ Cox's virtual machine approach for NFAs and the Aho, Sethi, Ullman direct DFA algorithm both require explicit ASTs, for example.

The AST Structure 

 A simple homogenous node structure suits our purposes fine as the only two types of nodes our tree can contain are Literal nodes, which are always leaf nodes, and operator nodes which are always internal nodes.

enum NodeType {
    LITERAL, OPERATOR
};

struct astnode {
    NodeType type;
    char c;
    astnode* left;
    astnode* right;
    astnode(char ch, NodeType t) : type(t), c(ch), left(nullptr), right(nullptr) { }
};

To ensure we build the correct parse tree from a given regular expression, we will of course need to know the precedence and associativity of the operators. For that, we turn to grammars.

Regular Expression Grammar

Regular expressions have a simple grammar with only a few non-terminal symbols, and not too many terminal symbols either which means while we must always be careful with recursive procedures we dont need to excessively worry about the stack size with this particular grammar.

expr := term | (factor) '|' (term)
term := factor | (factor)(term)
factor := v | (v) | v*
v := digit | character

Each non-terminal will have it's own method: expr(), term(), factor(). v is not an actual non-terminal, its just an aid for describing the grammar.

Parsing With Recursive Descent

Parsing a regular expression with recursive descent proceeds just as it would for any other language. Regular expressions can be parsed with a single token of lookahead (LA(1)). As such, I implement the procedure lookahead() which returns the current character being examined to aid us in our parse.

When we determine the current character being read is indeed the character we're expecting we consume the token with the match() procedure. Match() verifies its the correct character and then calls the advance() method internally to move the character position one place forward.

class Parser {
    private:
       string rexpr;
       int pos;
       void advance();
       bool match(char c);
       char lookahead();
       void init(string regex);
       astnode* expr();
       astnode* term();
       astnode* factor();
   public:
       Parser() { }
       astnode* parse(string regex);
};

When avance() is called we check that incrementing the index pointer wont call an invalid index the next time we call lookahead(), advance(), or match().

string rexpr;
int pos;
void advance() {
     if (pos < rexpr.length())
         pos++;
}
bool match(char c) {
     if (c == rexpr[pos]) {
          advance();
          return true;
     }
     return false;
}
char lookahead() {
         return rexpr[pos];
 }
void init(string regex) {
        rexpr = regex;
        pos = 0;
}

With the "machinery" out of the way, lets get to the fun part: the actual parsing. The entry point for the parser is calling the parse() method with the string to be parsed. The parser is then initialized with the supplied string and parsing proceeds by calling expr().

Alternation

The grammar for the <expr> non-terminal tells us that an <expr> is either a <term>, or it's a choice between a <term> and a "different" <expr>. We call the term() option first to make Alternation right-recursive as being left-recursive would cause an infinite loop/stack overflow.

astnode* expr() {
       astnode* t = term();
       if (lookahead() == '|') {
            astnode* n = new astnode('|', 2);
            match('|');
            n->left = t;
            n->right = expr();
            t = n;
        }
        return t;
}

Concatenation

The <term> non-terminal is described by our grammar as being either a <factor>, or a <factor> immediately followed by a <term>. Like expr(), term() is also right-recursive.

In order to create a concat node, the result of our initial call to factor() and the return value of lookahead() should both be Literals or a subexpression.

astnode* term() {
        astnode* t = factor();
        if (lookahead() == '(' || (isdigit(lookahead()) || isalpha(lookahead()))) {
            astnode* n = new astnode('@', 2);
            n->left = t;
            n->right = term();
            t = n;
        }
        return t;
}

Subexpressions, Closures, and Literals

Factor() is a bit more interesting than expr() and term() in that it has to handle disjoint cases, i.e., we might create a literal, and then instead of returning it, create a closure node over that literal node, and return that. This requires paying careful attention to how the control flow statements are ordered, with literals handled first, and then checking if the new lookahead() is a closure operator.

astnode* factor() {
        astnode* t;
        if (lookahead() == '(') {
            match('(');
            t = expr();
            match(')');
        } else if (isdigit(lookahead()) || isalpha(lookahead())) {
            t = new astnode(lookahead(), 1);
            advance();
        }
        if (lookahead() == '*') {
             astnode* n = new astnode(lookahead(), 2);
             match('*');
             n->left = t;
             t = n;
         }
        return t;
}

 


Leave A Comment