Top-Down AST Construction of Regular Expressions with Recursive Descent

mgcadmin 09-13-2025

In my past projects when needed I have always implemented AST building parsers for regular expressions bottom-up by using the shunting yard algorithm. I've covered the process of doing so in my post on Thompsons Construction. While this approach certainly works well enough, it does have a down side. As the suite of operators one wishes to support expands the process of inserting explicit concatentation operators which that approach requires becomes increasingly complex and thus error prone.

It was actually during a particularly tedious debugging session with said procedure that I initially began thinking about using a top-down recursive descent approach to parsing regular expressions instead. In todays post I'm going to be exploring creating an AST for regular expressions using recursive descent to parse the expression.

Parsing Regular Expressions & Abstract Syntax Trees

Strictly speaking, creating an Abstract Syntax Tree is not required to build an NFA from a regular expression. It does however make the process a whole lot easier. For example, once we have our AST built the main loop of thompsons construction can be implemented as a straightforward recursive procedure:

NFA compile(astnode* node) {
       NFA nfa;
       if (node == nullptr) {
            nfa = sentinel;
        } else if (node->type == LITERAL) {
            nfa = makeAtomic(node->c, false);
        } else {
            if (node->c == '@') {
                NFA lhs = compile(node->left);
                NFA rhs = compile(node->right);
                NFA nfa = makeConcat(lhs, rhs);
            } else if (node->c == '|') {
                NFA lhs = compile(node->left);
                NFA rhs = compile(node->right);
                NFA nfa = makeUnion(lhs, rhs);
            } else if (node->c == '*') {
                NFA lhs = compile(node->left);
                NFA nfa = makeClosure(lhs);
            }
        }
        return nfa;
}

Having a concrete AST also expands the possibilities of which type of construction algorithm can be used to build the finite automaton (FA). Aside from the above Thompsons Construction example, Russ Cox's virtual machine approach for NFAs and the Aho, Sethi, Ullman direct DFA algorithm both require explicit ASTs, for example.

The AST Structure

A simple homogenous node structure suits our purposes fine as the only two types of nodes our tree can contain are Literal nodes, which are always leaf nodes, and operator nodes which are always internal nodes.

enum NodeType {
    LITERAL, OPERATOR
};

struct astnode {
    NodeType type;
    char c;
    astnode* left;
    astnode* right;
    astnode(char ch, NodeType t) : type(t), c(ch), left(nullptr), right(nullptr) { }
};

To ensure we build the correct parse tree from a given regular expression, we will of course need to know the precedence and associativity of the operators. For that, we turn to grammars.

Regular Expression Grammar

Regular expressions have a simple grammar with only a few non-terminal symbols, and not too many terminal symbols either which means while we must always be careful with recursive procedures we dont need to excessively worry about the stack size with this particular grammar.

expr := term | (factor) '|' (term)
term := factor | (factor)(term)
factor := v | (v) | v*
v := digit | character

Each non-terminal will have it's own method: expr(), term(), factor(). v is not an actual non-terminal, its just an aid for describing the grammar.

Parsing With Recursive Descent

Parsing a regular expression with recursive descent proceeds just as it would for any other language. Regular expressions can be parsed with a single token of lookahead (LA(1)). As such, I implement the procedure lookahead() which returns the current character being examined to aid us in our parse.

When we determine the current character being read is indeed the character we're expecting we consume the token with the match() procedure. Match() verifies its the correct character and then calls the advance() method internally to move the character position one place forward.

class Parser {
    private:
       string rexpr;
       int pos;
       void advance();
       bool match(char c);
       char lookahead();
       void init(string regex);
       astnode* expr();
       astnode* term();
       astnode* factor();
   public:
       Parser() { }
       astnode* parse(string regex);
};

When avance() is called we check that incrementing the index pointer wont call an invalid index the next time we call lookahead(), advance(), or match().

string rexpr;
int pos;
void advance() {
     if (pos < rexpr.length())
         pos++;
}
bool match(char c) {
     if (c == rexpr[pos]) {
          advance();
          return true;
     }
     return false;
}
char lookahead() {
         return rexpr[pos];
 }
void init(string regex) {
        rexpr = regex;
        pos = 0;
}

With the "machinery" out of the way, lets get to the fun part: the actual parsing. The entry point for the parser is calling the parse() method with the string to be parsed. The parser is then initialized with the supplied string and parsing proceeds by calling expr().

Alternation

The grammar for the <expr> non-terminal tells us that an <expr> is either a <term>, or it's a choice between a <term> and a "different" <expr>. We call the term() option first to make Alternation right-recursive as being left-recursive would cause an infinite loop/stack overflow.

astnode* expr() {
       astnode* t = term();
       if (lookahead() == '|') {
            astnode* n = new astnode('|', 2);
            match('|');
            n->left = t;
            n->right = expr();
            t = n;
        }
        return t;
}

Concatenation

The <term> non-terminal is described by our grammar as being either a <factor>, or a <factor> immediately followed by a <term>. Like expr(), term() is also right-recursive.

In order to create a concat node, the result of our initial call to factor() and the return value of lookahead() should both be Literals or a subexpression.

astnode* term() {
        astnode* t = factor();
        if (lookahead() == '(' || (isdigit(lookahead()) || isalpha(lookahead()))) {
            astnode* n = new astnode('@', 2);
            n->left = t;
            n->right = term();
            t = n;
        }
        return t;
}

Subexpressions, Closures, and Literals

Factor() is a bit more interesting than expr() and term() in that it has to handle disjoint cases, i.e., we might create a literal, and then instead of returning it, create a closure node over that literal node, and return that. This requires paying careful attention to how the control flow statements are ordered, with literals handled first, and then checking if the new lookahead() is a closure operator.

astnode* factor() {
        astnode* t;
        if (lookahead() == '(') {
            match('(');
            t = expr();
            match(')');
        } else if (isdigit(lookahead()) || isalpha(lookahead())) {
            t = new astnode(lookahead(), 1);
            advance();
        }
        if (lookahead() == '*') {
             astnode* n = new astnode(lookahead(), 2);
             match('*');
             n->left = t;
             t = n;
         }
        return t;
}

Testing It out

The best way to test our parser, would obviously be to plug our tree into an NFA compiler, bar doing that we can still check that the tree it produces is correct via a traversal of the tree. I like to use either pre-order or in-order traversals for viewing ASTs.

void print(astnode* t, int d) {
    d++;
    if (t != nullptr) {
        print(t->left, d+1);
        for (int i = 0; i < d; i++) cout<<" ";
        cout<<t->c<<endl;
        print(t->right, d+1);
    }
    d--;
}

int main(int argc, char* argv[]) {
    if (argc < 3) {
        return 0;
    }
    Parser parser;
    astnode* ast = parser.parse(argv[1]);
    print(ast, 1);
    return 0;
}

And the output:

max@thinkpad:~/nfa$ ./ok "((a*b*|ac)d)"
           a
         *
       @
           b
         *
     |
         a
       @
         c
   @
     d

Well, Thats what I've got for today. Whether you decide to parse your regular expressions top down with recursive descent, or bottom up with the shunting yard, Happy Hacking!

MaxGCoding.com

For the love of programming

Top-Down AST Construction of Regular Expressions with Recursive Descent

Parsing Regular Expressions & Abstract Syntax Trees

The AST Structure

Regular Expression Grammar

Parsing With Recursive Descent

Alternation

Concatenation

Subexpressions, Closures, and Literals

Testing It out

Leave A Comment

A Quick tour of MGCLex

Compiling Regular Expressions for "The VM Approach"

Composable Linked Digraphs: An efficient NFA Data Structure for Thompsons Construction

Improving the Space Efficiency of Suffix Arrays

Augmenting B+ Trees For Order Statistics

Top-Down AST Construction of Regular Expressions with Recursive Descent

Balanced Deletion for in-memory B+ Trees

Building an AST from a Regular Expression Bottom-up

The Aho, Sethi, Ullman Direct DFA Construction Part 2: Building the DFA from the Followpos Table

The Aho, Sethi, Ullman Direct DFA Construction, Part 1: Constructing the Followpos Table