The Heart of Pratt Parsing: Top-Down Operator Precedence

In 1973, Vaughan Pratt published a paper, to little fanfare, describing a method of top-down expression parsing he had devised while implementing the CGOL programming language. It sat quietly forgotten while the academic world did mental gymnastics around LL(1), LR(0), and related automata-based parsing methods.

Fast forward fifty years, and the method, which has come to be known as Pratt parsing, seems to be all the rage these days, in no small part thanks to the efforts of Robert Nystrom and his wonderful book Crafting Interpreters. Another work that helped bring attention to Pratt's method was Douglas Crockford's essay "Top Down Operator Precedence" in the book Beautiful Code, in which he gives some history of Pratt parsing as well as an easy-to-follow implementation in JavaScript.

I enjoyed Nystrom's treatment of the topic, and the implementation given in the C portion of his book is rather well done. Unfortunately, due to design decisions made for the book, it does not produce an AST. Additionally, because the author was focused on single-pass compilation, the implementation is tightly coupled to much of the non-parsing machinery of the example compiler. Crockford's implementation is both thorough and well done, but it relies heavily on dynamic typing and prototype-based objects. To be fair, so did Pratt's original paper, which depended on Lisp instead of JavaScript.

Not being one to critique without contributing, I'd like to use today's post to boil Pratt parsing down to its most basic components and show how truly simple it really is: three mutually recursive procedures that consume a token stream and output an abstract syntax tree. So without further ado, let's get into it.

Grammar, Grammar, Grammar

Seeing as this is a post on parsing expressions, I suppose we're going to need an expression grammar to parse. And seeing as Pratt parsing is generally viewed as the "solution" to recursive descent's recursion "problem", we'll use the same expression grammar as in my post on recursive descent:

<expr> := <term> { ("+"|"-") <term> }
<term> := <factor> { ("*"|"/") <factor> }
<factor> := "-" <factor> | <primary>
<primary> := ID | NUM | "("<expr>")"

Now, I'll be honest: I like recursive descent. I find it versatile, highly adaptable, and very easy to understand. Still, the "problem" with recursive descent that Pratt parsing addresses is not without warrant. When parsing a number or a unary minus using recursive descent, we have to descend through several layers of recursive calls before we arrive at the point where we can consume the token. In the grammar above, it's not that bad. But as Douglas Crockford points out in his essay, a language like JavaScript has something like 18 levels of precedence to deal with, and that certainly IS a lot of recursion. Heck, even C has 15 or so levels of operator precedence! It was this excessive recursion that led early compiler writers to deem the technique impractical given the hardware limitations of the day.

I mentioned that recursive descent is easy to understand for a reason. When it comes to parsing, you quickly learn that the only truly intuitive method is recursive descent. That doesn't mean we should write everything else off. Pratt parsing is also an understandable algorithm, itself something of a rebuttal to automata-based parsers. And once you really see what's going on, I think you'll agree it is every bit as intuitive as recursive descent.

Token Streams & AST Nodes

It is (usually) the job of a parser to consume a stream of tokens and output an abstract syntax tree. At their most basic, tokens are a tuple of a symbol and the string it represents. Producing these token pairs is the job of the lexical analyzer, which is outside the scope of this article. Depicted below are the symbols recognized by the parser, as well as the code for the TokenStream class, which provides the API the parser uses to consume its input.

enum Symbol {
    TK_ID, TK_NUM, TK_LP, TK_RP,
    TK_ADD, TK_MUL, TK_SUB, TK_DIV,
    TK_EOI
};

struct Token {
    Symbol symbol;
    string strval;
    Token(Symbol s = TK_EOI, string st = " ") : symbol(s), strval(st) { }
};

class TokenStream {
    private:
        vector<Token> tokens;
        int tpos;
    public:
        TokenStream(vector<Token> tkns) {
            init(tkns);
        }
        TokenStream() { }
        void init(vector<Token> tkns) {
            tokens = tkns;
            tpos = 0;
        }
        void start() {
            tpos = 0;
        }
        bool done() {
            return tpos >= (int)tokens.size();
        }
        Token& get() {
            return tokens[tpos];
        }
        void advance() {
            tpos++;
        }
        void rewind() {
            if (tpos-1 >= 0)
                tpos--;
        }
};
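The lexer itself is out of scope for this post, but for the curious, a minimal sketch of one for this grammar might look like the following. To keep the snippet self-contained, the Symbol and Token definitions are repeated; note this is my own assumed implementation, not the Lexer class used later in the REPL.

```cpp
#include <cctype>
#include <string>
#include <vector>
using namespace std;

// Repeated from above so this sketch stands alone.
enum Symbol { TK_ID, TK_NUM, TK_LP, TK_RP, TK_ADD, TK_MUL, TK_SUB, TK_DIV, TK_EOI };
struct Token {
    Symbol symbol; string strval;
    Token(Symbol s = TK_EOI, string st = "") : symbol(s), strval(st) { }
};

vector<Token> lex(const string& src) {
    vector<Token> out;
    size_t i = 0;
    while (i < src.size()) {
        char c = src[i];
        if (isspace((unsigned char)c)) { i++; continue; }
        if (isdigit((unsigned char)c)) {            // numeric constant
            size_t j = i;
            while (j < src.size() && isdigit((unsigned char)src[j])) j++;
            out.push_back(Token(TK_NUM, src.substr(i, j - i)));
            i = j; continue;
        }
        if (isalpha((unsigned char)c)) {            // identifier
            size_t j = i;
            while (j < src.size() && isalnum((unsigned char)src[j])) j++;
            out.push_back(Token(TK_ID, src.substr(i, j - i)));
            i = j; continue;
        }
        switch (c) {                                // single-character operators
            case '+': out.push_back(Token(TK_ADD, "+")); break;
            case '-': out.push_back(Token(TK_SUB, "-")); break;
            case '*': out.push_back(Token(TK_MUL, "*")); break;
            case '/': out.push_back(Token(TK_DIV, "/")); break;
            case '(': out.push_back(Token(TK_LP, "(")); break;
            case ')': out.push_back(Token(TK_RP, ")")); break;
        }
        i++;
    }
    out.push_back(Token(TK_EOI, ""));               // end-of-input sentinel
    return out;
}
```

The TK_EOI sentinel at the end matters: the parser's main loop keeps asking for the precedence of the current token, and the sentinel guarantees there is always one to ask about.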

As originally proposed by Pratt, this method is used to parse expressions, though it can easily be extended to handle statements as well. In this post I won't address statements, focusing instead on the different types of expression nodes in the AST:

ID_EXPR nodes for named variables.

CONST_EXPR nodes for numerical constants.

UNOP_EXPR nodes for unary operators.

BINOP_EXPR nodes for binary operators.

We mark each node's type using an enumerated type, which guides later traversals of the AST for semantic analysis, evaluation, code generation, and so on.

enum ExprType {
    ID_EXPR, CONST_EXPR, UNOP_EXPR, BINOP_EXPR
};

struct astnode {
    ExprType type;
    Token token;
    astnode* left;
    astnode* right;
    astnode(ExprType et,Token t) : type(et), token(t), left(nullptr), right(nullptr) { }
};
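To make the node layout concrete, here is the tree for 2+3 built by hand; a minimal sketch, with the type definitions repeated so it stands alone:

```cpp
#include <string>
using namespace std;

// Repeated definitions so this snippet stands alone.
enum Symbol { TK_ID, TK_NUM, TK_LP, TK_RP, TK_ADD, TK_MUL, TK_SUB, TK_DIV, TK_EOI };
enum ExprType { ID_EXPR, CONST_EXPR, UNOP_EXPR, BINOP_EXPR };
struct Token {
    Symbol symbol; string strval;
    Token(Symbol s = TK_EOI, string st = "") : symbol(s), strval(st) { }
};
struct astnode {
    ExprType type; Token token; astnode* left; astnode* right;
    astnode(ExprType et, Token t) : type(et), token(t), left(nullptr), right(nullptr) { }
};

// "2+3": the operator sits at the root, its operands are the children.
astnode* buildTwoPlusThree() {
    astnode* root = new astnode(BINOP_EXPR, Token(TK_ADD, "+"));
    root->left  = new astnode(CONST_EXPR, Token(TK_NUM, "2"));
    root->right = new astnode(CONST_EXPR, Token(TK_NUM, "3"));
    return root;
}
```

Every tree our parser produces is built from exactly these four node shapes; the parser's whole job is to decide which tokens become roots and which become children.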

Okay, now that we know what to expect as input and output, let's get to parsing!

3 Recursive Procedures and a Precedence Table

The crux of Pratt parsing is taking the "push or pop" decision for handling operators from the shunting yard algorithm and shoving it into the tree-rewriting while loop of the tail-recursive descent method, with a special procedure to handle the prefix. That might sound a little crazy now, but it will make perfect sense by the end of this post.

The following three parsing procedures are implemented in a mutually recursive fashion so that expressions are broken down into their prefix and infix portions, or more succinctly, into first and rest components in the fashion of Lisp's CAR and CDR. The getPrec() procedure determines the precedence of the operator currently being parsed.

astnode* parsePrefix(TokenStream& ts);
astnode* parseInfix(astnode* left, TokenStream& ts);
astnode* parseExpression(TokenStream& ts, int prec);

int getPrec(Symbol sym);

The entry point to the parser is the parseExpression() procedure. It is supplied with a reference to the token stream and the precedence of the operator under consideration. When called for the first time, we use the lowest precedence. parseExpression() then calls parsePrefix() to consume the first component of the expression.

Parsing The Prefix

According to our grammar, an expression can start with one of four symbols: a numerical constant, an identifier, a left parenthesis, or the unary minus operator. If the token under consideration is not one of those symbols, then the phrase we are parsing is not an expression. Assuming the input is one of those symbols, we create the appropriate node for it and return it to parseExpression().

Identifiers and numerical constants are terminals, meaning we simply create an ID_EXPR or CONST_EXPR node and assign the token's value to it.

astnode* parsePrefix(TokenStream& ts) {
    astnode* node = nullptr;
    switch (ts.get().symbol) {
        case TK_NUM: {
            Token t = ts.get();
            ts.advance();
            node = new astnode(CONST_EXPR, t);
         } break;
        case TK_ID: {
            Token t = ts.get();
            ts.advance();
            node = new astnode(ID_EXPR, t);
        } break;

A left parenthesis means that we will be processing a subexpression, so we consume the left parenthesis, call parseExpression() recursively, and then consume the right parenthesis upon returning from the recursive call. The result of the recursive call is the return value for the parsed subexpression.

        case TK_LP: {
            ts.advance();
            node = parseExpression(ts, 0);
            if (ts.get().symbol != TK_RP) {
                cout<<"Error: missing closing parenthesis."<<endl;
                node = nullptr;
            } else {
                ts.advance();
            }
        } break;

And lastly, we have the case of our prefix operator, unary minus. Prefix operators are consumed by creating a UNOP_EXPR node for the operator in question and calling parseExpression() with the precedence of the prefix operator as the precedence argument. Because unary minus and binary minus are represented by the same symbol, we do not use getPrec() in this situation. Unary minus has the highest precedence of all our operators, so we simply supply a value we know is higher than the rest. The result of the recursive call is assigned to the UNOP_EXPR node's left child.

        case TK_SUB: {
            Token opTok = ts.get();
            ts.advance();
            astnode* oper = new astnode(UNOP_EXPR, opTok);
            oper->left = parseExpression(ts, 100); // unary minus has the highest precedence
            node = oper;
        } break;
        default: break;
    }
    return node;
}

Once we have parsed the first component of the expression, it is time to handle any infix operators which might be present.

Parsing the Infix

Upon returning from parsePrefix() we have a partially constructed AST, and depending on the prefix consumed, our work may not be done. Earlier in the post I described Pratt parsing as a mashup of the shunting yard algorithm and recursive descent. In the shunting yard algorithm, when we encounter an operator in the token stream, we decide what to do with it based on its precedence: we either tack it onto the expression or push it onto the stack for later. Pratt parsing uses the precedence of an operator to determine whether we should keep making recursive calls to parseExpression() or return the AST we have so far. If you are familiar with how the call stack works, you'll realize we are doing the same thing with the call stack that the shunting yard algorithm does with its explicit operator stack! And the loop body? That happens to use the same tree-rewriting technique employed by tail-recursive descent to create left-associative trees!

astnode* parseExpression(TokenStream& ts, int prec) {
    astnode* left = parsePrefix(ts);
    while (prec < getPrec(ts.get().symbol)) {
        left = parseInfix(left, ts);
    }
    return left;
}

int getPrec(Symbol sym) {
    switch (sym) {
        case TK_ADD:
        case TK_SUB: return 5;
        case TK_MUL:
        case TK_DIV: return 7;
        default: break;
    }
    return 0;
}


parsePrefix() supplied us with the left child of our binary expression, which we then pass to parseInfix(). All of our infix operators share the same associativity, so one procedure can handle every binary operator the same way: we make sure the current token is a binary operator, assign the result of parsePrefix() to the binary operator node's left child, recur on parseExpression() for the right child, and return the result.

astnode* parseInfix(astnode* left, TokenStream& ts) {
    if (isBinOper(ts.get().symbol)) {
        astnode* t = new astnode(BINOP_EXPR, ts.get());
        ts.advance();
        t->left = left;
        t->right = parseExpression(ts, getPrec(t->token.symbol)+1);
        return t;
    }
    return nullptr;
}

bool isBinOper(Symbol symbol) {
    switch (symbol) {
        case TK_ADD:
        case TK_SUB:
        case TK_MUL:
        case TK_DIV:
                    return true;
        default: break;
    }
    return false;
}

And that's it! Not bad, eh? Let's take it for a test drive and make sure it's really doing what we expect.

Parsing Expressions

One of the quickest ways to tell if our parser is working correctly is to plug it into a simple preorder traversal and check whether the resulting trees are built correctly. Formatting the output to reflect the level of recursion during the traversal helps in analyzing the tree's structure.

void preorder(astnode* expr, int d) {
    if (expr != nullptr) {
        for (int i = 0; i < d; i++) cout<<" ";
        printToken(expr->token);
        preorder(expr->left, d + 1);
        preorder(expr->right, d + 1);
    }
}

astnode* parse(TokenStream& ts) {
    ts.start();
    astnode* result = parseExpression(ts, 0);
    return result;
}


void repl() {
    bool running = true;
    string buff;
    Lexer lexer;
    StringBuffer sb;
    while (running) {
        cout<<"pratt> ";
        getline(cin, buff);
        sb.init(buff);
        TokenStream& tokens = lexer.lex(sb);
        astnode* expr = parse(tokens);
        preorder(expr, 0);
        int ans = eval(expr);
        cout<<ans<<endl;
    }
}

int main() {
    repl();
}

And if we want to be extra sure, we can of course evaluate the AST generated by our parser using code like the following:

int eval(astnode* node) {
    int ans = 0;
    if (node != nullptr) {
        switch(node->type) {
            case CONST_EXPR: {
                ans = atoi(node->token.strval.data());
            } break;
            case UNOP_EXPR: {
                ans = eval(node->left);
                ans = -ans;
            } break;
            case BINOP_EXPR: {
                int l = eval(node->left);
                int r = eval(node->right);
                switch(node->token.symbol) {
                    case TK_ADD: ans = l + r; break;
                    case TK_SUB: ans = l - r; break;
                    case TK_DIV: { 
                        if (r == 0) return 0;
                        ans = l / r; 
                    } break;
                    case TK_MUL: ans = l * r; break;
                }
            }
        }
    }
    return ans;
}
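The REPL above relies on a Lexer and StringBuffer that are outside this article's scope. For readers who want something they can compile and run immediately, here is a condensed, self-contained restatement of the parser and evaluator, with the tokens built by hand. It also checks that 8-3-2 parses left-associatively as (8-3)-2, which is exactly what the +1 in parseInfix() buys us.

```cpp
#include <cstdlib>
#include <string>
#include <vector>
using namespace std;

// Condensed, self-contained version of the parser and evaluator above,
// with a stripped-down TokenStream so it compiles on its own.
enum Symbol { TK_ID, TK_NUM, TK_LP, TK_RP, TK_ADD, TK_MUL, TK_SUB, TK_DIV, TK_EOI };
enum ExprType { ID_EXPR, CONST_EXPR, UNOP_EXPR, BINOP_EXPR };
struct Token { Symbol symbol; string strval; };
struct astnode {
    ExprType type; Token token; astnode *left, *right;
    astnode(ExprType et, Token t) : type(et), token(t), left(nullptr), right(nullptr) { }
};
struct TokenStream {
    vector<Token> tokens; size_t tpos = 0;
    Token& get() { return tokens[tpos]; }
    void advance() { if (tpos + 1 < tokens.size()) tpos++; } // stick at TK_EOI
};

int getPrec(Symbol s) {
    switch (s) {
        case TK_ADD: case TK_SUB: return 5;
        case TK_MUL: case TK_DIV: return 7;
        default: return 0; // non-operators never extend the loop
    }
}

astnode* parseExpression(TokenStream& ts, int prec);

astnode* parsePrefix(TokenStream& ts) {
    Token t = ts.get();
    switch (t.symbol) {
        case TK_NUM: ts.advance(); return new astnode(CONST_EXPR, t);
        case TK_ID:  ts.advance(); return new astnode(ID_EXPR, t);
        case TK_LP: {
            ts.advance();
            astnode* node = parseExpression(ts, 0);
            if (ts.get().symbol == TK_RP) ts.advance();
            return node;
        }
        case TK_SUB: { // unary minus binds tighter than any binary operator
            ts.advance();
            astnode* node = new astnode(UNOP_EXPR, t);
            node->left = parseExpression(ts, 100);
            return node;
        }
        default: return nullptr;
    }
}

astnode* parseInfix(astnode* left, TokenStream& ts) {
    astnode* t = new astnode(BINOP_EXPR, ts.get());
    ts.advance();
    t->left = left;
    t->right = parseExpression(ts, getPrec(t->token.symbol) + 1); // +1 => left associative
    return t;
}

astnode* parseExpression(TokenStream& ts, int prec) {
    astnode* left = parsePrefix(ts);
    while (prec < getPrec(ts.get().symbol))
        left = parseInfix(left, ts);
    return left;
}

int eval(astnode* n) {
    if (n == nullptr) return 0;
    switch (n->type) {
        case CONST_EXPR: return atoi(n->token.strval.c_str());
        case UNOP_EXPR:  return -eval(n->left);
        case BINOP_EXPR: {
            int l = eval(n->left), r = eval(n->right);
            switch (n->token.symbol) {
                case TK_ADD: return l + r;
                case TK_SUB: return l - r;
                case TK_MUL: return l * r;
                case TK_DIV: return r == 0 ? 0 : l / r;
                default: return 0;
            }
        }
        default: return 0;
    }
}

// Convenience helper for this sketch: parse and evaluate a hand-built token vector.
int run(vector<Token> toks) {
    toks.push_back({TK_EOI, ""});
    TokenStream ts; ts.tokens = toks;
    return eval(parseExpression(ts, 0));
}
```

The run() helper and hand-built token vectors are conveniences for this sketch; in the full program, the Lexer supplies the tokens.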

A few quick expressions to check everything is working...

pratt> 2*3+4
[TK_ADD, +]     
  [TK_MUL, *]    
    [TK_NUM, 2]   
    [TK_NUM, 3]   
  [TK_NUM, 4]    
10
pratt> -2*3+4
[TK_ADD, +]
  [TK_MUL, *]
    [TK_SUB, -]
     [TK_NUM, 2]
    [TK_NUM, 3]
  [TK_NUM, 4]
-2
pratt>

And there you have it: one of the simplest methods of parsing expressions, in three easy procedures. But is it really that much better than recursive descent?

Direct Comparison: Pratt Vs. Recursive Descent

Pratt parsing is almost always positioned as an alternative to recursive descent, for the reasons I explained above. What I'd like to do now is trace the execution of both algorithms as they parse the same expression, to visualize the difference. Up first is Pratt parsing, which recognizes the sample expression in 11 steps.

Parsing: (2*3+4)
Algorithm: Top Down Operator Precedence (Pratt)
Step (1):  parseExpression (
Step (2):    parsePrefix (           <- recognized left paren
Step (3):    parseExpression 2
Step (4):      parsePrefix 2         <- recognized first term
Step (5):      parseInfix *          <- recognized * operator
Step (6):        parseExpression 3
Step (7):          parsePrefix 3     <- recognized second term
Step (8):      parseInfix +          <- recognized + operator
Step (9):        parseExpression 4
Step (10):         parsePrefix 4     <- recognized third term
Step (11):   parsePrefix )           <- recognized right paren, expression accepted.

Up next, recursive descent, using the tail-recursive method for parsing expressions. Let's see how it handles the same sample expression:

Parsing: (2*3+4)
Algorithm: Tail-Recursive Descent
Step (1):     simpleExpr (
Step (2):       expr (
Step (3):         term (
Step (4):           factor (
Step (5):             var (
Step (6):               primary (                   <-recognize left paren
Step (7):                 simpleExpr 2
Step (8):                   expr 2
Step (9):                     term 2
Step (10):                       factor 2
Step (11):                         var 2
Step (12):                           primary 2      <- recognize first term
Step (13):                           var *
Step (14):                           factor *
Step (15):                         term *              <- recognizes * operator
Step (16):                           factor 3
Step (17):                             var 3
Step (18):                               primary 3       <- recognizes second term
Step (19):                               var +
Step (20):                               factor +
Step (21):                           expr +              <- recognizes operator +
Step (22):                             term 4
Step (23):                               factor 4
Step (24):                                 var 4
Step (25):                                   primary 4    <- recognize third term
Step (26):                                   var )
Step (27):                                   factor ) <- recognizes right paren, accepts expression.

Pratt parsing consumed and recognized the entire expression in 11 steps. Recursive descent took 12 steps just to recognize the first number. I don't think I need to say anything more about that.

Making It Table Driven

Thus far I have kept everything grouped in procedures this way because I wanted to highlight how the algorithm processes its input by breaking it into pieces, represented by three mutually recursive procedures. That said, this organization is for educational purposes. For anything larger than the simple calculator grammar shown above, this is not how you want to organize your parser.

For parsing something like a programming language, each case can (and should) be extracted into a separate method or, as I do below, a function object. This is done not just for the sake of code clarity; it also aids in the conversion from being driven by a switch statement to being driven by a table.

struct PrefixHandler {
    virtual astnode* operator()(TokenStream& ts) = 0;
    virtual astnode* parse(TokenStream& ts) = 0;
};

struct NameOrNumberHandler : PrefixHandler {
    astnode* operator()(TokenStream& ts) {
        astnode* node = nullptr;
        switch (ts.get().symbol) {
            case TK_NUM: {
                Token t = ts.get();
                ts.advance();
                node = makeExprNode(CONST_EXPR, t);
            } break;
            case TK_ID: {
                Token t = ts.get();
                ts.advance();
                node = makeExprNode(ID_EXPR, t);
            } break;
        }
        return node;
    }
    astnode* parse(TokenStream& ts) {
        return (*this)(ts);
    }
};

struct SubExpressionHandler : PrefixHandler {
    astnode* operator()(TokenStream& ts) {
        ts.advance();
        astnode* t = parseExpression(ts, 0);
        if (ts.get().symbol != TK_RP) {
            cout<<"missing close paren."<<endl;
        } else {
            ts.advance();
        }
        return t;
    }
    astnode* parse(TokenStream& ts) {
        return (*this)(ts);
    }
};

struct UnaryOpHandler : PrefixHandler {
    astnode* operator()(TokenStream& ts) {
        Token opTok = ts.get();
        ts.advance();
        astnode* operand = parseExpression(ts, 100); // unary minus has the highest precedence
        astnode* oper = makeExprNode(UNOP_EXPR, opTok);
        oper->left = operand;
        return oper;
    }
    astnode* parse(TokenStream& ts) {
        return (*this)(ts);
    }
};

astnode* parsePrefix(TokenStream& ts) {
    switch (ts.get().symbol) {
        case TK_NUM: return NameOrNumberHandler()(ts);
        case TK_ID:  return NameOrNumberHandler()(ts);
        case TK_LP:  return SubExpressionHandler()(ts);
        case TK_SUB: return UnaryOpHandler()(ts);
        default: break;
    }
    return nullptr;
}

We can remove the switch statement, and while we're at it the entire call to parsePrefix(), by using a hash table to map each input symbol to its corresponding function object.

unordered_map<Symbol, PrefixHandler*> prefixParsers;

prefixParsers[TK_NUM] = new NameOrNumberHandler();
prefixParsers[TK_ID] = new NameOrNumberHandler();
prefixParsers[TK_LP] = new SubExpressionHandler();
prefixParsers[TK_SUB] = new UnaryOpHandler();

Now, in our parseExpression() procedure, we replace the call to parsePrefix() with a lookup of the correct prefix parser in the hash table, keyed by the current token's symbol. This also lets us do our error checking earlier: if the current token is not a valid prefix, we no longer go through the trouble of setting up a function call only to find that out. We know before the call would even be made, allowing us to exit the parse earlier than before.

astnode* parseExpression(TokenStream& ts, int prec) {
    PrefixHandler* pp = prefixParsers[ts.get().symbol];
    if (pp == nullptr) {
        cout<<"Error: invalid prefix."<<endl;
        return nullptr;
    }
    astnode* left = pp->parse(ts);
    while (prec < getPrec(ts.get().symbol)) {
        left = parseInfix(left, ts);
    }
    return left;
}
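To see the whole table-driven design in one place, here is a condensed, self-contained sketch. It substitutes std::function and lambdas for the PrefixHandler class hierarchy, purely for brevity; the dispatch logic is the same.

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

// Same core types as above, repeated so this sketch compiles on its own.
enum Symbol { TK_ID, TK_NUM, TK_LP, TK_RP, TK_ADD, TK_MUL, TK_SUB, TK_DIV, TK_EOI };
enum ExprType { ID_EXPR, CONST_EXPR, UNOP_EXPR, BINOP_EXPR };
struct Token { Symbol symbol; string strval; };
struct astnode {
    ExprType type; Token token; astnode *left, *right;
    astnode(ExprType et, Token t) : type(et), token(t), left(nullptr), right(nullptr) { }
};
struct TokenStream {
    vector<Token> tokens; size_t tpos = 0;
    Token& get() { return tokens[tpos]; }
    void advance() { if (tpos + 1 < tokens.size()) tpos++; }
};

int getPrec(Symbol s) {
    switch (s) {
        case TK_ADD: case TK_SUB: return 5;
        case TK_MUL: case TK_DIV: return 7;
        default: return 0;
    }
}

astnode* parseExpression(TokenStream& ts, int prec);

// The prefix table: one small lambda per legal starting symbol.
using PrefixFn = function<astnode*(TokenStream&)>;
unordered_map<Symbol, PrefixFn> makePrefixTable() {
    unordered_map<Symbol, PrefixFn> t;
    t[TK_NUM] = [](TokenStream& ts) {
        Token tk = ts.get(); ts.advance();
        return new astnode(CONST_EXPR, tk);
    };
    t[TK_ID] = [](TokenStream& ts) {
        Token tk = ts.get(); ts.advance();
        return new astnode(ID_EXPR, tk);
    };
    t[TK_LP] = [](TokenStream& ts) {
        ts.advance();
        astnode* n = parseExpression(ts, 0);
        if (ts.get().symbol == TK_RP) ts.advance();
        return n;
    };
    t[TK_SUB] = [](TokenStream& ts) {
        Token tk = ts.get(); ts.advance();
        astnode* n = new astnode(UNOP_EXPR, tk);
        n->left = parseExpression(ts, 100); // unary minus binds tightest
        return n;
    };
    return t;
}

astnode* parseInfix(astnode* left, TokenStream& ts) {
    astnode* t = new astnode(BINOP_EXPR, ts.get());
    ts.advance();
    t->left = left;
    t->right = parseExpression(ts, getPrec(t->token.symbol) + 1);
    return t;
}

astnode* parseExpression(TokenStream& ts, int prec) {
    static unordered_map<Symbol, PrefixFn> prefixParsers = makePrefixTable();
    auto it = prefixParsers.find(ts.get().symbol);
    if (it == prefixParsers.end()) return nullptr; // invalid prefix
    astnode* left = it->second(ts);
    while (prec < getPrec(ts.get().symbol))
        left = parseInfix(left, ts);
    return left;
}
```

Whether you use lambdas, functors, or plain function pointers for the table entries is a matter of taste; the important part is that adding a new prefix form is now a table entry rather than another switch case.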

We can of course give the same treatment to parseInfix(), reaping the same benefits. That's all I've got for today. I hope this post helped clear up any confusion you may have had about parsing expressions with top-down operator precedence. Until next time, Happy Hacking!

