The Shunting Yard Algorithm: Simple Bottom-up Expression Parsing

In the late 1960s and continuing through the 1970s there was a flurry of research activity into the theory of compilation in which many of the algorithms, patterns, and techniques used to implement compilers and interpreters that we use today were developed. Of practical concern at the time was the efficient parsing of expressions for translation to machine code. Working within the constraints of computing hardware at the time, holding the entire source text + the compiler + the compiler output in memory was impractical, if not outright impossible at the time. Thus, much research was done into correctly parsing expressions with as little (or no) context as possible: no small feat.

One of the early products of this research was "the shunting yard" algorithm comes to us from the work of Edsgar Dijkstra. Dijkstra was working on implementing one of the many early compilers for Algol 60 which he contributed to . The algorithm takes a string representing an infix expression as input and would output a valid postfix expression. Postfix expressions, also called reverse polish notation (rpn) was a common way of interacting with early desk calculators, as postfix expressions are easier to compute using a stack. You may recall several months back I covered his 2 stack algorithm for infix expression evaluation. This time we only require one stack for evaluation. Progress!

Despite being a relatively straight forward algorithm, it does have subtle nuances that must be accounted for in order to produce the correct output. So, without further ado, lets take a look.

The shunting yard algorithm

The algorithm is named for a type of train switch, like actual trains (choo choo!). In said switching yard, box cars are de-coupled from the train, placed on a differnt track called a "siding" (their putting it to to the side, get it?), and then either re-coupled in a different position later in the train, or to a different train altogether. Or just dropped off. but that's not the important part. This is a very good allegory, as it explains - generally - how the algorithm works:

  • Iterate over the string from left to right, if the token being examined is:
  •    1)  a number, we append it to the output string.
  •    2)  an operator, or left parentheses, then it is placed on the operator stack.
  • If we encounter a right parentheses we pop the operators off of the stack and append them to the output string until encountering its matching left parentheses, at which point we discard both parentheses and continue.
  • Once the string is consumed, append what ever operators remain on the stack to the end of the string.

This would be all there is to it if it were not for one slight caveat: in our case, some of our theoretical "box cars" (operators) have a higher priority than others when it comes to being put back in place(operator precedence). We're going to have to do something about that. This is accomplished by making a slight tweak to rule #2 regarding encountering an operator. Instead of just placing every operator directly on the operator stack, we will first check the precedence level of the newly encountered operator and compare it to the operator on the top of the stack. Which ever operator has the higher precedence gets appended to the output string, and the operator with lower precedence gets placed on to the operator stack. 

string shuntingYard(string expr) {
    stack<char> ops;
    string output;
    for (char c : expr) {
        if (c == '(')
            ops.push(c);
        else if (c == '+' || c == '-' || c == '*' || c == '/' || c == '^') {
                if (precedence(c) < precedence(ops.top()) || (precedence(c) == precedence(ops.top()) && leftAssociates(c))) {
                    output.push_back(' ');
                    output.push_back(ops.pop());
                    output.push_back(' ');
                    ops.push(c);
                } else {
                    ops.push(c);
                    output.push_back(' ');
                }
        } else if (c == ')') {
            while (!ops.empty())
                if (ops.top() != '(')
                    output.push_back(ops.pop());
                else {
                    ops.pop();
                    break;
                }
        } else output.push_back(c);
    }
    while (!ops.empty()) 
        if (ops.top() != '(')
            output.push_back(ops.pop());
        else ops.pop();
    return output;
} 

Thankfully, we can determine the precedence of an operator with a simple O(1) table lookup, so it doesn't give us the same kind of performance hit that you get from precedence climbing with recursive descent. In this case I use a simple switch statement to match the precedence to operator, as well as the associativity of said operator, though you can also use a hash table, array lookup, or whatever look-up structure you prefer, just keep in mind it should be an O(1) look up operation.

bool leftAssociates(char c) {
    switch (c) {
        case '^': return false;
        case '*': return true;
        case '/': return true;
        case '+': return true;
        case '-': return true;
        default:
            break;
    }
    return false;
}

int precedence(char c) {
    switch (c) {
        case '^': return 7;
        case '*'    : return 5;
        case '/': return 5;
        case '+': return 3;
        case '-': return 3;
        default:
            break;
    }
    return 0;
} 

Testing the algorithm we can see that our expressions are being correctly converted to their postfix forms:

max@MaxGorenLaptop:~/shunting$ ./sy
13+9/7 -> 13 9 7 / +
65/((3/4)5) -> 6 5 * 3 4 / 5 /
(306+4213)/(1-5) -> 306 42 13 + 1 5 - /
3+42/(1-5)^2^3 -> 3 4 2 * 1 5 - 2 3 ^ ^ / +

Ok, so now that we have our infix expressions translated to postfix expressions, how do we evaluate them? I'm glad you asked, because we have a number of options open to us:

1) We can evaluate the strings themselves directly using a stack,

2) we can convert the string to a collection of tokens and evaluate that with a stack (better, not by much), or

3) we can convert them to an abstract syntax tree suitable for further processing in a compiler/interpreter or be used for evaluation.

I mean, I know which one I would choose...

Lexical Analysis, A detour.

You could certainly take the string returned by the call to shuntingYard() and evaluate it without performing this step. Alternatively, you could choose to perform this step first, and apply the shunting yard algorithm to the output of this step. Regardless of the order in which you perform the steps, separating the tokenization from the evaluation not only increases the readability of the code, it will also ease later additions to the algorithm.

Considering we're only dealing with two types of tokens right now, numbers and operators, "lexing" our instructions is very simple. We will be returning a vector of 'Tokens', which are a simple tagged union with a tag to represent if the token is a number token or an operator token, with the type value populated automatically by the constructor.

struct Token {
    int type;
    union {
        double num;
        char op;
    };
    Token(double n) : type(0), num(n)  { }
    Token(char c) : type(1), op(c) { }
}; 

The only thing our lexer need do is group digits together to create number tokens, and recognize operator characters to create operator tokens. This can be done in a single pass over the string, and so is O(n) where n is the number of characters in the expression string.

 vector<Token> lex(string pfExpr) {
    vector<Token> tokens;
    for (int i = 0; i < pfExpr.size(); i++) {
        char c = pfExpr[i];
        if (isdigit(c)) {
            string num;
            do {
                num.push_back(c);
                c = pfExpr[++i];
            } while (i < pfExpr.size() && isdigit(c));
            tokens.push_back(Token(stof(num)));
            i--;
            continue;
        } else if (c == '+' || c == '-' || c == '*' || c == '/' || c == '^')
            tokens.push_back(Token(c));
    }
    return tokens;
} 

As it stands now, the string is used as input to shuntingYard(), which returns a postfix expression as a string. This string is consumed by lex() which returns a vector of tokens. I guess at this point we should put together a quick driver program to test what we've done so far. A simple event loop serves as a basic REPL:

void repl() {
    string input;
    while (true) {
        cout<<"calc> ";
        getline(cin, input);
        if (input == "exit")
            return 0;
        vector<Token> tokens = lex(shuntingYard(input));
        for (Token token : tokens) {
            if (token.type == 0) cout<<"NUM, "<<token.num<<endl;
            if (token.type == 1) cout<<" OP, "<<token.op<<endl;
         }
    }
    return 0;
} 

Using the above to test the output from lex() should produce the output shown below. Because we call shuntingYard() on the string and then lex the transformed postfix expression, the output is suitable for evaluation by a stack.

max@MaxGorenLaptop:~/shunting$   ./expr
calc> 2+2
NUM, 2
NUM, 2
OP, +
calc> 3+4/2*16-4
NUM, 3
NUM, 4
NUM, 2
OP, /
NUM, 16
OP, *
NUM, 4
OP, -
OP, +
calc>

Evaluation Method One: Stack Based Evaluation

This is by far the simplest method of evaluating mathematical expressions. The output shown above that is generated by our lex() procedure can be viewed as a list of instructions for a finite state machine. The state machine is very simple, containing "code page" which is a list of instructions to be executed. It also has a stack that serves as its only form of memory. There is a program counter which is the index of the current instruction being executed. If the current instruction is 'NUM', then we push the associated number on to the stack. If the instruction is OP we pop the top two items from the stack, perform the supplied operation on them, and store the result on the stack. We continue in this way until there are no more instructions, at which point our expression will have been evaluated. I suggest working it through by hand on paper a few times, as it will give a deeper understanding of what's happening. If you're using std::stack, you either need to overload pop() to return the item being removed, or alter the following code to perform a top()/pop() pair of calls.

 double eval(vector<Token> codePage) {
    stack<double> es;
    for (int pc = 0; pc < codePage.size(); pc++) {
        Token token = codePage[pc];
        if (token.type == 0) {
            es.push(token.num);
        }
        if (token.type == 1) {
            char c = token.op;
            double rhs = es.pop();
            double lhs = es.pop();
            cout<<lhs<<" "<<c<<" "<<rhs<<endl;
            switch (c) {
                case '+': 
                    es.push(lhs+rhs); 
                    break;
                case '-': 
                    es.push(lhs-rhs); 
                    break;
                case '*': 
                    es.push(lhs*rhs); 
                    break;
                case '/': 
                    es.push(lhs/rhs); 
                    break;
                case '^': 
                    es.push(pow(lhs,rhs)); 
                    break;
                default:
                    break;
            }
        } 
    }
    return es.pop();
} 

And of course a minor change to our driver:

void repl() { 
    string input;
    while (true) {
        cout<<"calc> ";
        getline(cin, input);
        if (input == "exit")
            return 0;
        cout<<eval(lex(shuntingYard(input)))<<endl;
    } 
}



max@MaxGorenLaptop:~/shunting$ ./expr
calc> 3+3
3 + 3
6
calc> (2+3*4)
3 * 4
2 + 12
14
calc> 3+4/2*16-4
4 / 2
2 * 16
32 - 4
3 + 28
31
calc>

If all you're after is a simple calculator, the above implementation will do the job just fine. It is also possible however, and not much more difficult at that, to use the output of our lex() procedure to build an abstract syntax tree, which allows us to do much more with the same input.

Evaluation Method Two: Expression Trees

The second evaluation method I'm going to cover in this post is the use of Expression Trees. Expression tree's are a special case of Abstract Syntax Tree, and are evaluated in the same way. But before we can evaluate an expression tree, we have to build it. Building an expression tree from a postfix expression is performed bottom up using a stack, once again using the vector of tokens provided by lex() as our input. The output of our procedure will be a binary tree, with internal nodes being operators and leaf nodes being values. The nodes them selves will simply store the tokens.

struct node {
    Token token;
    node* left;
    node* right;
    node(Token tok) : token(tok), left(nullptr), right(nullptr) { }
}; 

The algorithm begins with an initially empty stack, and our vector of tokens. Iterating over the tokens, we check the token type, and if it is a number token, we make a new node for that token and push it onto the stack. If on the other hand the token is a op token, we make a new node for the op, and pop the first two nodes of the stack, setting them as the op nodes left and right child respectively. We then push the op node on to the stack. This continues until we have consumed all of the tokens. At then end, the token should contain the root node of our expression tree, which we return as the result.

node* buildTree(vector<Token> tokens) {
    stack<node*> sf;
    for (int i = 0; i < tokens.size(); i++) {
        Token c = tokens[i];
        if (c.type == 1) {
            node* x = new node(c);
            x->right = sf.pop();
            x->left = sf.pop();
            sf.push(x);
            cout<<"Push: "<<sf.top()->token.op<<endl;
        } else
        if (c.type == 0) {
            sf.push(new node(c));
            cout<<"Push: "<<c.num<<endl;
            continue;
        }
    }
    return sf.top();
}

Evaluation of the resultant tree is done by performing a post-order depth first traversal of the expression tree. This is done with a call to our new eval() procedure, which expects the root of the expression tree as its parameter. If the nodes token type is a number, it return its value. If it is an op token, it first performs the post order traversal to obtain the left and right side values, which it then applys the supplied operator to and returns the result.

double eval(node* x) {
    if (x == nullptr)
        return 0;
    if (x->token.type == 0)
        return x->token.num;
    if (x->token.type == 1) {
        double lhs = eval(x->left);
        double rhs = eval(x->right);
        switch (x->token.op) {
            case '^': return pow(lhs, rhs);
            case '*': return lhs*rhs;
            case '/': 
                if (rhs == 0) {
                    cout<<"Error: divide by zero."<<endl;
                } else 
                    return lhs/rhs;
            case '+': return lhs+rhs;
            case '-': return lhs-rhs;
            default:
            break;
        }
    }
    return 0;
} 

Because there is now another step between lex and eval, we have to make one slight change to the driver:

void repl() { 
    string input;
    while (true) {
        cout<<"calc> ";
        getline(cin, input);
        if (input == "exit")
            return 0;
        cout<<eval(buildTree(lex(shuntingYard(input))))<<endl;
    } 
}


max@MaxGorenLaptop:/mnt/c/Users/mgoren$ ./expr
calc> 3+4/2*16-4
Push: 3
Push: 4
Push: 2
Push: /
Push: 16
Push: *
Push: 4
Push: -
Push: +
31



So there you have it: The shunting yard algorithm, and two methods of evaluating its output. That's all I've got for today, so until next time, Happy Hacking!

 


Leave A Comment