Parsing Array Subscript Operators with Recursive Descent

mgcadmin 18-01-2025

Well, It looks like I'm back with another Recursive Descent post. This time I'm going to learn you how to parse those pesky Nth dimensional array references and deeply nested object members using recursive descent. So grab some coffee, strap in, and lets learn how to take a piece of code like this:

arr[1][0][3]

And construct an AST like this:

    [subscript expression][TK_LSQUARE, '[' ]
      [subscript expression][TK_LSQUARE, '[' ]
        [subscript expression][TK_LSQUARE, '[' ]
          [id expression][Scope: global][TK_ID, 'arr' ]
          [const expression][TK_NUM, 1 ]
        [const expression]][TK_NUM, 0 ]
      [const expression][TK_NUM, 3 ]

Let's do it.

The Grammar

The following expression grammar has everything that we need (and more) to explore parsing array subscripts and object members, but it helps to see where they fall in the overall expression grammar before we zoom in on the part we want to focus on.

Using the following EBNF grammar, we can see that the subscript operator has the highest precedence level, on par with the dot operator, which is used for accessing object members, and function calls.

assignexpr := simpleExpr { ':=' simpleExpr }*
simpleExpr := relExpr { == | != relExpr }*
relExpr := expr { < | > | <= | >=  expr }*
expr  :=  term { +|- term }*
term := factor { *|/ factor }*
factor :=  - factor | ! factor | subsExpr
subsExpr :=  primary {  '(' argsList ')'  |  { '[' expr ']' | '.'primary }* }
primary :=  ID | NUM | (expr)
argsList := expr { , expr }*

You would certainly be forgiven for mistaking the subscript operator as a postfix operator at first glance. I mean, we stick it on the end of the Identifier dont we? Well, yes we do, but operator[] is actually parsed as an infix operator. We begin parsing a subscript operator with a call to primary(), which if the expression being parsed turns out to be a subscript expession, will be come the left child of the AST.

Primary is the 'bottom' of the grammar, which means its corresponding function is the 'end' of the call chain: regardless of If we match anything in primary or not, we start unwinding back up the call stack (or climbing back down the precedence ladder, depending on how you want to think about it).

 astnode* Parser::primary() {
    astnode* node;
    switch (currSym()) {
        case TK_ID: {
             node = makeExprNode(ID_EXPR, current);
             match(TK_ID); 
        }
        break;
        case TK_NUM: {
             node = makeExprNode(CONST_EXPR, current);
             match(TK_NUM);
        }
        break;
        case TK_LPAREN: {
             match(TK_LPAREN);
             node = simpleExpr();
             match(TK_RPAREN);
        }
        default:
           break;
    }
    return node;
}

In the case of parsing an array element reference, the call to primary() should return an ID_EXPR node for the name of the array being accessed. At the moment, this would be the root of the AST under construction:

  [id expression][Scope: global][TK_ID, 'arr' ]

Parsing Subscript Operators

The above grammar handles the case for a possible function call before handling the possibility of an subscript operator or dot operator, and so by convention so will we. The above grammar does not account for the possibility of a "multidiminesional function call", or something like function(1,2)(3,4) and while this situation is incredibly rare, if you do want to support it, it can be done by replacing the if statement with a while loop like is done for the subscript operator and dot operator.

astnode* Parser::subscript() {
    astnode* m = primary();

    if (currSym() == TK_LPAREN) {
        m->exprType = FUNC_EXPR;
        match(TK_LPAREN);
        if (currSym() == TK_RPAREN) {
            match(TK_RPAREN);
            return m;
        } else {
            m->child[1] = argsList();
            match(TK_RPAREN);
        }
    }

    while (currSym() == TK_LSQUARE) {
        astnode* t = makeExprNode(SUBSCRIPT_EXPR, current);
        match(TK_LSQUARE);
        t->child[0] = m;
        m = t;
        m->child[1] = primary();
        match(TK_RSQUARE);
    }

    while (currSym() == TK_PERIOD) {
        astnode* t = makeExprNode(OBJECT_DOT_EXPR, current);
        match(TK_PERIOD);
        t->child[0] = m;
        m = t;
        m->child[1] = primary();
    }
    return m;
}

The subscript() procedure uses tree re-writing and tail call elimination to process N-dimensional array references. As you will recall from a above, when the first call to primary() returns, it returns the ID_EXPR Node it created, which is assigned to the 'm' node pointer and logically, is the root of our AST.

Continuing with the parse process, the current symbol to match is the open left brace of a subscript expression, which we know how to handle.

    while (currSym() == TK_LSQUARE) {
        astnode* t = makeExprNode(SUBSCRIPT_EXPR, current);
        match(TK_LSQUARE);
        t->child[0] = m;
        m = t;
        m->child[1] = primary();
        match(TK_RSQUARE);
    }

A subscript expression node 't' is created, and we promote it to being the new root node by making the 'm' node the left most child of the new 't' node, and assigning 't' to 'm'. At this point, the AST has gone from looking like this:

  [id expression][Scope: global][TK_ID, 'arr' ]

And transformed into looking like this:

 [subscript expression][TK_LSQUARE, '[' ]
          [id expression][Scope: global][TK_ID, 'arr' ]

We then assign the result of a call to simpleExpr() to the right child, which is what is the expression to be evaluated to determine which element to access, yielding us an AST for a one dimensional array reference:

  [subscript expression][TK_LSQUARE, '[' ]
          [id expression][Scope: global][TK_ID, 'arr' ]
          [const expression][TK_NUM, 1 ]

Parsing k-D array references

Having been implemented using tail recursion elimeination the process starts a new on the next iteration of the loop. The completed tree shown above for a 1D array access now becomes the leftmost child of a new SUBSCRIPT_EXPR node, with a new right node created for the new dimensions subscript:

   [subscript expression][TK_LSQUARE, '[' ]
        [subscript expression][TK_LSQUARE, '[' ]
             [id expression][Scope: global][TK_ID, 'arr' ]
             [const expression][TK_NUM, 1 ]
        [const expression]][TK_NUM, 0 ]

This process continues out to N dimensions until it runs out of subscript expressions to parse. Additionally, the dot operator - used for accessing object members - has the same associativity and precedence as the subscript operator. Parsing multi level object member references is performed in much the same way as array references, and you are certainly encouraged to convince yourself of this using the code presented.

That's all I've got for today, so until next time Happy Hacking!

MaxGCoding.com

For the love of programming

Parsing Array Subscript Operators with Recursive Descent

The Grammar

Parsing Subscript Operators

Parsing k-D array references

Leave A Comment

The Aho, Sethi, Ullman Direct DFA Construction Part 2: Building the DFA from Followpos

The Aho, Sethi, Ullman Direct DFA Construction, Part 1: Constructing the Followpos Table

Procedural Map Generation with Binary Space Partitioning

Exact Match String Searching: The Knuth-Morris-Pratt Algorithm

The visitor pattern: OOP takes on the Expression Problem

Improving mgcLisp's define syntax

Separating the Men from the Boys, Knuth Style

Reducing rotations during deletion from AVL trees

Parsing Lisp: From Data To Code

Swiz Heaps: A deterministic modification of Randomized Meldable Heaps