Parsing Array Subscript Operators with Recursive Descent
Well, It looks like I'm back with another Recursive Descent post. This time I'm going to learn you how to parse those pesky Nth dimensional array references and deeply nested object members using recursive descent. So grab some coffee, strap in, and lets learn how to take a piece of code like this:
arr[1][0][3]
And construct an AST like this:
[subscript expression][TK_LSQUARE, '[' ]
[subscript expression][TK_LSQUARE, '[' ]
[subscript expression][TK_LSQUARE, '[' ]
[id expression][Scope: global][TK_ID, 'arr' ]
[const expression][TK_NUM, 1 ]
[const expression]][TK_NUM, 0 ]
[const expression][TK_NUM, 3 ]
Let's do it.
The Grammar
The following expression grammar has everything that we need (and more) to explore parsing array subscripts and object members, but it helps to see where they fall in the overall expression grammar before we zoom in on the part we want to focus on.
Using the following EBNF grammar, we can see that the subscript operator has the highest precedence level, on par with the dot operator, which is used for accessing object members, and function calls.
assignexpr := simpleExpr { ':=' simpleExpr }*
simpleExpr := relExpr { == | != relExpr }*
relExpr := expr { < | > | <= | >= expr }*
expr := term { +|- term }*
term := factor { *|/ factor }*
factor := - factor | ! factor | subsExpr
subsExpr := primary { '(' argsList ')' | { '[' expr ']' | '.'primary }* }
primary := ID | NUM | (expr)
argsList := expr { , expr }*
You would certainly be forgiven for mistaking the subscript operator as a postfix operator at first glance. I mean, we stick it on the end of the Identifier dont we? Well, yes we do, but operator[] is actually parsed as an infix operator. We begin parsing a subscript operator with a call to primary(), which if the expression being parsed turns out to be a subscript expession, will be come the left child of the AST.
Primary is the 'bottom' of the grammar, which means its corresponding function is the 'end' of the call chain: regardless of If we match anything in primary or not, we start unwinding back up the call stack (or climbing back down the precedence ladder, depending on how you want to think about it).
astnode* Parser::primary() {
astnode* node;
switch (currSym()) {
case TK_ID: {
node = makeExprNode(ID_EXPR, current);
match(TK_ID);
}
break;
case TK_NUM: {
node = makeExprNode(CONST_EXPR, current);
match(TK_NUM);
}
break;
case TK_LPAREN: {
match(TK_LPAREN);
node = simpleExpr();
match(TK_RPAREN);
}
default:
break;
}
return node;
}
In the case of parsing an array element reference, the call to primary() should return an ID_EXPR node for the name of the array being accessed. At the moment, this would be the root of the AST under construction:
[id expression][Scope: global][TK_ID, 'arr' ]
Parsing Subscript Operators
The above grammar handles the case for a possible function call before handling the possibility of an subscript operator or dot operator, and so by convention so will we. The above grammar does not account for the possibility of a "multidiminesional function call", or something like function(1,2)(3,4) and while this situation is incredibly rare, if you do want to support it, it can be done by replacing the if statement with a while loop like is done for the subscript operator and dot operator.
astnode* Parser::subscript() {
astnode* m = primary();
if (currSym() == TK_LPAREN) {
m->exprType = FUNC_EXPR;
match(TK_LPAREN);
if (currSym() == TK_RPAREN) {
match(TK_RPAREN);
return m;
} else {
m->child[1] = argsList();
match(TK_RPAREN);
}
}
while (currSym() == TK_LSQUARE) {
astnode* t = makeExprNode(SUBSCRIPT_EXPR, current);
match(TK_LSQUARE);
t->child[0] = m;
m = t;
m->child[1] = primary();
match(TK_RSQUARE);
}
while (currSym() == TK_PERIOD) {
astnode* t = makeExprNode(OBJECT_DOT_EXPR, current);
match(TK_PERIOD);
t->child[0] = m;
m = t;
m->child[1] = primary();
}
return m;
}
The subscript() procedure uses tree re-writing and tail call elimination to process N-dimensional array references. As you will recall from a above, when the first call to primary() returns, it returns the ID_EXPR Node it created, which is assigned to the 'm' node pointer and logically, is the root of our AST.
Continuing with the parse process, the current symbol to match is the open left brace of a subscript expression, which we know how to handle.
while (currSym() == TK_LSQUARE) {
astnode* t = makeExprNode(SUBSCRIPT_EXPR, current);
match(TK_LSQUARE);
t->child[0] = m;
m = t;
m->child[1] = primary();
match(TK_RSQUARE);
}
A subscript expression node 't' is created, and we promote it to being the new root node by making the 'm' node the left most child of the new 't' node, and assigning 't' to 'm'. At this point, the AST has gone from looking like this:
[id expression][Scope: global][TK_ID, 'arr' ]
And transformed into looking like this:
[subscript expression][TK_LSQUARE, '[' ]
[id expression][Scope: global][TK_ID, 'arr' ]
We then assign the result of a call to simpleExpr() to the right child, which is what is the expression to be evaluated to determine which element to access, yielding us an AST for a one dimensional array reference:
[subscript expression][TK_LSQUARE, '[' ]
[id expression][Scope: global][TK_ID, 'arr' ]
[const expression][TK_NUM, 1 ]
On the next iteration of the loop the process starts a new, with the tree shown above becoming the leftmost child of a new SUBSCRIPT_EXPR node, with a new right node created for that dimensions subscript:
[subscript expression][TK_LSQUARE, '[' ]
[subscript expression][TK_LSQUARE, '[' ]
[id expression][Scope: global][TK_ID, 'arr' ]
[const expression][TK_NUM, 1 ]
[const expression]][TK_NUM, 0 ]
This process repeats to N dimensions. The keen eyed among you may have noticed the dot operator works teh same, and as such creating multi level object references is performed in much the same way, though ill leave that to you to convince yourself of.
That's all I've got for today, so until next time Happy Hacking!
-
Parsing Array Subscript Operators with Recursive Descent
-
Map & Filter in Scheme & C
-
Persistent Symbol Tables for Lexical Scoping
-
The Festival of 1 + n + f(n-1) Lights
-
The Heart of Pratt Parsing: Top-Down Operator Precedence
-
Compiling expressions to P-Code by AST Traversal
-
Ternary Search Tries: String Specific Ordered Symbol Tables
-
Digital Search Trees
-
Lossless Compression Part III: Huffman Coding
-
Lossless Compression Part II: The LZ77 Algorithm
Leave A Comment