Persistent Symbol Tables for Lexical Scoping
Symbol tables are a fundamental part of compilers. Depending on the compiler and the semantics of the language being compiled different symbol tables may be used for different phases of compilation, or one symbol table may be used through the entire compilation process. The primary purpose of symbol tables, as the name would imply, is to associate symbols such as variable names, procedure names, etc to a value, such as memory addresses. By far the most common data structure employed for this purpose are hash tables.
In addition to it's role of mapping names to values/addresses, symbol tables are also an important part of tracking the scope of a variable as well as the distance from it's declaring scope at any point. This is crucial if your language supports lexical scoping with nested procedures, or closures, and aids greatly during code generation or interpretation.
In todays post I'm going to cover one such method for implementing a symbol table that preserves a scopes state after the exiting that scope. The concept of how to implement such symbol table isnt particularly difficult, but it takes quite a number of different data structures working in concert to make it all happen efficiently. So without further ado, lets start getting some of these data structures set up how we need them.
Scoping with Persistence
In order to properly implement lexical scoping we need to construct the symbol table in such a way that after performing a preorder traversal on an abstract syntax tree, we are left with a symbol table that models the lexicographical structure of the code. This results in our symbol table being arranged like a tree, with the nodes being scopes, and the global scope being the root node of the tree.
Inorder for us to construct this kind of symbol table we are going to combine some of the functionality from three commonly used data structures:
- hash tables for binding names to addresses
- stacks for entering and exit scopes
- parent pointer trees for accessing enclosing scopes
class ScopingSymbolTable {
private:
Scope* scope;
int scopeDepth;
STEntry* get(string name);
Scope* insertProcedure(string name);
Scope* getProcedure(string name);
public:
ScopingSymbolTable() {
scope = new Scope();
scopeDepth = 0;
}
void insertVar(string name);
LocalVar* getVar(string name);
void openScope(string name);
void closeScope() ;
};
Our Interface is simple, we can open a new scope or existing scope by name, close the current scope, and Insert and retrieve information about local variables from the current scope. LocalVar's and Scopes are represented by their own data structures, which I'll go over next.
Representing Scopes In a Symbol Table
Each scope of our symbol table is it's own hashtable, from which we will be able to access variables local to that scope, find scopes which the current scope encloses as well as the scope which encloses the current scope and the variables local to it. As such each Scope structure has an array of STEntry objects, an object counter which is utilized for generating address offsets in stack frames, and a pointer to the scope which encloses it.
The hash function I've used is bernsteins hash which tends to work well in compilers, though you can use whichever hash function you prefer. I've used a table size that is a power of 2 so that we can use bitwise AND to obtain the table slot instead of modulo TABLE_SIZE.
For collision resolution I've gone with separate chaining so that we dont have to resize the table: search and insertion times might get a bit slower if entries start numbering into the thousands, which for a single scope isn't likely to happen. You can of course adjust TABLE_MAX for your own use case.
const int TABLE_MAX = 256;
int hashf(string name) {
int h = 5781;
for (char c : name) {
h = (33*h+ (int)c);
}
return h & (TABLE_MAX-1);
}
struct Scope {
int numEntries;
STEntry* table[TABLE_SIZE];
Scope* enclosing;
Scope() {
numEntries = 0;
enclosing = nullptr;
for (int i = 0; i < TABLE_SIZE; i++) {
table[i] = nullptr;
}
}
STEntry* lookup(string name) {
int idx = hashf(name);
for (STEntry* it = x->table[idx]; it != nullptr; it = it->next) {
if (it->name == name) {
return it;
}
}
return nullptr;
}
bool addVar(string name) {
int idx = hashf(name);
for (STEntry* it = scope->table[idx]; it != nullptr; it = it->next) {
if (it->name == name)
return false;
}
int addr = nextFreeAddress();
STEntry* nent = makeLocalVarEntry(name, addr, scopeDepth);
nent->next = scope->table[idx];
scope->table[idx] = nent;
return true;
}
Scope* insertProcedure(string name) {
int idx = hashf(name);
for (STEntry* it = scope->table[idx]; it != nullptr; it = it->next) {
if (it->name == name)
return it->procedure;
}
Scope* ns = new Scope();
STEntry* nent = makeProcedureEntry(name, ns);
nent->next = scope->table[idx];
scope->table[idx] = nent;
return ns;
}
};
int nextFreeAddress() {
if (scopeIsGlobal()) {
addr = localAddr;
localAddr -= 1;
} else {
addr = scope->numEntries;
scope->numEntries += 1;
}
return addr;
}
Search and insertion both use pretty "textbook" algorithms for a chained hash table. The only thing of note is that when adding an entry to the table, we first check the entries in the bucket the variable hashed to to avoid allocating the same name twice in the same scope. Having used chaining for the collision resolution, checking for collisions is as simple as walking a linked list.
Speaking of table entries, lets discuss representing variables and scopes in our symbol table.
Variable & Scope Information as Symbol Table Entries
To facilitate our collision resolution strategy, each entry has a pointer to another entry so that we can create a linked list. Any names in the same scope which hash to the same slot on the table get added to the list.
A tagged union is perfect for allowing our symbol table to be able to store information about either local variables or nested scopes efficiently. An enumerated type is used as a flag to indicate which member of the union is in use, with an additional flag to indicate an empty entry.
enum DefType {
VARDEF, PROCDEF, EMPTY
};
struct STEntry {
string name;
DefType type;
union {
LocalVar* localvar;
Scope* procedure;
};
STEntry* next;
STEntry(string n = "") : name(n), next(nullptr) { }
};
When storing local variables to our symbol table we need at the minimum an address to associate it with in the run time environment, we may additionally choose to store type information, or a depth off set (though this can be calculated on the fly).
struct LocalVar {
int loc;
int depth;
LocalVar(int l = 0, int d = 0) : loc(l), depth(d) { }
};
Finding Symbols
LocalVar* getVar(string name) {
STEntry* ent = get(name, VARDEF);
return ent->type == EMPTY ? nullptr:ent->localvar;
}
STEntry* get(string name, DefType type) {
Scope* x = scope;
int idx = hashf(name);
while (x != nullptr) {
STEntry* entry = x->lookup(name);
if (entry != nullptr && entry->type == type)
return entry;
x = x->enclosing;
}
return makeEmptyEntry();
}
Inserting Localvars
Inserting a Localvar is accomplished as a standard insertion into the current scopes hashtable. After hashing the name of the variable to be inserted, we check the current scope to make sure that name hasnt already been declared, otherwise we exit with a failure to insert. In my implementation, I choose this point to assign the address offset that will be associated to the variable name in the stack frame or global namespace.
void insertVar(string name) {
scope->insertVar(name);
}
Inserting a procedure works almost identically to inserting a localvar, the difference being that the value being inserted is a pointer to the scope referenced by the name we are inserting.
Scope* insertProcedure(string name) {
Scope* ns = scope->insertProcedure(name);
ns->enclosing = scope;
return ns;
}
Scope* getProcedure(string name) {
STEntry* ent = get(name, PROCDEF);
return ent->type == PROCDEF ? ent->procedure:nullptr;
}
Managing Scopes And Procedures
And finally we come to what we've been laying the groundwork for: Handling scope. Upon encountering a procedure definition or block (if your language is block scoped) we "open" a scope by first checking if there has already been a scope declared for the procedure name being referenced, and creating one if it does not yet exist. This is done by allocating a new hashtable with its enclosing pointer pointing to the current scope, and setting the newly instantiated hashtable as the current scope. The new current scope is added as an entry to the hashtable in it's enclosing scope.
void openScope(string name) {
Scope* st = getProcedure(name);
if (st == nullptr) {
st = insertProcedure(name);
}
st->enclosing = scope;
scope = st;
scopeDepth++;
}
When we exiting a scope or returning from procedure or function we close the scope by following the enclosing pointer to the enclosing scope, making it the current scope. The only time this isnt the case is when we are already at the lowest scope level.
void closeScope() {
if (scope->enclosing != nullptr) {
scope = scope->enclosing;
scopeDepth--;
}
}
The Result
I used the same implementation above in my pcode compiler, lets run a copuple tests and see how it shakes out:
let j := 13;
let k := 42;
[LET_STMT] [TK_ID, j]
[CONST_EXPR] [TK_NUM, 13]
[LET_STMT] [TK_ID, k]
[CONST_EXPR] [TK_NUM, 42]
Symbol Table:
j: Localvar: 5000, Scalar.
k: Localvar: 4999, Scalar.
And some thing a bit more involved:
func A(x, y) {
let m := 1;
let n := 2;
func B(var q) {
let t := 7
return q + t;
};
return B(m+x) * n+y;
}
[EXPR_STMT] [TK_AMPER, func]
[LAMBDA_EXPR] [TK_ID, A]
[LET_STMT] [TK_ID, x]
[LET_STMT] [TK_ID, y]
[LET_STMT] [TK_ID, m]
[CONST_EXPR] [TK_NUM, 1]
[LET_STMT] [TK_ID, n]
[CONST_EXPR] [TK_NUM, 2]
[EXPR_STMT] [TK_AMPER, func]
[LAMBDA_EXPR] [TK_ID, B]
[LET_STMT] [TK_ID, q]
[LET_STMT] [TK_ID, t]
[CONST_EXPR] [TK_NUM, 7]
[RETURN_STMT] [TK_RETURN, return]
[BINOP_EXPR] [TK_ADD, +]
[ID_EXPR] [TK_ID, q]
[ID_EXPR] [TK_ID, t]
[RETURN_STMT] [TK_RETURN, return]
[BINOP_EXPR] [TK_ADD, +]
[BINOP_EXPR] [TK_MUL, *]
[FUNC_EXPR] [TK_ID, B]
[BINOP_EXPR] [TK_ADD, +]
[ID_EXPR] [TK_ID, m]
[ID_EXPR] [TK_ID, x]
[ID_EXPR] [TK_ID, n]
[ID_EXPR] [TK_ID, y]
A: Procedure: A
B: Procedure: B
q: Localvar: 0, Scalar.
t: Localvar: 1, Scalar.
m: Localvar: 2, Scalar.
n: Localvar: 3, Scalar.
x: Localvar: 0, Scalar.
y: Localvar: 1, Scalar.
j: Localvar: 5000, Scalar.
k: Localvar: 4999, Scalar.
Well, thats all i've got for today, so until next time Happy Hacking!
For an example of this style of symbol table in action, I use this exact method in my DALGOL p-code compiler
-
Parsing Array Subscript Operators with Recursive Descent
-
Map & Filter in Scheme & C
-
Persistent Symbol Tables for Lexical Scoping
-
The Festival of 1 + n + f(n-1) Lights
-
The Heart of Pratt Parsing: Top-Down Operator Precedence
-
Compiling expressions to P-Code by AST Traversal
-
Ternary Search Tries: String Specific Ordered Symbol Tables
-
Digital Search Trees
-
Lossless Compression Part III: Huffman Coding
-
Lossless Compression Part II: The LZ77 Algorithm
Leave A Comment