Generating P-Code by AST Traversal

Initially designed in the 1970's as a "portable assembly" language to facilitate implementing Pascal compilers on the varying computer architectures of the day, P-Code is the progenator of all modern bytecode interpreters. The success of the UCSD Pascal system inspired others to adopt the concept, including Microsoft, who's early Visual Studio compilers for Visual Basic and Visual C++ both compiled to their own propietary form of P-Code. Probably the best known bitecode interpreter is the Java Virtual Machine. Java's early advertising slogan was "write once, run anywhere" - a concept championed by the earlier P-Code compilers.

Todays post is the first in a series, in whcih I am going to discuss the basics of generating P-Code from an abstract syntax tree. This post assumes that you already have a way to generate the AST. If not, now would be a good time to read my posts on recursive descent or the shunting yard algorithm. Up until now, my posts on compilers and interpreters have primarly focused on the front end phases. It is now that we turn our attention to the back end. So without further ado, let's get to it.

The P-Machine And Instruction Set

Despite being around in one form or another since the 1970's, their is no "P-Code Standard". There are several implementations of varying degree of similarity. There are the aformentioned UCSD and Microsoft implementations. In addition, many books on compilers include high level descriptions of p-code. Kenneth Louden's "Compiler Construction" contains a fairly good treatment of the subject. Wirth, the original author of the pascal language published a simplified version he dubbed pl/0 which he used for teaching, the original pascal code for which can be found on the pl/0 wikipedia page.

The P-Code Instruction set differs from modern byte code in that it was originally intended to be represented/transmitted as plain text. This is differs from the more modern "byte code", so named as each instruction is represented by a single byte and is much more machine oriented. Like most stack machines, the instruction set for p-code is both compact and simple. It is comprised of a few basic types of instructions, whos mnemonic and function I will briefly cover below.

P-Code instructions can be thought of a 3-tuple, where the first elenment is required, and the second two are optional depending on the instruction. When written, p-code generally appears in the following form:

(Instruction, Operand, Nest Level)

The instruction is as the name implies, the operation to be performed. The operand can be either a constant value, or a memory address, and the nest level is used to determine scope in the case of nested procedure definitions, and can be ignored (for now).

 

P-Code Instructions

Instructions for the P-Code machine can be grouped into categories: load/store operations for retrieving and storing data from memory and the stack, branching instructions for handling control flow, binary operators, relative operators, and calling sequence instructions for performing procedure calls, and System Instructions for managing I/O and the stack.

Load/Store Instructions

LDC - Load constant, pushes a value onto the stack
LOD - Load from address, pushes the value at the supplied address on to the stack.
LDA - Load Address, pushes the memory location of the supplied variable name on to the stack.

STO - Store
STN - Store non destructive
STP - Same as store, but with stack slots reversed.


Branching Instructions

Branching instructions are used for control flow, meaning they are used to implement if/else statements and loops.

LAB - label, for assisting in control flow
JMP - unconditional branch to supplied address/label.
JPC - jump to supplied address/label if the value on the to of the stack is logically false.

Calling Sequence Instructions

Don't worry about these for now, these are for another day.

MST - Mark stack, Allocates space on the stack for a new activation record.
CAL - calls a procedure/function
ENT - marks the entry point of a procedure
RET - return from a procedure

Binary Operators

The binary operators work by removing the top two items on the stack for use as operands, and replacing them with the result of applying the operator to the operands. The exception is the negation operator, which is a unary operator and thus only operates on the top item.

ADD - Addition
SUB - Subtraction
MUL - Multiplication
DIV -Division
MOD - Modulo (remainder)
NEG - Negate - *this is a Unary operator*

Relative Operatos

Relative operators are also binary operators: they consume the top two items on the stack, and replaces them with the result of applying the operator.

LT - less than
LTE - less than or equal
GT - greater than
GTE - greater than or equal
EQU - equal
NEQ - not equal

System Instructions

TS - Returns the address of the top of the stack operators, used for determining the stack size as well as address offsets during the setting up of activation records.
INC - Increment the top of stack register
DEC - Decrement the top of stack register
PRINT - Display the item on the top of the stack to STDOUT
HALT - Stop all execution

Now that we have an understanding of the p-machine and its instruction set, we can use  it as a target for compilation. Let's turn our attention to automating the generation of P-Code from an AST. 

The P-Machine Memory layout

P-Code uses retargetable addresses, meaning the addresses are relative to another position on the stack. There are three sections of memory in the pmachine (four actually, but I'm not counting the codepage). The three sections of memoy in the P-Machine are:

- the global address space - starting at location 6000, growing downward to location 5000.

- Heap space - fstarts at memory address 5000, and grows downwards towards the top of the stack.

- Stack space - starting at memory address 0 and growing upwards towards the bottom of heap memory, or location 5000, which ever comes first. When the top of stack pointer meets the bottom of the heap, all available memory has been consumed.

Compiling to P-Code

One of the design requirements of P-Code is that it is meant to be automatically generated by compilers, as well as be easy to execute on p-machine interpreters. That's not to say it isn't also supposed to be human readable and possible to hand code by programmers as well. P-Code was designed with pascal in mind, and as such, the instructions and memory layout are amenable to one-pass compilation.

Many early compilers skipped the explicit generation of an abstract syntax tree, and opted instead to directly output p-code from the parser. This was mainly due to the memory limitations of the hardware available at the time. Modern computers have no such limitations, I choose to generate an AST structure and perform a multi-pass compilation, with separate passes for creating the AST, building the symbol table, and code generation. The following node structure is the one i used to create AST's.

const int MAXCHILD = 3;
const int LEFTCHILD = 0;
const int RIGHTCHILD = 1;

struct ASTNode {
    NodeKind nk;
    union {
        ExprType expr;
        StmtType stmt;
    } type;
    Token data;
    ASTNode* next; 
    ASTNode* child[MAXCHILD];
    ASTNode() {
        next = nullptr;
        for (int i = 0; i < MAXCHILD; i++)
            child[i] = nullptr;
    }
};

The process of code generation is tightly coupled to the design of the abstract syntax tree. So any changes to the AST can lead to quite drastic changes in the code generation procedures. The actual code generation is done by traversing the tree and calling the proper helper function based on the node kind, and type.

The entrance for all code generation is the following procedure, creatively named genCode(). It performs the first step of generating code by determining if we are handling a statement, or an expression, and forwarding us to the appropriate procedure. 

void genCode(ASTNode* node) {
       if (node != nullptr) {
              switch (node->kind) {  
                     case EXPR_NODE:  genExpr(node);
                     case STMT_NODE:  genStmt(node);
              }
              genCode(node->next);
       }
}

At a high level, our AST is a list of statements and expressions, chained together by the next pointer. Once a statement node or expression node has been handled,  including recursively processing their child nodes, we move to the next statement and the process begins again.

Representing and Emitting Instructions

The tuples we used for textually representing P-Machine instructions is implemented Internally with the instructions being designated by an enumerated type, and it's operand and offset fields using the same Value type used by the P-Machine. The tuple is reprsented by the following data structure:

enum Inst {
    LDC, LDA, LOD, LRP, LDP, LDI, LDF, IXA,
    STO, STN, STP, LAB, MST, ENT, CAL, 
    RET, JMP, JPC, NEG, 
    ADD, SUB, MUL, DIV, MOD, 
    EQU, NEQ, LTE, GTE, LT, GT, 
   TS, INC, DEC, PRINT, HALT
};

struct Instruction {
    Inst instruction;
    Value operand;
    Value nestlevel;
    Instruction(Inst i = HALT, Value a = makeInt(0) , Value b = makeInt(0)) : instruction(i), operand(a), nestlevel(b) { }    
};

Instructions are stored together in an array named the codepage, which is indexed by the instruction pointer of the P-Machine. The list of instructions is collated in a vector by the codegenerator, and the position tracked by the cPos variable.

        vector<Instruction> codepage;
        int cPos;
        void emit(Inst op, Value operand, Value nestLevel) {
            codepage[cPos] = Instruction(op, operand, nestLevel);
            cPos++;
        }
        void emit(Inst op, Value operand) {
            codepage[cPos] = Instruction(op, operand);
            cPos++;
        }
        void emit(Inst inst) {
            codepage[cPos] = Instruction(inst);
            cPos++;
        }

The codepage vector is returned by the code generator to be executed by the P-Machine.

Evaluating Expressions with P-Code

As a warm up to help get accustomed to generating P-Code, lets take a look at evaluating simple expressions. Being a stack oriented architecture, P-Machines evaluate expressions in postfix.  As such, when using binary operators it is expected that the operands appear on the top of the stack. 

Evaluating: '13 + 29'

AST: 
       ( + )
       /    \
   (13)   (29) 

P-Code              Stack
------------------------------
(LDC, 13, 0)       [ 13 ]    
(LDC, 29, 0)       [ 13, 29 ]
(ADD, 0, 0)        [ 42 ]

Having determined the node being processed is an expression node, we arrive at the genExpr() expression, which will aid in figuring out the type of expression we are evaluating. As part of our warm up, we can handle binary operators, unary minus, and numerrical constants. This is enough to implement a simple calculator, and will help us get a feel for how code generation works.

void genExpr(ASTNode*  node) {
       if (node != nullptr) {
              switch (node->expr.type) {
                     case BINOP_EXPR: {
                            genBinOpExpr(node); break;
                      } break;
                      case UNOP_EXPR: {
                           genExpr(node->child[LEFTCHILD]);
                           emit(NEG);
                      } break;
                     case CONST_EXPR: {
                          emit(LDC, makeInt(node->data.strval), 0);
                     } break;
              }
       }
}

CONST_EXPR nodes, containing constant values generate LDC (Load Constant) instructions, which has the effect of the value the node represents being placed on the stack. Binary operators require their operands to already be on the stack when they are applied.  To ensure we generate the correct instruction sequence from the AST, we employ a post order traversal by calling genExpr() recursively on our child nodes, as they will either be CONST_EXPR nodes, or more BINOP_EXPR or UNOP_EXPR nodes. Take note of the indirect recursion being used for processing the children - we will be making use of this type of recursion alot as we move on to more complex code generation. 

void genBinOpExpr(ASTNode* node) {
       if (node != nullptr) {
               genExpr(node->child[LEFTCHILD]);
               genExpr(node->child[RIGHTCHILD]);
               switch (node->attributes.symbol) {
                     case TK_ADD:  emitOp("ADD"); break;
                     case TK_SUB:  emitOp("SUB"); break;
                     case TK_MUL:  emitOp("MUL"); break;
                     case TK_DIV:  emitOp("DIV"); break;
                    default: break;
               };
       }
}

At the moment the only unary operator being implemented is unary minus so as to facilitate supporting negative numbers. Unary Expressions are processed by recurring on their child, and then issuing a NEG (negate) instruction to apply the (-) oeprator to it's single argument.

Now lets expand our code generator to include named variables and assignment, but to  do that, were going to need to have a small chat about symbol tables.

Symbol Tables for Code Generation, Take 1

Symbol tables and Compilers are generally tightly coupled through several - or all - pases of compilation. For todays post where we wont be concerning ourselves with scoping or constrol statements, a simple map of variable names to memory locations in the global address space is all thats required. Still, we will laydown some of the framework needed for our symbol table to handle scoping as well.

Having strings for keys, the obvious and indeed the standard choice of data structure for a compilers symbol table is a hash table. I implemented a chained table using table size that is a power of 2. This allows me to determine table indices by using bitwise AND instead of modulus, which can offer a decent boost in performance. The hash function, Bernsteins hash, is a popular choice for compilers, offering a good distribution of entries.

const int TABLE_MAX = 256;

int hashf(string name) {
       int h = 5781;
       for (char c : name) {
             h = (33*h+ (int)c);
       }
       return h & (TABLE_MAX-1);
}

For now, the only thing our entries will be storing is the identifiers and the associated memory location to be found. Being a chained table for collision resolution, each entry also contains a next pointer for creating a linked list of entries which hash to the same slot.

struct STEntry {
       string name;
       int location;
       STEntry* next;
       STEntry(string id, int addr) : name(id), location(addr) { }
};

Speaking of linked lists, our symbol table is actually a linked list of hash tables. This is because when we add procedures and scoping they will be implemented by each scope having it's own symbol table, reachable from the "global" symbol table, which represents the top scope level.

struct Scope {
       STEntry* table[TABLE_MAX];
       int numEntries;
       Scope* enclosing;
       Scope() {
            for (int i = 0; i < TABLE_MAX; i++)
                 table[i] = nullptr;
            enclosing = nullptr;
       }
};

For now though, we need only concern ourselves with the top level scope, and save procedure calls for another day. As such, the only two operations we need to support (for now) are search and insertion into the top scope.

class SymbolTable {
     private:
          Scope* scope;
          int globalAddr;
     public:
          SymbolTable() {
                  globalAddr = 5000;
                  scope = new Scope();
                  scope->enclosing = nullptr;
          }
          void insert(string key) {
               int idx = hashf(name);
               for (STEntry* it = scope->table[idx]; it != nullptr; it = it->next) {
                    if (it->name == name)
                         return;
               }
               STEntry* nent = new STEntry(key, --globalAddr, scope->table[idx]);
               scope->table[idx] = nent;
               scope->numEntries++;
          }
          int get(string key) {
               int idx = hashf(name);
               for (STEntry* it = scope->table[idx]; it != nullptr; it = it->next) {
                    if (it->name == name)
                         return scope->table[idx]->location;
               }
               return -1;
          }
};

Now that we have the ability to associate names with memory addresses, lets add that capability to the compiler itsself.

Compiling ID Expressions and Assignment

Variable names are a type of expression, which evaluate to the value they represent. As such, variable names are represented in the AST as EXPR_NODES nodes of type ID_EXPR. They are handled exactly the same as constant values until we get to genExpr(), where we add a case statement for ID_EXPR to the code shown above. We have also introduced a new parameter to the genExpr() procedure, isLValue. This parameter is used when handling ID_EXPR's and is for the purpose of determining which side of an assignment the variable is on, as that also determines which instruction to emit. If we are on the left side of an assignment, we emit a load address instruction, and if are on the right side we emit a load from address instruction.  In other words, are we interested in the value represented by the name, or the location.

void genExpr(ASTNode*  node,  bool isLValue) {
       if (node != nullptr) {
              switch (node->expr.type) {
                     case BINOP_EXPR: 
                    /*
                        continues same as above....
                  */
                   case ID_EXPR: {
                        int addr = st.get(node->data.strval);
                        if (addr == -1) {
                                cout<<"Error: undeclared variable in use."<<endl;
                                return;
                        }
                         if (isLValue) {
                            emit(LDA, makeInt(addr));
                        } else {
                           emit(LOD, makeInt(addr));
                       }
                   } break;
              }
       }
}

Some languages make assignment a statement, others make it an expression. When assignment is an expression, it makes the assignment operator a binary operator witht the highest level of precedence. Like the other binary operrators, the assignment operator expects its two operands to already be on the stack, and as such we perform the familiar postorder traversal, this time taking note of the isLValue flag:

void genExpr(ASTNode*  node,  bool isLValue) {
       if (node != nullptr) {
              switch (node->expr.type) {
                    case ID_EXPR: 
                    /*
                        continues same as above....
                    */
                    case ASSIGN_EXPR: {
                           genExpr(node->child[LEFTCHILD], true);
                           genExpr(node->child[RIGHTCHILD], false);
                           emit(STO);
                   } break;
              };
       }
}

Ok, so by now you're probably wondering if the code generator gets the memory locations from the symbol table by looking up the identifiers in the symbol table, then how did they get in the symbol table? There are (at least) two ways we can go about populating the symbol table, one option open to us is the one employed by one-pass compilers, and add the names to the symbol table as we encounter them during parsing.

This is a perfectly reasonable thing to do, but it does make some features we may want to add later more difficult to implement. Seeing as we're already making a separate pass for our code generation we may as well extract building the symbol table to it's own pass as well to simplify things for us as much as possible. Speaking of making things simple for us, we will borrow one concept from the one-pass compilers: declare before use. Making the language mandate that programmers declare all variables ensures that we'll always be able to find the refered to variable in the symbol table.

A Symbol Table Building Traversal

Building the symbol table is done via a preorder traversal of the AST. Unlike for codegeneration, there is no complex mutual recursive calls, or conditional flags to pass between calls. We also have far, far fewer cases to handle than we do for code generation. 

Because we are currently only  concerning ourselves with a single namespace and no procedures, the only cases we need to handle in our symbol table building procedure is the let statement, for declaring variable names, which is handled by inserting the identifier into the symbol table if it doesnt exist, and ID expressions, which simply requires we check the name exists in the symbol table, and report the error if it does not.

void buildST(ASTNode* node) {
      if (node != nullptr) {
            switch (node->nk) {
                  case STMT_NODE: {
                         switch (node->type.stmt) {
                                case LET_STMT: { 
                                   if (st.get(node->data.strval) == -1) {
                                        st.insert(data.strval);
                                   }
                                } break;
                                default: break;
                         }
                  } break;
                 case EXPR_NODE: {
                        switch (node->type.expr) {
                               case ID_EXPR: {
                                     if (st.get(node->data.strval) == -1) {
                                              cout<<"Error: undeclared variable in use"<<endl;
                                     }
                               } break;
                        }
                 } break;
            }
           for (int i = 0; i < MAX_CHILD; i++) {
                 buildST(node->child[i]);
           }
           buildST(node->next);
      }
}

Armed with our symbol table and code generation routines we can put everything together to begin compiling our source do p-code instructions. 

Putting It All Together

We're finally in a position where we can begin compiling simple straight line programs, simple as they may be. Let's take our compiler for a little test drive of what we've got so far by compiling the following code. 

let a := 13;
let b := 29;
let ans := a + b;

As previously stated the jop of our compiler is to take the AST generated by the parser and produce instructions for the P-Machine. We first to a symbol table building pass over the AST, followed by the actual code generation pass. The result is stored in a vector, and returned from the compiler for execution.

vector<Instruction> compile(ASTNode* ast) {
       buildST(ast);
       genCode(ast);
       return pcode;
}

The example listed above, after being processed by our compiler will output the following pcode instructions, which can than be executed by the virtual machine:

( LDA, 5000, 0)
( LDC, 13.000000, 0)
( STN, 0, 0)
( LDA, 4999, 0)
( LDC, 29.000000, 0)
( STN, 0, 0)
( LDA, 4998, 0)
( LOD, 5000, 0)
( LOD, 4999, 0)
( ADD, 0, 0)
( STN, 0, 0)
(HALT, 0, 0)

Ok, I think this is where I will stop for today, we've covered alot in todays post. In the next post I will cover control statements and loops, until then Happy Hacking!


Leave A Comment