Parsing Lisp: From Data To Code
Normally when discussing parsing were concerned with assosciativity and operator precedence for constructing Abstract Syntax Trees using either recursive descent, or some stack driven algorithms for subsuming LL(1) or LALR(1) based grammars. When it comes to Lisp however, the process is actually much closer in nature and feeling to performing Lexical Analysis - the tokenization stage - of other languages. This is because where most modern (and not so modern!) programming languages use infix operators Lisp has instead opted for prefix operators which are no different from any other function call.
Learning lisp is often spoken of tongue-in cheekly as being a "spirtual" experience, akin to attaining "programming enlightenment". In that regard, parsing lisp expressions is somewhat of a "transendental experience": the first time you do it, you start to grasp what homoiconicity is really about.
Atoms and Lists
The goal of parsing an s-expression is to convert the textual representation into a list of Atoms. Specifically, we are interested in converteing symbols, numbers, and lists from their textual context to their object representation. The following structures are used to implement atoms and lists.
enum AtomType {
AS_NUM, AS_SYMBOL AS_BINDING, AS_FUNCTION,
AS_LIST, AS_NIL, AS_ERROR, AS_STRING
};
typedef struct {
enum AtomType type;
union {
int intval;
String* stringval;
List* listval;
Function* funcval;
Binding* bindval;
};
int mark;
} Atom;
typedef struct listnode_ {
Atom* info;
struct listnode_* next;
} listnode;
typedef struct List {
listnode* head;
listnode* tail;
int count;
} List;
I choose to implement lists using an underlying coincrete linked list. Other implementation strategies exist which involve using a concrete cons cell structure like the following instead:
typedef struct {
Atom* car_;
Atom* cdr_;
} ConsCell;
While this is certainly a viable method, and perhaps a bit closer to a "pure" interpretation, but it adds a degree of complexity over the first method.
From Strings to Lists
The basic strategy to convert from a textual context to object context is to loop over the input string and when we encounter an opening parentheses we call the parse function recursively, when we encounter a close paren, we return the current list being built, and otherwise we convert numbers to number Atoms and Symbols to Symbol Atoms... And that's pretty much it. In fact, that description is a pretty faithful description of the actual code its self:
int t = 0;
List* stringToList(char* buff) {
List* result = createList();
int i = 0;
bool quoted = false;
if (buff[i] == '(') { i++; }
for (; buff[i];) {
if (shouldSkip(buff[i])) { i++; continue; }
if (buff[i] == '\'') { quoted = true; i++; }
if (isalpha(buff[i]) || isSpecialChar(buff[i])) {
result = addToList(result, parseSymbol(buff, &i), quoted);
} else if (isdigit(buff[i])) {
result = addToList(result, parseNumber(buff, &i), quoted);
} else if (buff[i] == '(') {
result = addToList(result, makeListAtom(stringToList(buff+i)), quoted);
i = i+t+1;
} else if (buff[i] == ')') {
t = i;
return result;
}
quoted = false;
}
return result;
}
As we would while tokenizing any language, white space characters are discarded (they're useful for human readers, but not needed for computers). Whitespace characters include blank spaces, tabs, as well as new line and carriage return characters.
bool shouldSkip(char c) {
return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}
Next, we need to be able to identify acceptable symbols for identifiers, including primitives like + and /. Compared to most languages Scheme has a very small set special characters, small enough to fit in a single (ugly) boolean expression.
bool isSpecialChar(char c) {
return (c == '-' || c == '?' || c == '+' || c == '!' || c == '*' || c == '/' || c == '\'' || c == '&' || c == '%' || c == ':');
}
As mentioned above, from the textual context the only things we need to recognize are lists, symbols, and numbers. Symbols and numebrs are tokenized in the usual way: Notice how the index variable is passed by reference so our position in the input string is reflected back in the calling procedure.
Atom* parseSymbol(char buff[], int* i) {
int k = *i;
while (isalpha(buff[k]) || isSpecialChar(buff[k]))
k++;
int len = k - *i;
char* sub = strndup(buff+*i, len);
*i += len;
return makeSymbolAtom(makeString(sub, len));
}
Atom* parseNumber(char buff[], int* i) {
int val = 0, j = *i;
while (isdigit(buff[j])) {
val = 10 * val + (buff[j++]-'0');
}
*i=j;
return makeIntAtom(val);
}
When constructing the list, the one other thing we need to pay attention to is whether values have been quoted or not, as we need to preserve such information.
List* addToList(List* addTo, Atom* item, bool isquoted) {
if (isquoted) {
List* nl = createList();
nl = appendList(nl, makeSymbolAtom(makeString("'", 1)));
nl = appendList(nl, item);
item = makeListAtom(nl);
}
addTo = appendList(addTo, item);
return addTo;
}
Once we have our list, it is up to Eval in conjunction with the runtime environment to determine the meaning of the various symbols, as covered in my previous post on the eval/apply cycle.
Thats all I've got for today, Happy Hacking!
-
Parsing Lisp: From Data To Code
-
Swiz Heaps: A deterministic modification of Randomized Meldable Heaps
-
Let's talk Eval/Apply
-
BST Deletion: Removal By Merge
-
Dictionary Based Compression: The LZW Algorithm
-
Taking Action: Compiling Procedures to P-Code
-
Making Decisions: Compiling If Statements to P-Code
-
Repeating yourself: Compiling While Loops to P-Code
-
Removing an entry from a B+ Tree without Rebalancing: A viable approach?
-
Implementing An Iterator for In-Memory B-Trees
Leave A Comment