Parsing Lisp: From Data To Code

mgcadmin 19-04-2025

Normally when discussing parsing were concerned with assosciativity and operator precedence for constructing Abstract Syntax Trees using either recursive descent, or some stack driven algorithms for subsuming LL(1) or LALR(1) based grammars. When it comes to Lisp however, the process is actually much closer in nature and feeling to performing Lexical Analysis - the tokenization stage - of other languages. This is because where most modern (and not so modern!) programming languages use infix operators Lisp has instead opted for prefix operators which are no different from any other function call.

Learning lisp is often spoken of tongue-in cheekly as being a "spirtual" experience, akin to attaining "programming enlightenment". In that regard, parsing lisp expressions is somewhat of a "transendental experience": the first time you do it, you start to grasp what homoiconicity is really about.

Atoms and Lists

The goal of parsing an s-expression is to convert the textual representation into a list of Atoms. Specifically, we are interested in converteing symbols, numbers, and lists from their textual context to their object representation. The following structures are used to implement atoms and lists.

enum AtomType {
    AS_NUM, AS_SYMBOL AS_BINDING, AS_FUNCTION, 
   AS_LIST,  AS_NIL, AS_ERROR, AS_STRING
};

typedef struct {
    enum AtomType type;
    union {
        int intval;
        String* stringval;
        List* listval;
        Function* funcval;
        Binding* bindval;
    };
    int mark;
} Atom;

typedef struct listnode_ {
    Atom* info;
    struct listnode_* next;
} listnode;

typedef struct List {
    listnode* head;
    listnode* tail;
    int count;
} List;

I choose to implement lists using an underlying coincrete linked list. Other implementation strategies exist which involve using a concrete cons cell structure like the following instead:

typedef struct {
     Atom* car_;
     Atom* cdr_;
} ConsCell;

While this is certainly a viable method, and perhaps a bit closer to a "pure" interpretation, but it adds a degree of complexity over the first method.

From Strings to Lists

The basic strategy to convert from a textual context to object context is to loop over the input string and when we encounter an opening parentheses we call the parse function recursively, when we encounter a close paren, we return the current list being built, and otherwise we convert numbers to number Atoms and Symbols to Symbol Atoms... And that's pretty much it. In fact, that description is a pretty faithful description of the actual code its self:

int t = 0;
List* stringToList(char* buff) {
    List* result = createList();
    int i = 0;
    bool quoted = false;
    if (buff[i] == '(') { i++; }
    for (; buff[i];) {
        if (shouldSkip(buff[i])) { i++; continue; }
        if (buff[i] == '\'') {  quoted = true; i++; }
        if (isalpha(buff[i]) || isSpecialChar(buff[i])) {
            result = addToList(result, parseSymbol(buff, &i), quoted);
        } else if (isdigit(buff[i])) {
            result = addToList(result, parseNumber(buff, &i), quoted);
        } else if (buff[i] == '(') {
            result = addToList(result, makeListAtom(stringToList(buff+i)), quoted);
            i = i+t+1;
        } else if (buff[i] == ')') {
            t = i;
            return result;
        }
        quoted = false;
    }
    return result;
}

As we would while tokenizing any language, white space characters are discarded (they're useful for human readers, but not needed for computers). Whitespace characters include blank spaces, tabs, as well as new line and carriage return characters.

bool shouldSkip(char c) {
    return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

Next, we need to be able to identify acceptable symbols for identifiers, including primitives like + and /. Compared to most languages Scheme has a very small set special characters, small enough to fit in a single (ugly) boolean expression.

bool isSpecialChar(char c) {
    return (c == '-' || c == '?' || c == '+' || c == '!' || c == '*' || c == '/' || c == '\'' || c == '&' || c == '%' || c == ':');
}

As mentioned above, from the textual context the only things we need to recognize are lists, symbols, and numbers. Symbols and numebrs are tokenized in the usual way: Notice how the index variable is passed by reference so our position in the input string is reflected back in the calling procedure.

Atom* parseSymbol(char buff[], int* i) {
    int k = *i;
    while (isalpha(buff[k]) || isSpecialChar(buff[k]))
        k++;
    int len = k - *i;
    char* sub = strndup(buff+*i, len);
    *i += len;
    return makeSymbolAtom(makeString(sub, len));
}

Atom* parseNumber(char buff[], int* i) {
    int val = 0, j = *i;
    while (isdigit(buff[j])) {
        val = 10 * val + (buff[j++]-'0');
    }
    *i=j;
    return makeIntAtom(val);
}

When constructing the list, the one other thing we need to pay attention to is whether values have been quoted or not, as we need to preserve such information.

List* addToList(List* addTo, Atom* item, bool isquoted) {
    if (isquoted) {
        List* nl = createList();
        nl = appendList(nl, makeSymbolAtom(makeString("'", 1)));
        nl = appendList(nl, item);
        item = makeListAtom(nl);
    }
    addTo = appendList(addTo, item);
    return addTo;
}

Once we have our list, it is up to Eval in conjunction with the runtime environment to determine the meaning of the various symbols, as covered in my previous post on the eval/apply cycle.

Thats all I've got for today, Happy Hacking!

MaxGCoding.com

For the love of programming

Parsing Lisp: From Data To Code

Atoms and Lists

From Strings to Lists

Leave A Comment

The Aho, Sethi, Ullman Direct DFA Construction Part 2: Building the DFA from Followpos

The Aho, Sethi, Ullman Direct DFA Construction, Part 1: Constructing the Followpos Table

Procedural Map Generation with Binary Space Partitioning

Exact Match String Searching: The Knuth-Morris-Pratt Algorithm

The visitor pattern: OOP takes on the Expression Problem

Improving mgcLisp's define syntax

Separating the Men from the Boys, Knuth Style

Reducing rotations during deletion from AVL trees

Parsing Lisp: From Data To Code

Swiz Heaps: A deterministic modification of Randomized Meldable Heaps