Squeezing DFAs with Pair Compression

It's no secret that the price we pay for using a DFA in the process of lexical analysis is the (potentially) enourmous transition tables which must be managed. There are many ways of representing transition tables. Anyone who has peaked at an uncompressed transition matrix of even a small grammar can tell you that storing them as-is will waste alot of space, For ASCII, each state would need 256 entries, many of them blank. For unicode, well, theres 1.1 million possible unicode characters, so you do the math on that one.

Pair Compressed Rows

A table with pair compressed rows doesnt store empty entries, instead each row becomes, as the name would suggest, a list of pairs with an exception for the first entry, which is a singleton containing the number of  pairs in each row.

In this way, a row that in it's uncompressed form is represented like this:

int example_row[] = { 0, 0, 0, 4, 0, 0, 2, 0, 7, 0 };

Is transformed into the following representation:

int compressed_row[] = {3, 3,4, 6,2, 8,7};

The new array has the number of pairs as the first entry, followed by the actual list of pairs of (col, row). Over many rows, each being many columns wide this adds up to substantial savings. The original square 2d matrix instead becomes an array of pointers to different size arrays, which is why we reserve the first entry for storing the length. In languages where arrays store their length implicitly, this can be skipped.

Compressing an Existing Matrix

With the first entry of every row being the entry count, the first step is obtaining said entry count. In my implementation, the error state is 0, so we scan each row for non-zero entries, summing their total.

int entryCount(int matrix[][256], int i) {
    int count = 0;
    for (int j = 0; j < 255; j++) {
        if (matrix[i][j] != 0) {
            count++;
        }
    }
    return count;
}

Next, we allocate an array of the correct size for the compressed row, which will be twice the entry count plus one for the entry count itsself.

int** compressMatrix(int matrix[][256], int num_states) {
    int** cm = malloc(sizeof(int*)*num_states+1);
    for (int i = 1; i <= num_states; i++) {
        int ec = entryCount(matrix, i);
        if (ec == 0) {
            cm[i] = NULL;
        } else {
            int *row = malloc(sizeof(int)*(2*ec+1));
            row[0] = ec;
            int epos = 1;
            for (int j = 0; j < 256; j++) {
                if (matrix[i][j] != 0) {
                    row[epos++] = i;
                    row[epos++] = j;
                }
            }
            cm[i] = row;
        }
    }
    return cm;
}

Obtaining the next state in compressed tables

The eagle eyed among you may have noticed that we have essentially turned the matrix representation of a graph into its adjacency list representation, and this is totally accurate. However, one side effect of using this compressed format is that we can no longer obtain the next state with a simple array lookup anymore. Instead, we have to scan our list of pairs. As previously mentioned, the first item in each row is the number of pairs the list contains, the remaining spaces are the pairs themselves.

To find the next state, we could use could old sequential search where we start at position one, and go two-by-two down the list of pairs until we find a match or run out of pairs:

int Lexer::get_next(int state, char p) {
    if (mgc_lexer_matrix[state]) {
        for (int i = 1; i < 2*mgc_lexer_matrix[state][0]; i += 2) {
            if (mgc_lexer_matrix[state][i] == p)
                return mgc_lexer_matrix[state][i+1];
        }
    }
    return 0;
}

Since our lists will alway be short, we shouldnt pay so high a penalty to find the entry were looking for. If however, were using this DFA for lexing, where its used to look at every single character of input then we certainly still want to try and avoid having to accept O(N) performance when avoidable. 

Thankfully, in this case we don't have to: our lists are still naturally sorted lists, so we can use binary search to cut that search time down to O(log N) comparison per character to find the next state. It's not O(1), but its tolerable.

int find(int curr, char p) {
    int l = 1, r = 2*mgc_lexer_matrix[curr][0];
    while (l <= r) {
        int m = (l+r)/2;
        if (m % 2 == 0) m--;
        if (p < mgc_lexer_matrix[curr][m]) {
            r = m - 2;
        } else if (p > mgc_lexer_matrix[curr][m]) {
            l = m + 2;
        } else {
            return mgc_lexer_matrix[curr][m+1];
        }
    }
    return 0;
}

int Lexer::get_next(int state, char p) {
    if (compressed) {
        if (mgc_lexer_matrix[state] != NULL) {
            return find(state, p);   
        }
        return 0;
    }
    return mgc_lexer_matrix[state][p];
}

You'll notice we only have to make a few slight adaptations to our arrays schemea to be able to use binary search effectively. First, we know that our keys are only on the odd indices, so we need to make sure our calculated midpoint is an odd index. Next, when adjusting the bounds, we so by 2 instead of 1 so that we account for the key:value pair nature of our array.

And that's all there is to it! Happy hacking.


Leave A Comment