Compiling Regular Expressions for "The VM Approach"
I've mentioned before that when it comes to implementing regular expression matching, the conversation as it appears in the literature tends to begin with NFAs, give a quick rundown of Thompson's construction, and end at DFAs, with a short tour of table-size reduction along the way. You would be excused for concluding that the state of the art stagnated sometime in the early 1970s. Despite first appearances, this is not actually the case, and any programmer with an interest in regular expression engines and their implementation has almost certainly come across the phenomenal series of articles written by Russ Cox on the subject.
The first of what turned into a multipart series of articles is titled "Regular Expression Matching Can Be Simple and Fast", and it is a terrific introduction to NFA-based regular expression matching. In particular, the article's treatment of the power set algorithm first documented by Thompson for simulating an NFA in linear time is very insightful, especially in how it relates to the catastrophic backtracking problem inherent in many regex engines.
The focus of today's post, however, is the second article in his series, "Regular Expression Matching: The VM Approach". In this article Cox details a method of encoding the NFA as instructions to be executed by a virtual machine, instead of as a directed graph of states and transitions to be traversed. Alas, the article focuses mainly on the virtual machine side of things and its various implementations, leaving the process of compiling programs for said virtual machine largely unexplained.
Being interested in both regular expressions and compilers, I have decided to (hopefully) explain the process of converting a regular expression from a string into a sequence of instructions to be executed by the virtual machines described in Russ Cox's article. Well, let's get to it.
NFAs as Virtual Machines: The Instruction Set
Part of what makes this method both fascinating and fast is the small yet thoughtful instruction set used to control the virtual machine. The basic implementation has five instructions: ANY, CHAR, SPLIT, JUMP, and MATCH.
Note: you could actually get by with four instructions instead of five by handling both ANY and CHAR with just the CHAR instruction. Additionally, I add a HALT instruction (not shown), which stops all execution, though this is strictly for convenience and is not necessary.
When I said the compilation process was left largely unexplained in the original article, I wasn't implying that it was totally unexplained. Most important for our purposes is a table depicting the correct instruction sequencing for the common regular expression operators.
a           char a

e1e2        codes for e1
            codes for e2

e1|e2           split L1, L2
            L1: codes for e1
                jmp L3
            L2: codes for e2
            L3:

e?              split L1, L2
            L1: codes for e
            L2:

e*          L1: split L2, L3
            L2: codes for e
                jmp L1
            L3:

e+          L1: codes for e
                split L1, L3
            L3:
In addition to the above table of instruction sequences, we are also given an example of the instructions the regular expression a+b+ should compile down to:
0 char a
1 split 0, 2
2 char b
3 split 2, 4
4 match
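To get a feel for what these five instructions actually do before we start compiling, here is a minimal, self-contained sketch of a recursive backtracking VM executing the a+b+ program above. This is not the article's actual VM; the Inst record, run(), and match() names are my own, and the semantics (match a prefix of the input, SPLIT tries its first branch before falling back to the second) are an assumption based on the article's description.

```java
import java.util.List;

public class MiniVM {
    // op is the instruction name; CHAR uses x as the character to match,
    // SPLIT uses x and y as branch targets, JMP uses x as its target.
    record Inst(String op, int x, int y) {}

    // The compiled a+b+ program from the article's example.
    static final List<Inst> PROGRAM = List.of(
        new Inst("CHAR", 'a', 0),   // 0 char a
        new Inst("SPLIT", 0, 2),    // 1 split 0, 2
        new Inst("CHAR", 'b', 0),   // 2 char b
        new Inst("SPLIT", 2, 4),    // 3 split 2, 4
        new Inst("MATCH", 0, 0)     // 4 match
    );

    // Execute prog starting at instruction pc with string position sp,
    // using recursion to explore both branches of a SPLIT.
    static boolean run(List<Inst> prog, String s, int pc, int sp) {
        while (true) {
            Inst in = prog.get(pc);
            switch (in.op()) {
                case "CHAR" -> {
                    if (sp >= s.length() || s.charAt(sp) != (char) in.x()) return false;
                    pc++; sp++;
                }
                case "JMP" -> pc = in.x();
                case "SPLIT" -> {
                    if (run(prog, s, in.x(), sp)) return true; // try first branch
                    pc = in.y();                               // fall back to second
                }
                case "MATCH" -> { return true; }
                default -> { return false; }
            }
        }
    }

    static boolean match(String s) { return run(PROGRAM, s, 0, 0); }
}
```

Tracing match("ab") by hand: CHAR a consumes the 'a', SPLIT 0, 2 tries instruction 0 again (which fails on 'b') and falls back to instruction 2, CHAR b consumes the 'b', and SPLIT 2, 4 eventually reaches MATCH.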
Like I said, not exactly a treasure trove of information, but certainly enough to get us started. With that in mind, let's shift gears towards implementing the actual instruction objects.
package com.maxgcoding.regex.vm;
public abstract class VMInstruction {
abstract public Integer getNext();
abstract public Integer getAlternate();
abstract public VMInstruction setNext(Integer i);
abstract public VMInstruction setAlternate(Integer i);
abstract public Boolean canMatch(Character ch);
}
Each instruction is represented by a VMInstruction object subclassed by its type, as shown above. Regardless of type, each instruction has pointers to the next instruction(s) and a canMatch() method. I'm using array indices instead of actual object references, as pointers don't behave in Java the way they do in C, and the indices prove a valuable asset when visually tracing execution while debugging.
package com.maxgcoding.regex.vm.inst;
import com.maxgcoding.regex.vm.VMInstruction;
import lombok.NoArgsConstructor;
import lombok.experimental.Accessors;
@NoArgsConstructor
@Accessors(chain=true)
public class CharInstruction extends VMInstruction {
private Integer next;
private Integer alternate;
private Character operand;
@Override
public Integer getNext() {
return next;
}
@Override
public Boolean canMatch(Character ch) {
return operand.equals(ch);
}
@Override
public Integer getAlternate() {
throw new UnsupportedOperationException("Char instructions don't have alternates");
}
@Override
public VMInstruction setNext(Integer i) {
next = i;
return this;
}
@Override
public VMInstruction setAlternate(Integer i) {
throw new UnsupportedOperationException("Char instructions don't have alternates");
}
public VMInstruction setOperand(Character ch) {
this.operand = ch;
return this;
}
@Override
public String toString() {
return "[ CHAR " + operand + " " + next + " ]";
}
}
None of the instructions use every field; most use only one. Fields not used by the current instruction throw an UnsupportedOperationException if accessed. The JUMP instruction uses only the next field, while the CHAR instruction uses the next field along with the operand field and canMatch() method.
package com.maxgcoding.regex.vm.inst;
import com.maxgcoding.regex.vm.VMInstruction;
import lombok.NoArgsConstructor;
import lombok.experimental.Accessors;
@NoArgsConstructor
@Accessors(chain=true)
public class JumpInstruction extends VMInstruction {
private Integer next;
private Integer alternate;
@Override
public Integer getNext() {
return next;
}
@Override
public Integer getAlternate() {
throw new UnsupportedOperationException("Jump instructions don't have alternates");
}
@Override
public VMInstruction setNext(Integer i) {
next = i;
return this;
}
@Override
public VMInstruction setAlternate(Integer i) {
throw new UnsupportedOperationException("Jump instructions don't have alternates");
}
@Override
public Boolean canMatch(Character ch) {
return true;
}
@Override
public String toString() {
return "[ JUMP " + next + " - ]";
}
}
The SPLIT instruction uses both the next and alternate fields (obviously), and does not need the canMatch() method, as there is nothing to match against.
package com.maxgcoding.regex.vm.inst;
import com.maxgcoding.regex.vm.VMInstruction;
import lombok.NoArgsConstructor;
import lombok.experimental.Accessors;
@NoArgsConstructor
@Accessors(chain=true)
public class SplitInstruction extends VMInstruction {
private Integer next;
private Integer alternate;
@Override
public Integer getNext() {
return next;
}
@Override
public Integer getAlternate() {
return alternate;
}
@Override
public VMInstruction setNext(Integer i) {
next = i;
return this;
}
@Override
public VMInstruction setAlternate(Integer i) {
alternate = i;
return this;
}
@Override
public Boolean canMatch(Character ch) {
throw new UnsupportedOperationException("What are you trying to match on a SPLIT?");
}
@Override
public String toString() {
return "[ SPLIT " + next + " " + (alternate == null ? "-":alternate) + " ]";
}
}
And at the risk of sounding obvious, the only method MATCH needs is canMatch().
package com.maxgcoding.regex.vm.inst;
import com.maxgcoding.regex.vm.VMInstruction;
import lombok.NoArgsConstructor;
import lombok.experimental.Accessors;
@NoArgsConstructor
@Accessors(chain=true)
public class MatchInstruction extends VMInstruction {
private Integer next;
private Integer alternate;
private String operand;
@Override
public Boolean canMatch(Character ch) {
return true;
}
@Override
public Integer getNext() {
throw new UnsupportedOperationException("Not supported on MATCH");
}
@Override
public Integer getAlternate() {
throw new UnsupportedOperationException("Not supported on MATCH");
}
@Override
public VMInstruction setNext(Integer i) {
throw new UnsupportedOperationException("Not supported on MATCH");
}
@Override
public VMInstruction setAlternate(Integer i) {
throw new UnsupportedOperationException("Not supported on MATCH");
}
@Override
public String toString() {
return "[ MATCH ]";
}
}
Phew. OK, let's get to compiling.
The Compiler
Finally, we're ready for the main event, and it should come as a surprise to nobody that we will once again find ourselves performing a depth-first traversal over an abstract syntax tree. You DO have the regular expression parsed into an abstract syntax tree, right? When working with compilers and interpreters, it kind of comes with the territory.
We'll be doing two traversals during the course of our compilation: the first to count how many instructions will be needed, and the second to generate the instructions. Each instruction is stored in an array as it is emitted, with the instruction pointer (ip) tracking the current position in the array.
package com.maxgcoding.compile;
import com.maxgcoding.parse.Node;
import com.maxgcoding.vm.InstType;
import com.maxgcoding.vm.VMInstruction;
public class ByteCodeCompiler {
private VMInstruction[] code;
private int ip;
If we look back at our instruction set, we know that each instruction takes only one array slot, with each operator needing at most three. This lets us make a good estimate of how many instructions our program will require, so we can allocate an optimally sized array. Because we know how many times we would call emit() for a given AST, we can determine how much space is needed by doing a depth-first traversal of the tree:
private int countInstructions(AST curr) {
if (curr == null)
return 0;
int numEmits = 0;
switch (curr) {
case CharClassNode _ -> numEmits = 1;
case LiteralNode _ -> numEmits = 1;
case OperatorNode node -> { numEmits = countEmits(node); }
default -> { }
}
return numEmits;
}
private int countEmits(OperatorNode opnode) {
int numEmits = 0;
switch (opnode) {
case ConcatNode node -> numEmits = countInstructions(node.getLeft()) + countInstructions(node.getRight());
case OrNode _, StarClosureNode _ -> numEmits = 2 + countInstructions(opnode.getLeft()) + countInstructions(opnode.getRight());
case PlusClosureNode _ , QuestClosureNode _ -> numEmits = 1 + countInstructions(opnode.getLeft()) + countInstructions(opnode.getRight());
default -> { }
}
return numEmits;
}
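As a quick sanity check, we can run this counting logic by hand on the a+b+ example from earlier:

```
a+b+:  literal 'a' → 1,  literal 'b' → 1,  each '+' closure → 1
       countInstructions = 4, plus the final MATCH appended later = 5
```

Five slots, which matches the five-instruction listing from the original article.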
We can also use a method like the following to dynamically grow the array if, for some reason, our estimate was too small to hold all of the needed instructions.
private void grow(int newSize) {
VMInstruction[] tmp = code;
code = new VMInstruction[newSize];
System.arraycopy(tmp, 0, code, 0, tmp.length); // copy the old instructions over
}
Now that we know how many instructions we will be generating, let's get to generating them.
Emitting Instructions
Our instruction set contains a lot of forward references, and as such, compiling our AST into it makes prodigious use of back patching. Back patching is a technique I've previously covered in my posts on compiling Pascal statements to P-Code. It's used during code generation to resolve forward references whose target address may not be available at the time you need it.
private void emit(VMInstruction inst) {
code[ip++] = inst;
}
private int skipEmit(int numplace) {
ip += numplace;
return ip;
}
private void skipTo(int oldpos) {
ip = oldpos;
}
The skipEmit(int numplace) method helps us with this: it moves the instruction pointer forward by the provided number of spaces, returning the new instruction pointer as its result. skipTo(int oldpos), on the other hand, sets the instruction pointer to whatever address is provided.
public VMInstruction[] compile(AST node) {
init(countInstructions(node)+1); // + 1 for MATCH instruction
build(node);
emit(new MatchInstruction());
return code;
}
private void handleOperators(OperatorNode operatorNode) {
switch (operatorNode) {
case OrNode node -> handleOrOperator(node);
case StarClosureNode node -> handleKleeneOp(node);
case PlusClosureNode node -> handleAtLeastOnce(node);
case QuestClosureNode node -> handleZeroOrOnce(node);
case ConcatNode node -> {
build(node.getLeft());
build(node.getRight());
}
default -> { }
}
}
private void build(AST curr) {
switch (curr) {
case ConcatNode node -> handleOperators(node);
case OrNode node -> handleOperators(node);
case QuestClosureNode node -> handleOperators(node);
case PlusClosureNode node -> handleOperators(node);
case StarClosureNode node -> handleOperators(node);
case LiteralNode node -> handleLiteral(node);
case CharClassNode node -> handleCCL(node);
default -> { }
}
}
At each node of the tree we dispatch a procedure based on the node type to emit the appropriate instructions for the operator or literal the node represents. Once we have finished traversing the AST, we emit a final MATCH instruction and return the array containing the compiled instructions.
Generating Instruction Sequences
Using our handy chart from above detailing the instruction sequences for our regular expression operators, we can start implementing the procedures to emit them. The first is obviously the simplest: the CHAR instruction for handling literals.
a           char a
private void handleLiteral(LiteralNode node) {
if (node.getData().equals('.')) {
emit(new AnyInstruction().setNext(ip+1));
} else {
emit(new CharInstruction().setOperand(node.getData()).setNext(ip+1));
}
}
When I said the previous was the easiest, I was actually lying: concatenation is the easiest, as we simply call the build() procedure recursively on each of its children.
e1e2        codes for e1
            codes for e2
private void handleConcat(Node node) {
build(node.getLeft());
build(node.getRight());
}
With the simple stuff out of the way, we can start on the more involved operators, for which we'll make use of back patching.
Back Patching In Action
Our first introduction to back patching forward references will be gentle, as we will use it to handle the '?' operator, which matches a character appearing zero or one time. We will implement both the eager and lazy versions of each closure operator at the same time, as in this format the only thing we need to do for the lazy version is swap the next and alternate values.
e?              split L1, L2
            L1: codes for e
            L2:
If you're unfamiliar with how to read these instruction sequence diagrams, I'll give a quick breakdown. On the left is the example expression, and on the right is the instruction sequence that should be generated for it. In the above example we see that we should first generate a SPLIT instruction and then the instructions for 'e'.
L1 and L2 are labels to be used as "addresses" for our instruction pointer. L1 points to the codes that should be emitted for the 'e' portion of 'e?', with L2 being the position immediately after it. We can read 'split L1, L2' as "create two branches of execution, one continuing from the code starting at L1, the other continuing from the code starting at L2".
private void handleZeroOrOnce(QuestClosureNode node) {
int splitPos = skipEmit(0);
int L1 = skipEmit(1);
build(node.getLeft());
int L2 = skipEmit(0);
skipTo(splitPos);
if (node.isLazy()) {
emit(new SplitInstruction().setNext(L2).setAlternate(L1));
} else {
emit(new SplitInstruction().setNext(L1).setAlternate(L2));
}
skipTo(L2);
}
You probably noticed the calls to skipEmit() with '0' as the parameter, so allow me to explain. Calling skipEmit() with a value of 0 simply returns the current instruction pointer address. We do this to save an address before calling skipEmit() with an offset, so that we can return to the skipped position and fill in the missing instructions once we know what addresses the forward references occupy. This way we can generate the code on the far side of the "forward" reference first, and then go back and actually create the reference. This is the beauty of back patching.
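Tracing handleZeroOrOnce() by hand on the single-character pattern a? (greedy variant) gives the following program, with the final MATCH appended by compile():

```
0 split 1, 2
1 char a
2 match
```

The SPLIT at position 0 is the instruction we skipped over and then back patched once we knew where L2 landed.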
e+          L1: codes for e
                split L1, L3
            L3:
private void handleAtLeastOnce(PlusClosureNode node) {
int L1 = skipEmit(0);
build(node.getLeft());
int L3 = skipEmit(0)+1;
if (node.isLazy()) {
emit(new SplitInstruction().setNext(L3).setAlternate(L1));
} else {
emit(new SplitInstruction().setNext(L1).setAlternate(L3));
}
}
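Hand-tracing handleAtLeastOnce() on the pattern a+ (greedy) reproduces exactly the shape of the a+b+ example from the original article:

```
0 char a
1 split 0, 2
2 match
```

No back patching is needed here, since by the time we emit the SPLIT, both of its targets are already known.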
The difference between the Kleene closure (*) and the "at least once" closure (+) is mostly the order in which we emit the instructions. The Kleene closure can match zero repetitions, so we need a way for execution to skip matching 'e' entirely: we split first, with the alternate pointing past the closure's instruction block. Repetition is accomplished by jumping back to the SPLIT instruction.
e*          L1: split L2, L3
            L2: codes for e
                jmp L1
            L3:
private void handleKleeneOp(StarClosureNode node) {
int L1 = skipEmit(0);
int L2 = skipEmit(1);
build(node.getLeft());
int jumpPos = skipEmit(0);
int L3 = skipEmit(1);
skipTo(L1);
if (node.isLazy()) {
emit(new SplitInstruction().setNext(L3).setAlternate(L2));
} else {
emit(new SplitInstruction().setNext(L2).setAlternate(L3));
}
skipTo(jumpPos);
emit(new JumpInstruction().setNext(L1));
}
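A hand trace of handleKleeneOp() on a* (greedy) gives:

```
0 split 1, 3
1 char a
2 jmp 0
3 match
```

Here both the SPLIT at 0 and the JMP at 2 were skipped during the first pass and back patched once L3 was known.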
And now for the complicated one: alternation. Interestingly, the sequence of calls to skipEmit()/emit()/skipTo() is very similar to that needed for translating an if statement to P-Code. When you step back and think about it, this makes perfect sense, as the "or" operator is essentially saying "if this pattern doesn't match, then try this other one".
e1|e2           split L1, L2
            L1: codes for e1
                jmp L3
            L2: codes for e2
            L3:
private void handleOrOperator(OrNode node) {
int splitPos = skipEmit(0);
int L1 = skipEmit(1);
build(node.getLeft());
int jumpPos = skipEmit(0);
int L2 = skipEmit(1);
build(node.getRight());
int L3 = skipEmit(0);
skipTo(jumpPos);
emit(new JumpInstruction().setNext(L3));
skipTo(splitPos);
emit(new SplitInstruction().setNext(L1).setAlternate(L2));
skipTo(L3);
}
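Hand-tracing handleOrOperator() on a|b gives:

```
0 split 1, 3
1 char a
2 jmp 4
3 char b
4 match
```

Both forward references are resolved at the end: the JMP at 2 is patched to point past the right alternative, and the SPLIT at 0 to the start of each branch.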
I'm not going to cover character classes or anchored matches, as those are easy enough to adapt from what we have already implemented. Let's whip up some tests and see how it does.
Testing It Out
public final class App {
public static void main(String[] args) {
App app = new App();
app.vmexample("(a*b|a+c)d", List.of("aaaabd", "abd", "aaaacd", "acd", "bd", "cd"));
}
private void vmexample(String pattern, List<String> testStrings) {
Parser p = new Parser();
ByteCodeCompiler c = new ByteCodeCompiler();
Node ast = p.parse(pattern);
Traversal.traverse(ast);
VirtualMachine vm = new VirtualMachine(c.compile(ast), 0);
for (String str : testStrings) {
if (vm.match(str)) {
System.out.println("Match found!");
} else {
System.out.println("No match :(");
}
}
}
private App() {
}
}
And the output:
Checking aaaabd with '(a*b|a+c)d':
Executing 0: [SPLIT 1 6]
Executing 1: [SPLIT 2 4]
Executing 2: [CHAR a ]
Executing 3: [JMP 1 null]
Executing 1: [SPLIT 2 4]
Executing 2: [CHAR a ]
Executing 3: [JMP 1 null]
Executing 1: [SPLIT 2 4]
Executing 2: [CHAR a ]
Executing 3: [JMP 1 null]
Executing 1: [SPLIT 2 4]
Executing 2: [CHAR a ]
Executing 3: [JMP 1 null]
Executing 1: [SPLIT 2 4]
Executing 2: [CHAR a ]
Executing 4: [CHAR b ]
Executing 5: [JMP 9 null]
Executing 9: [CHAR d ]
Executing 10: [MATCH null null]
Match found!
Checking abd with '(a*b|a+c)d':
Executing 0: [SPLIT 1 6]
Executing 1: [SPLIT 2 4]
Executing 2: [CHAR a ]
Executing 3: [JMP 1 null]
Executing 1: [SPLIT 2 4]
Executing 2: [CHAR a ]
Executing 4: [CHAR b ]
Executing 5: [JMP 9 null]
Executing 9: [CHAR d ]
Executing 10: [MATCH null null]
Match found!
Checking aaaacd with '(a*b|a+c)d':
Executing 0: [SPLIT 1 6]
Executing 1: [SPLIT 2 4]
Executing 2: [CHAR a ]
Executing 3: [JMP 1 null]
Executing 1: [SPLIT 2 4]
Executing 2: [CHAR a ]
Executing 3: [JMP 1 null]
Executing 1: [SPLIT 2 4]
Executing 2: [CHAR a ]
Executing 3: [JMP 1 null]
Executing 1: [SPLIT 2 4]
Executing 2: [CHAR a ]
Executing 3: [JMP 1 null]
Executing 1: [SPLIT 2 4]
Executing 2: [CHAR a ]
Executing 4: [CHAR b ]
Executing 4: [CHAR b ]
Executing 4: [CHAR b ]
Executing 4: [CHAR b ]
Executing 4: [CHAR b ]
Executing 6: [CHAR a ]
Executing 7: [SPLIT 6 8]
Executing 6: [CHAR a ]
Executing 7: [SPLIT 6 8]
Executing 6: [CHAR a ]
Executing 7: [SPLIT 6 8]
Executing 6: [CHAR a ]
Executing 7: [SPLIT 6 8]
Executing 6: [CHAR a ]
Executing 8: [CHAR c ]
Executing 9: [CHAR d ]
Executing 10: [MATCH null null]
Match found!
Checking acd with '(a*b|a+c)d':
Executing 0: [SPLIT 1 6]
Executing 1: [SPLIT 2 4]
Executing 2: [CHAR a ]
Executing 3: [JMP 1 null]
Executing 1: [SPLIT 2 4]
Executing 2: [CHAR a ]
Executing 4: [CHAR b ]
Executing 4: [CHAR b ]
Executing 6: [CHAR a ]
Executing 7: [SPLIT 6 8]
Executing 6: [CHAR a ]
Executing 8: [CHAR c ]
Executing 9: [CHAR d ]
Executing 10: [MATCH null null]
Match found!
Checking bd with '(a*b|a+c)d':
Executing 0: [SPLIT 1 6]
Executing 1: [SPLIT 2 4]
Executing 2: [CHAR a ]
Executing 4: [CHAR b ]
Executing 5: [JMP 9 null]
Executing 9: [CHAR d ]
Executing 10: [MATCH null null]
Match found!
Checking cd with '(a*b|a+c)d':
Executing 0: [SPLIT 1 6]
Executing 1: [SPLIT 2 4]
Executing 2: [CHAR a ]
Executing 4: [CHAR b ]
Executing 6: [CHAR a ]
No match :(
So there you have it: compiling regular expressions for matching with the "VM approach". As usual, all code is available on my GitHub, linked below. Until next time, Happy Hacking!
Further Reading
- A Quick tour of MGCLex
- Compiling Regular Expressions for "The VM Approach"
- Composable Linked Digraphs: An efficient NFA Data Structure for Thompsons Construction
- Improving the Space Efficiency of Suffix Arrays
- Augmenting B+ Trees For Order Statistics
- Top-Down AST Construction of Regular Expressions with Recursive Descent
- Balanced Deletion for in-memory B+ Trees
- Building an AST from a Regular Expression Bottom-up
- The Aho, Sethi, Ullman Direct DFA Construction Part 2: Building the DFA from the Followpos Table
- The Aho, Sethi, Ullman Direct DFA Construction, Part 1: Constructing the Followpos Table