Capture Groups: Tracking Regular Expression Sub Matches
Nine out of ten times when we reach for regular expressions its because we want to simply know "does this text contain this pattern?". A simple boolean expression: yes or no. Sometimes we want to know the position of the entire match, as in lexical analysis. Non Deterministic Finite Automata (NFA) are well suited to this task and their application is well studied and documented.
There are times however when we want to know more: how did the pattern match text? Which part of the expression matched which part of the string? To answer those questions we need something a bit more powerful than a run-of-the-mill NFA, and there are a couple of ways one could proceed. One such method is to use a "tagged" nfa. Tagged NFA's augment transitions or states (depending on implementation) with a "tag" which functions to record the current position in the string being matched when that tagged transition or state is processed. TNFA's are much, much less common than their untagged counterparts and so the body of existing knowledge is far less than for regular NFAs.
In todays post I'm going to go through adding positional tags to NFA's using thompsons construction as well as how to efficiently simulate their execution so as to track submatch boundaries without resorting to backtracking.
What is a capture group, anyway?
Complex regular expressions often have multiple sub-matches, these sub-matches when comprised of a parenthsized regular expression is called a capture group. The only exception to this is capture group 0, which is th entire expression, whether explicitly parenthesized or not.
Looking at the following expression, we can see how it is broken down into it's individual capture groups. With the addition of capture groups our goal of simulating the nfa is now to not only return whether a string matches the provided pattern, but also which parts of the matched string belongs to each capture group.
(r([aeiou]+)(m|n)(d|a)) - Group 0
([aeiou]+) - Group 1
(m|n) - Group 2
(d|a) - Group 3
If we were to match the word "reindeer" against our pattern it would match the following capture groups:
reind - Group 0
ei - Group 1
n - Group 2
d - Group 3
In order to extract the capture groups from a match we're going to need a method for tracking when we have "entered" and "exited" each capture group. This should be simple enough, as it directly corresponds to a left and right parentheses, respectively.
Normally that is exactly the type of "concrete" syntax which is left out of an abstract syntax tree when parsing an expression, so that is the first place we should start our modifications: adding parens (at least logically) back in to our AST structure.
Adding Capture Nodes to the AST
In addition to the operator nodes, character literal nodes, and character class nodes used to for constructing ASTs previously, we'll be adding a capture group type. Adding capture group nodes to our AST is easy enough, as they become the root node of a parenthesized expression.
re_ast_t* make_capture_group_node(int group) {
re_ast_t* node = make_ast_node(CAPGRP);
node->group = group;
return node;
}
re_ast_t* factor(ParseStr_t* str) {
re_ast_t* node;
if (expect(str, '(')) {
match(str, '(');
node = make_capture_group_node(++capGrp);
node->left = expr(str);
match(str, ')');
} else if (is_literarl(lookahead(str))) {
/* etc etc */
}
return node;
}
This also means the overal root of every AST will be a capture group node in order to represent capture group 0 - the whole expression. Now that we will know where each capture group is in our AST, we can move forward with adding that info to the NFA.
Tagging an NFA's Expression Bounds
Labeling our NFA with positional tags relies on a few observations with how we tie regular expressions to AST's, and extrapolating from those observations:
- We know as we traverse our AST that when we arive at a capture group node, that it is equivelant to encountering the left parentheses of an expression.
- Likewise, as the traversal continues and we have finished processing the expression rooted at the node, it is equivelant to enocuntering the right parentheses.
- Having compiled the equivelant NFA for the expression rooted at that node, we are left with the start and accept states of said NFA.
- It then follows that the start state is now equivelant to the left paren of said expression, and the accept state is now equivelant to the rightparen.
If we add a way to distinguish those states during construction so that during traversal of the machine we know when we have crossed those particular states, then we would also habe a way of knowomg where in the input string we are when we crossed those states. In this way we can track the sub-string boundaries for capture groups of matches.
We can do precisely this by adding an "is_tagged" flag to our state structure, along with the actual tag structure itsself.
typedef struct re_nfa_state_t_ {
State label;
bool is_tagged;
re_tag_t tag;
re_nfa_transition_t* trans[2];
} re_nfa_state_t;
We utilize an enum to differentiate whether the tag represents the start or end of a given capture group. the tag also holds the index of the capture group it belongs to. We will use this value later during the traversal to index into a dictionary of capture group boundaries. A state can bed tagged for more than capture group, and so each tag keeps a "next" pointer in the case it becomes a states tag list.
typedef enum TagType {
START, END
} TagType;
typedef struct re_nfa_tag_t {
TagType type;
int group;
struct re_nfa_tag_t* next;
} re_tag_t;
re_nfa_state_t* add_tag(re_nfa_state_t* state, TagType type, int group) {
state->is_tagged = true;
state->tag.group = group;
state->tag.type = type;
state->tag.next = &state->tag;
return state;
}
Having already modified our AST, tagging the actual NFA during compilation becomes deliciously simple: Upon encountering a capture group node, we compile the expression rooted at that node, and add a START tag to the compiled NFAs start state and an END tag to its accept state.
void compile(re_ast_t* ast, Stack* stack) {
if (ast == NULL)
return;
if (ast->type == LIT) {
push(stack, makeAtomicNFA(make_char_transition(NULL, ast->ch)));
} else if (ast->type == CHCLASS) {
push(stack, makeAtomicNFA(make_ccl_transition(NULL, ast->ccl)));
} else if (ast->type == CAPGRP) {
compile(ast->left, stack);
re_nfa_t* nfa = pop(stack);
push(stack, makeTaggedNFA(ast->group, nfa));
} else {
//compile operator NFA.s
}
}
re_nfa_t* makeTaggedNFA(int group, re_nfa_t* nfa) {
nfa->start = add_tag(nfa->start, START, group);
nfa->accept = add_tag(nfa->accept, END, group);
printf("Tagged group %d to states %d and %d\n", group, nfa->start->label, nfa->accept->label);
return nfa;
}
With our positional tags in place our modified NFA is complete and we can turn our attention to modifying the matching algorithm.
The Power set construction, with a touch of Ariadne
A new return value, containing the capture groups bounds
#define MAX_CAPTURE 10
typedef struct MatchContext {
bool did_match;
int num_groups;
MatchBounds groups[MAX_CAPTURE];
Thread* matched_path;
} MatchContext;
MatchContext* match_re(re_nfa_t* nfa, char* text);
Ariadne's thread
typedef struct Thread {
re_nfa_state_t* current;
struct Thread* previous;
int id;
char ch;
int pos;
} Thread;
Thread* makeThread(re_nfa_state_t* curr, Thread* prev, int pos, char ch) {
Thread* t = alloc_thread();
t->current = curr;
t->previous = prev;
t->id = -1;
t->ch = ch;
t->pos = pos;
return t;
}
Modifying move() to use the new thread structure
void move(OrderedSet* states, OrderedSet* next, MatchContext* cxt, char* text, int pos) {
char ch = text[pos];
printf("move(%d, %c)\n", pos, ch);
RBIterator it;
rb_iter_init(&it, states);
while (!rb_iter_done(&it)) {
Thread* th = ((Thread*)rb_iter_get(&it)->value);
for (int i = 0; i < 2; i++) {
re_nfa_transition_t* trans = th->current->trans[i];
if (trans != NULL && trans->is_epsilon == false) {
if ((trans->is_ccl && match_ccl(ch, trans)) || (trans->data[0] == ch || trans->data[0] == '.')) {
setAdd(next, makeThread(trans->dest, ((Thread*)rb_iter_get(&it)->value), pos, ch));
printf("\t%d ->(%c)-> %d \n", th->current->label, ch, trans->dest->label);
}
}
}
rb_iter_next(&it);
}
clearSet(states);
printf("----------\n");
}
The biggest change is to e_closure()
void e_closure(OrderedSet* states, OrderedSet* next, MatchContext* cxt, re_nfa_t* nfa, char* text, int pos) {
char ch = text[pos];
printf("e_closure(%d, %c)\n", pos, ch);
Stack ss;
initStack(&ss);
RBIterator it;
rb_iter_init(&it, states);
while (!rb_iter_done(&it)) {
push(&ss, (Thread*)rb_iter_get(&it)->value);
setAdd(next, (Thread*)rb_iter_get(&it)->value);
rb_iter_next(&it);
}
while (!empty(&ss)) {
Thread* curr_thr = ((Thread*)pop(&ss));
if (curr_thr->current->label == nfa->accept->label) {
cxt->did_match = true;
cxt->matched_path = curr_thr;
}
for (int i = 1; i >= 0; i--) {
re_nfa_transition_t* trans = curr_thr->current->trans[i];
if (trans != NULL && trans->is_epsilon) {
Thread* t = makeThread(trans->dest, curr_thr, pos, ch);
if (!setContains(next, t)) {
setAdd(next, t);
push(&ss,t);
printf("\t%d ->(%s)-> %d \n", curr_thr->current->label, trans->data, trans->dest->label);
}
}
}
}
clearSet(states);
printf("----------\n");
}
The main matching loop, with restarting on the empty set
MatchContext* match_re(re_nfa_t* nfa, char* text) {
next_th = 0;
MatchContext* cxt = makeContext();
OrderedSet states, next;
initSet(&states, &cmp_threads);
initSet(&next, &cmp_threads);
setAdd(&states, makeThread(nfa->start, NULL, 0, text[0]));
e_closure(&states,&next,cxt, nfa, text, 0);
for (int i = 0; text[i] != '\0'; i++) {
move(&next, &states, cxt, text, i);
e_closure(&states,&next,cxt, nfa, text, i);
if (setEmpty(&next)) {
clearSet(&states);
setAdd(&states, makeThread(nfa->start, NULL, i+1, text[i+1]));
e_closure(&states, &next, cxt, nfa, text, i+1);
}
}
if (cxt->did_match) {
updateWinner(cxt->matched_path, cxt, text);
printf("\n");
}
rb_destroy(&states);
rb_destroy(&next);
return cxt;
}
Extracting capture groups from the winning path:
void updateWinner(Thread* it, MatchContext* cxt, char* text) {
if (it != NULL) {
updateWinner(it->previous, cxt, text);
if (it->current->is_tagged) {
updateTagContext(cxt, it->current, it->pos, it->ch);
for (re_tag_t* t = it->current->tag; t != NULL; t = t->next)
printf("<%c%d>", t->type == START ? 'S':'E' ,t->group);
}
printf("{%d: %d, %c}", it->current == NULL ?-1:it->current->label, it->pos, it->ch);
}
}
et voila.
-
Capture Groups: Tracking Regular Expression Sub Matches
-
Resolving Shift/Reduce Conflicts With Operator Precedence
-
Squeezing DFAs with Pair Compression
-
Designing Abstract Syntax Trees: Homogenous vs. Heterogenous Node Structures
-
From LR Items to LR States
-
Calculating Follow Sets of a Context Free Grammar
-
Streaming Images from ESP32-CAM for viewing on a CYD-esp32
-
Constructing the Firsts Set of a Context Free Grammar
-
Inchworm Evaluation, Or Evaluating Prefix-Expressions with a Queue
-
Data Structures For Representing Context Free Grammar
Leave A Comment