Exact Match String Searching: The Knuth-Morris-Pratt Algorithm
Searching for the occurence of one string within the body of a larger string is a process done millions of times, and is one of the central operations in text processing. It is performed as part of querying a search engine and whenever you press Ctrl + F to search for a document or page for a keyword, an exact match string search takes place. Despite its ubiquity there are only a handful of algorithms to choose from, and even less which are easily understood. To compound matters further, the problem is such that the classical brute force algorithm while possible to improve on, as we will show shortly, it certainly is not easy.
The Brute Force Approach
The following algorithm will search text for the occurence of pattern in time O(MN) where M is the length of the pattern and N is the length of the text. It is simple to understand, and in many situations runs in time closer to O(N + M). The difference in running times stems from the amount of times the search finds a partial match.
int bruteForceStringSearch(string pattern, string text) {
int i,j,k;
for (i = 0; i < text.length(); i++) {
for (j = 0, k = i; k < text.length() && pattern[j] == text[k]; j++, k++);
if (j == pattern.length())
return i;
}
return -1;
}
The reason for this inneficiency is because no matter how long the partial match was, the algorithm rewinds all the way back to the start position+1 to start over matching the pattern from the next character. Knowing this, then its safe to assume that If we could somehow avoid having to do that rewind phase, we could greatly improve our running time. This is the line of thought which drove several separate researchers to arrive at generally the same solution, despite different formulations of the original problem. That solution is what has come to be known as the Knuth-Morris-Prratt (KMP) string searching algorithm.
The Knuth-Morris-Pratt Algorithm
The KMP algorithm allows for string matching without backtracking by incorporating a pre-processing step into the algorithm. This processing constructs a DFA called the "failure function". The failure function is comprised of an array, which when indexed by the the current position in the pattern and compared to the character being examined will tell us which state to go to next to continue searching.
We are able to construt this failure function because of the cunning observation that despite not having the text to be searched, we do have access to the pattern. Why does this matter? Because any text which matches the pattern, has the same character transitions for a match case as the prefix of the pattern being searched for.
class KMP {
private:
string pattern;
vector<int> next;
public:
KMP(const string& pat) {
pattern = pat;
next = vector<int>(pattern.length(), 0);
next[0] = -1;
for (int i = 0, j = -1; i < pattern.length(); i++, j++, next[i] = j)
while ((j >= 0) && (pattern[i] != pattern[j]))
j = next[j];
}
int match(string text) {
int i, j;
for (i = 0, j = 0; i < text.length() && j < pattern.length(); i++, j++)
while ((j >= 0) && (text[i] != pattern[j]))
j = next[j];
return (j == pattern.length()) ? i-pattern.length():-1;
}
};
And thanks to the connection between the prefix of the pattern being compared and the text being searched, an algorithm which never needs to be back up is easily developed. Until next time, Happy Hacking!
-
Procedural Map Generation with Binary Space Partitioning
-
Exact Match String Searching: The Knuth-Morris-Pratt Algorithm
-
The visitor pattern: OOP takes on the Expression Problem
-
Improving mgcLisp's define syntax
-
Separating the Men from the Boys, Knuth Style
-
Reducing rotations during deletion from AVL trees
-
Parsing Lisp: From Data To Code
-
Swiz Heaps: A deterministic modification of Randomized Meldable Heaps
-
Let's talk Eval/Apply
-
BST Deletion: Removal By Merge
Leave A Comment