Ternary Search Tries: String Specific Ordered Symbol Tables
It can be posited that strings are the most important data type in computing. Without character strings, all we have is a calculator. It is strings which allow us to compose documents, read and send emails, program in anything higher level than binary, and it is character strings which comprise the very information you are currently reading. So with such an important and central position in computing you would think alot of care has gone in to developing efficient data structures and algorithms to support their role. And there has been, but not quite as much as you would expect.
Many of the data structures and algorithms which are efficient for the other fundamental data types (integers, characters, etc.) begin to struggle when adapted for use on strings. The problem boils down to one concept that anyone sufficiently familiar with sorting should already be well aware of: key comparisons. When using strings as they key for a symbol table, comparisons are very expensive. Why is this?
Why Comparing Strings Costs More
Character strings are an aggregate data type. As the name would suggest they are comprised of multiple characters, and so when it comes to comparing them, more work need be done than when comparing integers. When we compare two integers using operator< the bit values of those two integers are tested.
int less(int a, int b) {
return a < b;
}
However, when we do the same with strings, something else happens behind the scenes. To determine if one string is "less" than another(lexicographically speaking) we compare their individual characters, repeating the process until we either consume both keys, or find the point at which they differ. For std::string, operator< is overloaded to call strcmp() on the underlying character arrays used to store the strings.
bool less(const string& a, const string& b) {
return a < b;
}
bool operator<(const std::string& a, const std::string& b) {
return strcmp(a.c_str(), b.c_str()) < 0;
}
int strcmp(char *a, char* b) {
while (*a != '\0' && *b != '\0')
if (*a++ != *b++)
break;
return a - b;
}
Now, the above code isnt the exact code used by the standard library, but for our purposes it's close enough. strcmp works by comparing each character against each other so long as they are equal, and then returns the difference in lengths at which they stopped matching. If the returned value is:
- less than 0, string a is "less" than string b.
- greater than 0, string a is "greater" than string b.
-equal than, well, the strings are equal... I thought that would be obvious.
Clearly If our data structure is based on key comparisons, then it will have much different performance characteristics with strings than for integers, as an example. We can see this with data structures such as RedBlack Trees. It is for this reason that hashtables are the perferred symbol table for applications with string based keys.
You might be wondering that If the answer to our problems is hash tables, than what are we even doing? Well thats thing. Hash tables are a answer, but they are not the answer. Ever seen a sorted hash table? me neither. Symbol tables sometimes need to support operations beyond just insert and find, and if for example we need our data in sorted order hash tables are suddenly far from ideal. "Tries!" is probably you're next response, but if we have millions, or billions of keys that O(R) space requirement per-node suddenly seems even less appealing than it already did.
And that lead's us to the data structure we'll be implementing today.
Ternary Search Tries
TST's hold a unique place amongst symbol table data structures, blending the best properties of prefix tries and binary search tree's while rivaling the speed of hashtables for string searching. Being a linked structure, ternary search tree's are made of nodes with links to three children, as the name implies.
template <class T>
struct Node {
char c;
T info;
Node* l;
Node* m;
Node* r;
Node(char s, T item) : c(s), info(item) {
l = m = r = nullptr;
}
};
One optimization often seen with TST's, is to have an array of root pointers, instead of a single root pointer. This way each character of our input alphabet get's it's own TST. While not required, it can lead to a reasonable increase in search and insertion times for practically no overhead and so I choose to implement it this way..
const int R = 256;
template <class T>
class TST {
private:
using node = Node<T>;
using link = node*;
T nullInfo;
link head[R];
link split(link curr, char splitChar, T info);
link put(link h, string key, T info, int d;
link get(link h, string key, int d);
void preorder(link h, string word);
public:
TST() {
for (int i = 0; i < R; i++)
head[i] = nullptr;
}
void insert(string key, T info);
T& find(string key);
void print();
};
Most time when introducing data structures the search algorithm is presented before the insertion algorithm, however due to a few quirks of how the structure is actually put together, i find it beneficial to present the insertion algorithm first. With this mind, let's turn our attention to building the trie.
Top-Down Insertion
Insertion in TST proceeds top down in a way similar to other prefix tries. The difference arrises when a character collision occur, as the branching strategy is then more closely related to binary search trees. The data to be associated to each key is kept in a special leaf node that doubles as an "end of string" sentinel. Despite being a leaf node, it can become an internal node if a key that contains an already existing key as a prefix is inserted in the tree. Leaf nodes are marked by having a '$' as their symbol.
As mentioned above, we keep an array of root nodes, so each letter of our input alphabet has it's own root TST.
void insert(string key, T info) {
head[key[0]] = put(head[key[0]], key, info, 0);
}
The string to be inserted is iterated character by chacter as it is inserted into the tree. If we have not reached the end of our key, but the current node is null, we create a new node for the current character of the key, and continue the insertion process.
If the current node is not null but, the current nodes symbol is a different character then the current characrter in the key, then we have encountered a character collision. We resolve collisions by branching as we would in a binary search tree: by comparing the current character in our key, to the current nodes character. If our character is lower in value than the current nodes we chode the left branch, otherwise choosing the right branch.
link put(link h, string key, T info, int d) {
char c = d < key.length() ? key[d]:'$';
if (h == nullptr)
h = new node(c, nullInfo);
if (d == key.length()) {
return makeLeaf(h, c, info);
}
if (h->c == '$') h->m = put(h->m, key, info, d);
else if (c < h->c) h->l = put(h->l, key, info, d);
else if (c > h->c) h->r = put(h->r, key, info, d);
else h->m = put(h->m, key, info, d+1);
return h;
}
Only If it our character is the same as the current node do we take the middle branch. This is also the only condition for which we advance our position in the key string.
Upon reaching the end of our string, we need to associate our satellite data with the key being inserted, which may require "splitting" an already existing key. The makeLeaf() procedure handles this process in a straightforward way. First, we check if we are simply updating an already existing key's data, in which we case we simply assign the data and return. The tricky bit comes in the event we are associating data to a key which also happens to be a prefix of an existing key. In this event, we
link split(link curr, char splitChar, T info) {
if (curr->c == '$') {
curr->info = info;
return curr;
}
return makeInternal(curr, splitChar, info);
}
link makeInternal(link curr, char splitChar, T info) {
link x = new node('$', info);
if (splitChar < curr->c) {
x->l = curr->l;
curr->l = x;
} else if (splitChar > curr->c) {
x->r = curr->r;
curr->r = x;
} else {
x->m = curr->m;
curr->m = x;
}
return curr;
}
With the ability to build our tree taken care of, we now turn our attention to searching and traversal in our TST.
Searching And Traversal
Searching proceeds very similarly to insertion, however without first covering insertion the notion of handling leaf nodes that become internal sentinel nodes for end of string marking would make little sense. Now that we understand how our '$' nodes behave and their purpose, the search algorithm requires little explaination.
link get(link h, string key, int d) {
if (h == nullptr)
return nullptr;
if (d == key.size())
return h;
char c = key[d];
if (c == h->c || h->c == '$') return get(h->m, key, d+1);
else if (c < h->c) return get(h->l, key, d);
else if (c > h->c) return get(h->r, key, d);
return h;
}
It is important to understand that any move along a left or right branch does not change our position in the search key. The only time we advance to the next character in our key string is when taking the middle branch.
void print(link h, string word) {
if (h != nullptr) {
if (h->c == '$') {
cout<<word<<" - "<<h->info<<endl;
}
print(h->l, word);
print(h->m, word + string(1, h->c));
print(h->r, word);
}
}
void print() {
for (int i = 0; i < R; i++) {
if (head[i] != nullptr) {
cout<<char(i)<<": "<<endl;
string word;
print(head[i], word);
}
}
cout<<endl;
}
-
Parsing Array Subscript Operators with Recursive Descent
-
Map & Filter in Scheme & C
-
Persistent Symbol Tables for Lexical Scoping
-
The Festival of 1 + n + f(n-1) Lights
-
The Heart of Pratt Parsing: Top-Down Operator Precedence
-
Compiling expressions to P-Code by AST Traversal
-
Ternary Search Tries: String Specific Ordered Symbol Tables
-
Digital Search Trees
-
Lossless Compression Using Huffman Coding Tries
-
The LZ77 Algorithm: Lossless Compression with a Sliding Window
Leave A Comment