Simplifying the erase operation for B+ trees
When it comes to data structures - especially self balancing data structures - it is no secret that the algorithms for removing an entry are often many, many times more complex than the algorithms for adding a value. Anyone who's tried to wrap their noggin around balanced deletion in Red/Black tree knows exactly what I'm talking about. The situation for B trees - the family of data structures which Red/Black trees abstract - isn't much different. B+ Trees, despite sharing a common heritage with B Tree's present their own constraints for removing an element from the tree.
A Unique Challenge
Traditional B Trees which are structured in a fashion akin to a binarys earch tree with data stored in the same node as the key that references it. A B+ tree however, is an indexing data structure: Keys and Values stored separately. In this case, it can help to reason about B+ Trees by thinking of them as a combination of two different data structures instead of as one unified data structure. Using this model, we imagine the upper portion of the tree to serve as an index to guide us to the correct node of a linked list which serves as the leaf level of the tree.
A B+ trees upper levels index into the linked list at the leaf level
We can now think of removing a value from a B+ tree being performed in two steps. The first step is the actual removal of the referenced value from the linked list. Once that is accomplished, we then move to the second step of performing any rebuilding of the index which may have become invalid by removing the leaf value.
Removing an entry from the leaf level is easy enough, Once we find the correct leaf node, removing a value reduces to the case of deleting an entry from a sorted array:
template <class K, class V>
void BPNode<K,V>::eraseAt(int idx) {
for (int j = idx; j < n-1; j++) {
data[j] = data[j+1];
}
n--;
}
But what do we do about nodes that become completey empty? And what do we do about the upper levels of the tree? After all, being an indexed structure, other values may be depending on that value to drive a search to them, or the value that indexed to the item being erased is a related albeit different key. We can't orphan the other values in a node when we erase one value from it.
In todays post I'm going to cover two implementations of a deletion method for B+ trees. In both methods underflow is allowed to occur during deletion in leaf nodes. In the first method, no changes are propagated above the leaf level. The second method does make changes to the index, and is much closer to the proper "balanced" deletion for B+ trees, but does not perform any redistribution of keys from sibling nodes should a node drop below M/2 keys. Instead we simply remove nodes when they become completely empty, greatly simplifying its implementation.
I'm going to be building off the bottom up B+ tree code presented in my post on Memory Resident B+ Trees so if you haven't read that post, you might want to give it a read first to get up to speed. With that said, lets do it.
Erasing Values While Preserving the Integrity of the Index
Removing a value from the index starts at the root, and works towards the leaf where if the value exists it will be removed as was depicted in the previous section. Regardless of if the current node is a leaf or not, the first thing we do is check the node for the key we want to remove.
If we are not at a leaf not, we first need to check if the result of the search for the key was successful. If the key we are searching for is not in the node, then we need to find the node with a the largest key still less than the key we are searching for. This is the purpose of the floor() method.
void remove(BPNode<K,V>* node, K key) {
int idx = node->find(key);
if (node->isLeaf()) {
if (idx != -1) {
node->eraseAt(idx);
}
} else {
idx = (idx == -1) ? node->floor(key):idx;
remove(node->at(idx).next, key);
if (node->at(idx).next->isEmpty()) {
node->eraseAt(idx);
}
}
}
After traversing the tree to the leaf (and presumably removing a value from the leaf level) we check each node as the stack unwinds for the case where we may have erased the only entry in a node, which would leave an empty node with nothing pointing at it. In this case, we delete both the empty node and the key in the index which pointed to the empty node. This process cascades upward towards the root. In this way, the viability of the Index is preserved, despite letting nodes underflow as we erase values.
max@MaxGorenLaptop:~/btrees$ ./bpt
[ a e m ]
[ a b ] [ e l ] [ m p s ]
[ a a a ] [ b e ] [ e e e ] [ l l l ] [ m m ] [ p r ] [ s t x ]
Erase: a
[ a e m ]
[ a b ] [ e l ] [ m p s ]
[ a a ] [ b e ] [ e e e ] [ l l l ] [ m m ] [ p r ] [ s t x ]
--------------
Erase: s
[ a e m ]
[ a b ] [ e l ] [ m p s ]
[ a a ] [ b e ] [ e e e ] [ l l l ] [ m m ] [ p r ] [ t x ]
--------------
Erase: m
[ a e m ]
[ a b ] [ e l ] [ m p s ]
[ a a ] [ b e ] [ e e e ] [ l l l ] [ m ] [ p r ] [ t x ]
--------------
Erase: a
[ a e m ]
[ a b ] [ e l ] [ m p s ]
[ a ] [ b e ] [ e e e ] [ l l l ] [ m ] [ p r ] [ t x ]
--------------
Erase: l
[ a e m ]
[ a b ] [ e l ] [ m p s ]
[ a ] [ b e ] [ e e e ] [ l l ] [ m ] [ p r ] [ t x ]
--------------
Erase: l
[ a e m ]
[ a b ] [ e l ] [ m p s ]
[ a ] [ b e ] [ e e e ] [ l ] [ m ] [ p r ] [ t x ]
--------------
Erase: b
[ a e m ]
[ a b ] [ e l ] [ m p s ]
[ a ] [ e ] [ e e e ] [ l ] [ m ] [ p r ] [ t x ]
--------------
Erase: t
[ a e m ]
[ a b ] [ e l ] [ m p s ]
[ a ] [ e ] [ e e e ] [ l ] [ m ] [ p r ] [ x ]
--------------
Erase: r
[ a e m ]
[ a b ] [ e l ] [ m p s ]
[ a ] [ e ] [ e e e ] [ l ] [ m ] [ p ] [ x ]
--------------
Erase: e
[ a e m ]
[ a b ] [ e l ] [ m p s ]
[ a ] [ e ] [ e e ] [ l ] [ m ] [ p ] [ x ]
--------------
Erase: e
[ a e m ]
[ a b ] [ e l ] [ m p s ]
[ a ] [ e ] [ e ] [ l ] [ m ] [ p ] [ x ]
--------------
Erase: e
[ a e m ]
[ a b ] [ l ] [ m p s ]
[ a ] [ e ] [ l ] [ m ] [ p ] [ x ]
--------------
Erase: x
[ a e m ]
[ a b ] [ l ] [ m p ]
[ a ] [ e ] [ l ] [ m ] [ p ]
--------------
Erase: a
[ a e m ]
[ b ] [ l ] [ m p ]
[ e ] [ l ] [ m ] [ p ]
--------------
Erase: m
[ a e m ]
[ b ] [ l ] [ p ]
[ e ] [ l ] [ p ]
--------------
Erase: p
[ a e ]
[ b ] [ l ]
[ e ] [ l ]
--------------
Erase: l
[ a ]
[ b ]
[ e ]
--------------
Erase: e
--------------
A funny side effect of how deletion was implemented is, as you can see, the tree's height does not shrink as we remove items. That wont really do, so how can we also handle the gradual reduction of height that accompanies removing items? Well, actually it's easier than you'd think.
A Better Approach
Instead of just removing links to nodes when they become empty, we can also do a better job of maintaining the index by replacing the keys in the internal nodes when the value they direct to is removed from the tree.
link erase(link curr, K key, int ht) {
if (ht == 0) {
int j = curr->find(key);
if (j != -1) {
curr->eraseAt(j);
}
} else {
int j = curr->floor(key);
curr->page[j].next = erase(curr->page[j].next, key, ht-1);
if (curr->page[j].next->n == 0) {
curr->eraseAt(j);
} else if (key == curr->page[j].info.key()) {
curr->page[j] = entry(curr->page[j].next->page[0].info.key(), curr->page[j].next);
}
}
return curr;
}
Just as they can only grow from the root node, Btree's can only shrink from the root node. When the root node only has one key left, we simple replace it by it's child, making that the new root node.
void erase(K key) {
root = erase(root, key, height);
if (root->n == 1) {
link tmp = root;
root = root->page[0].next;
delete tmp;
height--;
}
}
Let's see how these changes stack up:
[ a i ]
[ a e ] [ i n r ]
[ a c ] [ e g h ] [ i l m ] [ n p ] [ r s x ]
Erase: a
[ c i ]
[ c e ] [ i n r ]
[ c ] [ e g h ] [ i l m ] [ n p ] [ r s x ]
------------
Erase: s
[ c i ]
[ c e ] [ i n r ]
[ c ] [ e g h ] [ i l m ] [ n p ] [ r x ]
------------
Erase: e
[ c i ]
[ c g ] [ i n r ]
[ c ] [ g h ] [ i l m ] [ n p ] [ r x ]
------------
Erase: r
[ c i ]
[ c g ] [ i n x ]
[ c ] [ g h ] [ i l m ] [ n p ] [ x ]
------------
Erase: c
[ g i ]
[ g ] [ i n x ]
[ g h ] [ i l m ] [ n p ] [ x ]
------------
Erase: h
[ g i ]
[ g ] [ i n x ]
[ g ] [ i l m ] [ n p ] [ x ]
------------
Erase: i
[ g l ]
[ g ] [ l n x ]
[ g ] [ l m ] [ n p ] [ x ]
------------
Erase: n
[ g l ]
[ g ] [ l p x ]
[ g ] [ l m ] [ p ] [ x ]
------------
Erase: g
[ l p x ]
[ l m ] [ p ] [ x ]
------------
Erase: x
[ l p ]
[ l m ] [ p ]
------------
Erase: m
[ l p ]
[ l ] [ p ]
------------
Erase: p
[ l ]
------------
Erase: l
[ ]
------------
Much better.
That's all I've got for today, until next time Happy Hacking!
-
Simplifying the erase operation for B+ trees
-
Building an AST from a Regular Expression
-
The Aho, Sethi, Ullman Direct DFA Construction Part 2: Building the DFA from Followpos
-
The Aho, Sethi, Ullman Direct DFA Construction, Part 1: Constructing the Followpos Table
-
Procedural Map Generation with Binary Space Partitioning
-
Exact Match String Searching: The Knuth-Morris-Pratt Algorithm
-
The visitor pattern: OOP takes on the Expression Problem
-
Improving mgcLisp's define syntax
-
Separating the Men from the Boys, Knuth Style
-
Reducing rotations during deletion from AVL trees
Leave A Comment