Iterable Hash Tables Redux: Separate Chaining
In my earlier post on implementing Iterators for hash table based data structures I covered how to write an iterator for tables utilizing open addressing collision strategies - quadratic probing to be precise. In this post I'll be covering implementing Iterators for a hash table that uses separate chaining for handling collisions.
Separate chaining is one of the more versatile of the collision resolution strategies, as well as one of the first. For those interested in a bit of data structure history/trivia: the first description of a linked list was in a 1952 internal IBM paper discussing hash tables.
The difference between how open addressing as compared to separate chaining is implemented, is large enough that an iterator for one is not interoperable between the two. It is my belief that the challenges which arise in implementing an iterator for a separate chaining hash table are both unique and interesting in their own right and thus warrant their own separate post.
A Hash Set
Two of the most common uses for hash tables are for implementing maps and for implementing sets. Having covered maps in the previous post, I figured in this post I would implement a set - but this is superfluous, the ideas here extend to either.
Separate chaining is a straightforward concept: Instead of an array of single entry slots, we use an array of linked lists. Determining the initial slot is business as usual for a hash table, i.e. compute the index from the key, and then you simply add any item that hashes to that slot the linked list.
Instead of re-inventing the wheel, I will be use the code from https://github.com/maxgoren/IterableSet/blob/main/LinkedSet.hpp for the linked list. It has a few things going for it that make the job easier, mainly being that it already has the code that enforces uniqueness - which we will need for each of our buckets - a set should not contain duplicate items. It also supports it's own iterator, which will again make things easier for us.
template <class T, class hasher = hashfn<T>>
class IterableSet {
private:
LinkedSet<T>* table;
int n;
int maxn;
void init(int max) {
maxn = max;
n = 0;
table = new LinkedSet<T>[maxn];
}
class Iterator {
/* hold your horses, I'm getting there. */
};
public:
IterableSet(int max = 113) {
init(max);
}
~IterableSet() {
delete [] table;
}
int size() const {
return n;
}
int max_size() const {
return maxn;
}
bool isEmpty() const {
return n == 0;
}
void put(T info) {
unsigned int idx = hasher()(info) % maxn;
table[idx].add(info);
n++;
}
void remove(T info) {
unsigned int idx = hasher()(info) % maxn;
table[idx].remove(info);
n++;
}
Iterator find(T info) {
unsigned int idx = hasher()(info) % maxn;
return table[idx].find(info) == table[idx].end() ? end():Iterator(table+idx, info);
}
Iterator begin() {
return Iterator(table);
}
Iterator end() {
return Iterator(table+maxn);
}
};
If you're not quite sure what's going on here, now would probably be a good time to pause, and re-familiarize yourself with how separate chaining hash tables work, as that is not the focus of this post.
The Iterator
What makes implementing an Iterator for separate chaining tables, is that we have to handle two different types of iteration with one iterator. On the one hand, our iterator needs to behave like an array iterator when changing buckets. On the other hand, while we are in a valid bucket it needs to iterate over a linked structure.
Choosing to use a linked structure for which we already have an iterator simplifies the implementation greatly. For the open addressing iterator, only one internal object was needed, a pointer into the table. Now, we need to maintain two internal objects: a pointer in to the table, and a LinkedSet<T>::Iterator. You'll see why in a moment.
class Iterator {
private:
LinkedSet<T>* __setPtr;
typename LinkedSet<T>::Iterator __setIter;
public:
The constructor for our Iterator will be very similar to how it was for the open addressing iterator: Because hash tables are randomly ordered, were not guaranteed that table[0] is the first entry - it might be unoccupied. This requires us to perform a linear scan of the table from the first slot, stopping at the first occupied slot. Once this is done, we can set our internal __setIter to point to the LinkedSet at that slot's begin().
Iterator(LinkedSet<T>* __Ptr) {
__setPtr = __Ptr;
while (__setPtr && __setPtr->isEmpty()) {
__setPtr++;
}
if (__setPtr)
__setIter = __setPtr->begin();
}
Iterator() { }
Additionally, we need to implement a constructor that also advances __setIter to a known value inside the LinkedSet at a given slot:
Iterator(LinkedSet* __Ptr, T info) {
__setPtr = __Ptr;
__setIter = __Ptr.begin();
while (__setIter != __Ptr->end() && *__setIter != info) __setIter++;
}
Between those two constructors, you can probably see how this is coming together. Our Iterator is little more than a proxy for the underlying LinkedSet<T>'s iterators. Any doubts to the veracity of this statement should be cleared by the implementation of operator*() and operator++(). Put simply: It's our iterators job to ensure the seamless transmission from bucket to bucket, handing control back over to the LinkedSet<T>'s Iterator when their are values present. This ultimately enables Iteration of the entire collection.
T& operator*() {
return *__setIter;
}
Iterator operator++() {
if (__setIter != __setPtr->end()) {
++__setIter;
if (__setIter != __setPtr->end()) {
return *this;
}
}
do {
__setPtr++;
} while (__setPtr->isEmpty());
__setIter = __setPtr->begin();
return *this;
}
Because of this complicated dance between the two methods of Iteration, it takes a little fancy footwork to ensure that when operator++() is called, that the correct incrementation takes place: First we check if __setIter is pointing to it's end sentinel, if not than we can increment it, but before returning the iterator, we must check if THIS incrementation reached the end of the bucket, if not we're good and can return our iterator. However, if it has caused us to reach the end of the current bucket, we instead need to find the next valid bucket. The crux of the whole thing is that the only time incrementation should cease, is once we have no more buckets to check for additional values.
Post-increment operator is implemented in the standard way, as are the equality operators, though note that we must compare both __setPtr and __setIter as both need to match.
Iterator operator++(int) {
Iterator it = *this;
++*this;
return it;
}
bool operator==(const Iterator& o) const noexcept {
return __setPtr == o.__setPtr && __setIter == o.__setIter;
}
bool operator!=(const Iterator& o) const noexcept {
return !(*this==o);
}
};
And that's all there is to it! As before, we can now use C++'s enhanced for loop to iterate over our set, once supplied with an adequate begin() and end() methods.
#include <iostream>
#include "iterable_hashset.hpp"
using namespace std;
int main() {
string sed = "ahashsetexample";
IterableSet<char> hset(7);
for (char c : sed) {
hset.put(c);
}
for (char c : hset) {
cout<<c<<" ";
}
cout<<endl;
return 0;
}
max@MaxGorenLaptop:~/DS/Algorithms$ ./a.out
p s t x a e h l m
max@MaxGorenLaptop:~/DS/Algorithms$
Now, no matter whether you choose to use open addressing or separate chaining, you have what you need to implement an iterator for your collection.
Until next time, Happy Hacking!
-
Pratt Parsing: Top-Down Operator Precedence
-
Generating P-Code by AST Traversal
-
Ternary Search Tries: String Specific Ordered Symbol Tables
-
Digital Search Trees
-
Lossless Compression Part III: Huffman Coding
-
Lossless Compression Part II: The LZ77 Algorithm
-
Lossless Compression Part I: Working with Bits in a Byte Oriented World
-
Bottom Up AVL Tree: The OG Self-Balancing Binary Search Tree
-
A Different Take on Merge Sort: Binary Search Trees?
-
Deleting Entries From Open-Address Hash tables
Leave A Comment