Lets Talk Selection
Often times when we have a collection of values, we are more interested in the order statistics of the collection: find the min value, find the max value, find the K'th smallest value, and what have you. There are a number of data structures that efficiently support these operations: a binary search tree can support find min/max in O(log N) and find the Kth in O(logN+k). A binary heap will have the min or max (depending on type of heap) at the root and so is well suited for the task, being able to build a heap in linear time bottom up. In a sorted array, all three of the above are mentioned operations can be performed in O(1).
But if we have our data stored as an unordered collection, performing these tasks efficiently becomes a somewhat more complicated matter. Let us focus on finding the K'th element in an unordered collection. The naiive approach would be to run selection sort for the first K items of the collection:
int selectKth(int a[], int l, int r, int k) {
for (int i = l; i <= l+k; i++) {
int min = a[i];
for (int j = l; j <= r; j++) {
if (a[j] < a[min])
min = j;
}
if (min != i)
swap(a, min, i);
}
return a[l+k];
}
Like the sorting method its based on, the above algorithm also suffers from exponential worst case comparisons, but it highlights a basic theme that we will see pop up in every in-place solution to this problem: partial sorting of the list. If extra memory is available and cheap, then a straight forward way to perform such a query is with a priority queue. Simple place all the values in the queue, and then remove the first K-1 values, the Kth value will then be at the "top" of the priority queue.
int selectKth(int a[], int n, int k) {
priority_queue<int, vector<int>, greater<int>> pq;
for (int i = 0; i < n; i++) pq.push(a[i]);
for (int i = 1; i < k; i++) pq.pop();
return pq.top();
}
Take a look at the selection sort based implementation. Now take a look at the priority queue based implementation. I'll give you a moment.
The Heap Select Algorithm
What makes heaps more than just an interesting aside in computer science books regarding priority queues, and what makes heapsort efficient, is that the heap can be constructed in place. We've already seen that we can perform selection through partial sorting, so why not use heap sort as the sorting method?
void exch(int a[], int l, int r) {
int t = a[l]; a[l] = a[r]; a[r] = t;
}
void downheap(int a[], int n, int k) {
while (2*k <= n) {
int j = 2*k;
if (j < n && a[j] < a[j+1]) j++;
if (!(a[k] < a[j])) break;
exch(a, k, j);
k = j;
}
}
int heapselect(int a[], int l, int r, int k) {
int n = r-l;
int *pq = a+l;
for (int i = n/2; i >= 0; i--) {
downheap(pq, n, i);
}
for (int i = n; i >= k; i--) {
exch(pq, 0, n--);
downheap(pq, n, 0);
}
return a[k];
}
This is the max-heap implementation, which has the somewhat unfortunate property of sorting n-k items of the collection instead of just k. This could of course be addressed by implementing heapsort using a minheap - but there's a reason you've never seen it done that way. It adds alot of unneeded complexity, especially when there is a still yet better performing algorithm for the task at hand...
The Quickselect Algorithm
Using the data structure analogy from above, we can consider selection with a heap as selection by sorting. We can think of quickselect as selection by partitioning - or to keep with the data structure analogy, as using a binary search tree. The partition algorithm from quicksort - either Hoare's or Lamuto's will do the job. I prefer Hoare's version:
int partition(int a[], int l, int r) {
int i = l-1, j = r, v = a[r];
for (;;) {
while (a[++i] < v);
while (a[--j] > v) if (j == l) break;
if (i >= j) break;
swap(a[i], a[j]);
}
swap(a[i], a[r]);
return i;
}
Quickselect was invented by Tony Hoare and published at the same time as Quicksort and Partition. If you recall how quicksort it works, it partitions an array around a pivot point like the root of a binary search tree: all values less then the pivot go to the left, all values greater go to the right. At the end of a pass, the chosen pivot is in its sorted position. This means we just need to proceed with partitioning the array like quicksort until pivot == k! The really great thing about selection by partitioning though, is that its run time is linear with very high probability.
int quickselect(int a[], int l, int r, int k) {
while (r > l) {
int i = partition(a, l, r);
if (i == k) return a[i];
if (i >= k) r = i - 1;
if (i <= k) l = i + 1;
}
return a[k];
}
Other Algorithms For Selection
There are couple of other popular selection algorithms that I want to mention, but wont be covering in as much depth as those above. The first method I want to touch on is called Introselect and comes from the same effort by David Musser as Introsort.
Introselect and Introsort are both algorithms developed for the C++ standard template library to offer worst case complexity guarantees. Introselect's relationship to Introsort has cause some confusion, as they do not mirror each other the same way quicksort and quickselect do. Introsort famously avoids quicksort's worst case O(n^2) complexity by falling back to heapsort. This has led some to believe that Introselect falls back on heapselect: It does not. Of course, as one would expect, this has led to encountering the occasional "mistaken implementation" that does blend quickselect and heapselect as mentioned. Whatever you wish to call it (frankenselect?) it is not introselect.
Introselect instead makes use of the "median of medians" algorithm to aid om pivot selection for the partitioning phase. Ideally, a good pivot value will split an array (roughly) in half. Otherwise, like quicksort it can become exponential for some inputs. Median of medians greatly reduces the chances of choosing pad pivot values, and is an interesting algorithm in its own right that really deserves its own post. But for now, that's all I've got, so until next time: Happy Hacking!
-
Pratt Parsing: Top-Down Operator Precedence
-
Generating P-Code by AST Traversal
-
Ternary Search Tries: String Specific Ordered Symbol Tables
-
Digital Search Trees
-
Lossless Compression Part III: Huffman Coding
-
Lossless Compression Part II: The LZ77 Algorithm
-
Lossless Compression Part I: Working with Bits in a Byte Oriented World
-
Bottom Up AVL Tree: The OG Self-Balancing Binary Search Tree
-
A Different Take on Merge Sort: Binary Search Trees?
-
Deleting Entries From Open-Address Hash tables
Leave A Comment