Your default hash table should be open-addressed, using Robin Hood linear probing with backward-shift deletion.
When prioritizing deterministic performance over memory efficiency, two-way chaining is also a good choice.
Code for this article may be found on GitHub.
- Open Addressing vs. Separate Chaining
- Benchmark Setup
- Discussion
- Unrolling, Prefetching, and SIMD
- Benchmark Data
Most people first encounter hash tables implemented using separate chaining, a model simple to understand and analyze mathematically.
In separate chaining, a hash function is used to map each key to one of \(K\) buckets.
Each bucket holds a linked list, so to retrieve a key, one simply traverses its corresponding bucket.
Given a hash function drawn from a universal family, inserting \(N\) keys into a table with \(K\) buckets results in an expected bucket size of \(\frac{N}{K}\), a quantity also known as the table’s load factor.
Assuming (reasonably) that \(\frac{N}{K}\) is bounded by a constant, all operations on such a table can be performed in expected constant time.
Unfortunately, this basic analysis doesn’t consider the myriad factors that go into implementing an efficient hash table on a real computer.
In practice, hash tables based on open addressing can provide superior performance, and their limitations can be worked around in nearly all cases.
Open Addressing
In an open-addressed table, each bucket only contains a single key.
Collisions are handled by placing additional keys elsewhere in the table.
For example, in linear probing, a key is placed in the first open bucket starting from the index it hashes to.
Unlike in separate chaining, open-addressed tables may be represented in memory as a single flat array.
A priori, we should expect operations on flat arrays to offer higher performance than those on linked structures due to more coherent memory accesses.1
However, if the user wants to insert more than \(K\) keys into an open-addressed table with \(K\) buckets, the entire table must be resized.
Typically, a larger array is allocated and all current keys are moved to the new table.
The size is grown by a constant factor (e.g. 2x), so insertions occur in amortized constant time.
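To see why constant-factor growth implies amortized constant time, note that with a doubling table, the resizes triggered while inserting \(N\) keys move at most

\[
\frac{N}{2} + \frac{N}{4} + \frac{N}{8} + \cdots < N
\]

keys in total, so resizing contributes only a constant number of key moves per insertion on average.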
Why Not Separate Chaining?
In practice, the standard libraries of many languages provide separately chained hash tables, such as C++’s `std::unordered_map`.
At least in C++, this choice is now considered to have been a mistake—it violates C++’s principle of not paying for features you didn’t ask for.
Separate chaining still has some purported benefits, but they seem unconvincing:
- **Separately chained tables don’t require any linear-time operations.**
  First, this isn’t true. To maintain a constant load factor, separately chained hash tables also have to resize once a sufficient number of keys are inserted, though the limit can be greater than \(K\). Most separately chained tables, including `std::unordered_map`, only provide amortized constant time insertion.
  Second, open-addressed tables can be incrementally resized (see the sketch after this list). When the table grows, there’s no reason we have to move all existing keys right away. Instead, every subsequent operation can move a fixed number of old keys into the new table. Eventually, all keys will have been moved and the old table may be deallocated. This technique incurs overhead on every operation during a resize, but none will require linear time.
- **Entries can have stable addresses.**
  When each (key + value) entry is allocated separately, external code can safely store pointers to entries, and large entries never have to be copied. This property is easy to support by adding a single level of indirection: allocate each entry separately, but use an open-addressed table to map keys to pointers-to-entries. In this case, resizing the table only requires copying pointers, rather than entire entries.
- **Lower memory overhead.**
  It’s possible for a separately chained table to store the same data in less space than an open-addressed equivalent, since flat tables necessarily waste space storing empty buckets. However, matching high-quality open addressing schemes (especially when storing entries indirectly) requires a load factor greater than one, degrading query performance. Further, individually allocating nodes wastes memory due to heap allocator overhead and fragmentation.
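To make incremental resizing concrete, here is a minimal sketch (the structure and names are assumptions for illustration, not the benchmarked implementation). While a resize is in flight, the table keeps both arrays, every operation migrates a few old keys, and lookups must check both arrays until the old one is drained:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Sketch of incremental resizing: the old array drains into the new,
// larger one a few keys at a time, so no single operation is linear.
struct IncrementalTable {
    std::vector<std::optional<uint64_t>> slots;     // new array
    std::vector<std::optional<uint64_t>> old_slots; // drained during resize
    size_t cursor = 0;                              // next old slot to move

    // Plain linear-probing insert into the new array.
    void insert_new(uint64_t key) {
        size_t mask = slots.size() - 1;             // power-of-two capacity
        size_t i = key & mask;                      // hashing elided
        while (slots[i]) i = (i + 1) & mask;
        slots[i] = key;
    }

    // Called at the start of every operation while resizing.
    void migrate_some(size_t budget = 4) {
        while (budget-- > 0 && cursor < old_slots.size()) {
            if (old_slots[cursor]) insert_new(*old_slots[cursor]);
            cursor++;
        }
        if (cursor == old_slots.size()) {
            old_slots.clear();                      // resize complete
            cursor = 0;
        }
    }
};
```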
In fact, the only situation where open addressing truly doesn’t work is when the table cannot allocate and entries cannot be moved.
These requirements can arise when using intrusive linked lists, a common pattern in kernel data structures and embedded systems with no heap.
Benchmark Setup
For simplicity, let us only consider tables mapping 64-bit integer keys to 64-bit integer values.
Results will not take into account the effects of larger values or non-trivial key comparison, but should still be representative, as keys can often be represented in 64 bits and values are often pointers.
We will evaluate tables using the following metrics:
- Time to insert \(N\) keys into an empty table.
- Time to erase \(N\) existing keys.
- Time to look up \(N\) existing keys.
- Time to look up \(N\) missing keys2.
- Average probe length for existing & missing keys.
- Maximum probe length for existing & missing keys.
- Memory amplification in a full table (total size / size of data).
Each open-addressed table was benchmarked at 50%, 75%, and 90% load factors.
Every test targeted a table capacity of 8M entries, setting \(N = \lfloor 8388608 \times \text{Load Factor} \rfloor - 1\).
Fixing the table capacity allows us to benchmark behavior at exactly the specified load factor, avoiding misleading results from tables that have recently grown.
The large number of keys causes each test to unpredictably traverse a data set much larger than L3 cache (128MB+), so the CPU3 cannot effectively cache or prefetch memory accesses.
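The per-key timings can be collected with a loop along these lines (a hypothetical harness for illustration; the actual benchmark code lives in the linked repository):

```cpp
#include <chrono>
#include <cstdint>
#include <vector>

// Hypothetical timing harness: time all N operations in bulk and
// report the mean cost per key in nanoseconds.
template <typename Table>
double bench_insert(Table& table, const std::vector<uint64_t>& keys) {
    auto start = std::chrono::steady_clock::now();
    for (uint64_t k : keys) table.insert(k, k); // value = key, for simplicity
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / keys.size();       // ns per key
}
```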
Table Sizes
All benchmarked tables use exclusively power-of-two sizes.
This invariant allows us to translate hashes to array indices with a fast bitwise AND operation, since:
h % 2**n == h & (2**n - 1)
Power-of-two sizes also admit a neat virtual memory trick4, but that optimization is not included here.
However, power-of-two sizes can cause pathological behavior when paired with some hash functions, since indexing simply throws away the hash’s high bits.
An alternative strategy is to use prime-sized tables, which are significantly more robust to the choice of hash function.
For the purposes of this post, we have a fixed hash function, so prime sizes were not used5.
Hash Function
All benchmarked tables use the `squirrel3` hash function, a fast integer hash that (as far as I can tell) only exists in this GDC talk on noise-based RNG.
Here, it is adapted to 64-bit integers by choosing three large 64-bit primes:
```cpp
#include <cstdint>

uint64_t squirrel3(uint64_t at) {
    // Three large 64-bit prime constants used to mix the bits.
    constexpr uint64_t BIT_NOISE1 = 0x9E3779B185EBCA87ULL;
    constexpr uint64_t BIT_NOISE2 = 0xC2B2AE3D27D4EB4FULL;
    constexpr uint64_t BIT_NOISE3 = 0x27D4EB2F165667C5ULL;
    // Alternate multiply/add steps with xor-shifts to diffuse bits
    // in both directions.
    at *= BIT_NOISE1;
    at ^= (at >> 8);
    at += BIT_NOISE2;
    at ^= (at << 8);
    at *= BIT_NOISE3;
    at ^= (at >> 8);
    return at;
}
```
I make no claims regarding the quality or robustness of this hash function, but observe that it’s cheap, it produces the expected number of collisions in power-of-two tables, and it passes smhasher when applied bytewise.
Other Benchmarks
Several less interesting benchmarks were also run, all of which exhibited identical performance across equal-memory open-addressed tables and much worse performance for separate chaining.
- Time to look up each element by following a Sattolo cycle6 (2-4x).
- Time to clear the table (100-200x).
- Time to iterate all values (3-10x).
Separate Chaining
To get our bearings, let’s first implement a simple separate chaining table and benchmark it at various load factors.
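Concretely, a minimal version looks something like this (a sketch rather than the benchmarked implementation; erase and growth are omitted, and it reuses the `squirrel3` hash from above):

```cpp
#include <cstdint>
#include <forward_list>
#include <utility>
#include <vector>

uint64_t squirrel3(uint64_t at); // hash function defined above

// Minimal separately chained map for 64-bit keys and values.
struct ChainedTable {
    std::vector<std::forward_list<std::pair<uint64_t, uint64_t>>> buckets;

    explicit ChainedTable(size_t n) : buckets(n) {} // n: a power of two

    size_t index(uint64_t key) const {
        return squirrel3(key) & (buckets.size() - 1);
    }

    void insert(uint64_t key, uint64_t value) {
        buckets[index(key)].emplace_front(key, value);
    }

    uint64_t* find(uint64_t key) {
        for (auto& [k, v] : buckets[index(key)]) // traverse the chain
            if (k == key) return &v;
        return nullptr;                          // chain exhausted: missing
    }
};
```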
| | 50% Existing | 50% Missing | 100% Existing | 100% Missing | 200% Existing | 200% Missing | 500% Existing | 500% Missing |
|---|---|---|---|---|---|---|---|---|
| Insert (ns/key) | 120 | | 130 | | 130 | | 151 | |
| Erase (ns/key) | 112 | | 135 | | 179 | | 305 | |
| Lookup (ns/key) | 28 | 28 | 36 | 41 | 51 | 69 | 103 | 186 |
| Average Probe | 1.25 | 0.50 | 1.50 | 1.00 | 2.00 | 2.00 | 3.50 | 5.00 |
| Max Probe | 7 | 7 | 10 | 10 | 12 | 12 | 21 | 21 |
| Memory Amplification | 2.50 | | 2.00 | | 1.75 | | 1.60 | |
These are respectable results, especially lookups at 100%: the CPU is able to hide much of the memory latency.
Insertions and erasures, on the other hand, are pretty slow—they involve allocation.
Technically, the reported memory amplification is an underestimate, as it does not include heap allocator overhead.
However, we should not directly compare memory amplification against the coming open-addressed tables, as storing larger values would improve separate chaining’s ratio.
std::unordered_map
Running these benchmarks on `std::unordered_map` produces results roughly equivalent to the 100% column, just with marginally slower insertions and faster clears.
Using `squirrel3` with `std::unordered_map` additionally makes insertions slightly slower and lookups slightly faster.
Further comparisons will be relative to these results, since exact performance numbers for `std::unordered_map` depend on one’s standard library distribution (here, Microsoft’s).
Linear Probing
Next, let’s implement the simplest type of open addressing—linear probing.
In the following diagrams, assume that each letter hashes to its alphabetical index. The skull denotes a tombstone. A code sketch of all three operations follows the list.

- **Insert:** starting from the key’s index, place the key in the first empty or tombstone bucket.
- **Lookup:** starting from the key’s index, probe buckets in order until finding the key, reaching an empty (non-tombstone) bucket, or exhausting the table.
- **Erase:** look up the key and replace it with a tombstone.
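A minimal sketch of these operations, storing keys only (values and growth omitted), might look like the following; it assumes the table never completely fills and that inserted keys are not already present:

```cpp
#include <cstdint>
#include <vector>

uint64_t squirrel3(uint64_t at); // hash function from above

// Sketch of a linear probing table with tombstones.
enum class Slot : uint8_t { Empty, Full, Tombstone };

struct LinearTable {
    std::vector<uint64_t> keys;
    std::vector<Slot> state;

    explicit LinearTable(size_t capacity) // capacity: a power of two
        : keys(capacity), state(capacity, Slot::Empty) {}

    size_t mask() const { return keys.size() - 1; }

    // Insert: claim the first empty or tombstone bucket.
    void insert(uint64_t key) {
        size_t i = squirrel3(key) & mask();
        while (state[i] == Slot::Full) i = (i + 1) & mask();
        keys[i] = key;
        state[i] = Slot::Full;
    }

    // Lookup: probe until the key or an empty (non-tombstone) bucket.
    bool contains(uint64_t key) const {
        size_t i = squirrel3(key) & mask();
        while (state[i] != Slot::Empty) {
            if (state[i] == Slot::Full && keys[i] == key) return true;
            i = (i + 1) & mask();
        }
        return false;
    }

    // Erase: find the key and replace it with a tombstone.
    void erase(uint64_t key) {
        size_t i = squirrel3(key) & mask();
        while (state[i] != Slot::Empty) {
            if (state[i] == Slot::Full && keys[i] == key) {
                state[i] = Slot::Tombstone;
                return;
            }
            i = (i + 1) & mask();
        }
    }
};
```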
The results are surprisingly good, considering the simplicity:
| | 50% Existing | 50% Missing | 75% Existing | 75% Missing | 90% Existing | 90% Missing |
|---|---|---|---|---|---|---|
| Insert (ns/key) | 47 | | 60 | | 73 | |
| Erase (ns/key) | 21 | | 36 | | 47 | |
| Lookup (ns/key) | 21 | 86 | 35 | 122 | 47 | 291 |
| Average Probe | 0.50 | 6.04 | 1.49 | 29.49 | 4.46 | 222.04 |
| Max Probe | 43 | 85 | 181 | 438 | 1604 | 2816 |
| Memory Amplification | 2.00 | | 1.33 | | 1.11 | |
For existing keys, we’re already beating the separately chained tables, and with lower memory overhead!
Of course, there are also some obvious problems—looking for missing keys takes much longer, and the worst-case probe lengths are quite scary, even at 50% load factor.
But, we can do much better.
Erase: Rehashing
One simple optimization is automatically recreating the table once it accumulates too many tombstones.
Erase now occurs in amortized constant time, but we already tolerate that for insertion, so it shouldn’t be a big deal.
Let’s implement a table that re-creates itself after erase has been called \(\frac{N}{2}\) times.
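Reusing the sketched tombstone table from above, the trigger might look like this (the erase counter and threshold plumbing are illustrative assumptions):

```cpp
#include <cstdint>
#include <utility>

// Erase, then rebuild once erase has been called N/2 times: all live
// keys are re-inserted into a fresh array, discarding every tombstone.
void erase_and_maybe_rehash(LinearTable& t, uint64_t key,
                            size_t n_keys, size_t& erase_count) {
    t.erase(key);
    if (++erase_count >= n_keys / 2) {
        LinearTable fresh(t.keys.size());
        for (size_t i = 0; i < t.keys.size(); i++)
            if (t.state[i] == Slot::Full)
                fresh.insert(t.keys[i]); // live keys only
        t = std::move(fresh);
        erase_count = 0;
    }
}
```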
Compared to naive linear probing:
| | 50% Existing | 50% Missing | 75% Existing | 75% Missing | 90% Existing | 90% Missing |
|---|---|---|---|---|---|---|
| Insert (ns/key) | 46 | | 57 | | 70 | |
| Erase (ns/key) | 23 (+8.3%) | | 27 (-24.6%) | | 28 (-39.7%) | |
| Lookup (ns/key) | 20 | 84 | 34 | 89 (-26.8%) | 46 | 142 (-51.2%) |
| Average Probe | 0.50 | 6.04 | 1.49 | 9.34 (-68.3%) | 4.46 | 57.84 (-74.0%) |
| Max Probe | 43 | 85 | 181 | 245 (-44.1%) | 1604 | 1405 (-50.1%) |
| Memory Amplification | 2.00 | | 1.33 | | 1.11 | |
Clearing the tombstones dramatically improves missing key queries, but they are still pretty slow.
It also makes erase slower at 50% load—there aren’t enough tombstones to make up for having to recreate the table.
Rehashing also does nothing to mitigate long probe sequences for existing keys.
Erase: Backward Shift
Actually, who says we need tombstones? There’s a little-taught but very simple algorithm for erasing keys that doesn’t degrade performance or require recreating the table.
When we erase a key, we don’t know whether the lookup algorithm is relying on our bucket being filled to find keys later in the table.
Tombstones are one way to avoid this problem.
Instead, however, we can guarantee that traversal is never disrupted by simply removing and reinserting all following keys, up to the next empty bucket.
Note that despite the name, we are not simply shifting the following keys backward: any key already at its optimal index is left in place. A sketch of this erase procedure appears below.
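Here is a sketch of backward-shift erase over the same toy table (now tombstone-free); the in-place moves below are equivalent to removing and reinserting each affected key:

```cpp
// Probe distance from a key's home bucket to a given slot.
size_t probe_dist(size_t home, size_t slot, size_t mask) {
    return (slot - home) & mask;
}

// Backward-shift deletion: open a hole at the erased key, then walk
// forward to the next empty bucket, pulling each key into the hole
// whenever that moves it closer to (but never before) its home bucket.
void erase_backward_shift(LinearTable& t, uint64_t key) {
    size_t mask = t.mask();
    size_t i = squirrel3(key) & mask;
    while (t.state[i] == Slot::Full && t.keys[i] != key)
        i = (i + 1) & mask;
    if (t.state[i] != Slot::Full) return;  // hit an empty slot: key absent

    t.state[i] = Slot::Empty;              // open the hole
    for (size_t j = (i + 1) & mask; t.state[j] == Slot::Full;
         j = (j + 1) & mask) {
        size_t home = squirrel3(t.keys[j]) & mask;
        if (probe_dist(home, i, mask) < probe_dist(home, j, mask)) {
            t.keys[i] = t.keys[j];         // move key j into the hole
            t.state[i] = Slot::Full;
            t.state[j] = Slot::Empty;
            i = j;                         // the hole advances to j
        }
    }
}
```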
After implementing backward-shift deletion, we can compare it to linear probing with rehashing:
| | 50% Existing | 50% Missing | 75% Existing | 75% Missing | 90% Existing | 90% Missing |
|---|---|---|---|---|---|---|
| Insert (ns/key) | 48 | | 60 | | 73 | |
| Erase (ns/key) | 47 (+107%) | | 83 (+208%) | | 149 (+425%) | |
| Lookup (ns/key) | 20 | 38 (-55.1%) | 36 | 89 | 46 | 134 |
| Average Probe | 0.50 | 1.50 (-75.2%) | 1.49 | 7.46 (-20.1%) | 4.46 | 49.7 (-14.1%) |
| Max Probe | 43 | 50 (-41.2%) | 181 | 243 | 1604 | 1403 |
| Memory Amplification | 2.00 | | 1.33 | | 1.11 | |
Naturally, erasures become significantly slower, but queries on missing keys now have reasonable behavior, especially at 50% load.
In fact, at 50% load, this table already beats the performance of equal-memory separate chaining in all four metrics.
But, we can still do better: a 50% load factor is unimpressive, and max probe lengths are still quite scary.