Your default hash table should be open-addressed, using Robin Hood linear probing with backward-shift deletion.
When prioritizing deterministic performance over memory efficiency, two-way chaining is also a good choice.
Code for this article may be found on GitHub.
- Open Addressing vs. Separate Chaining
- Benchmark Setup
- Discussion
- Unrolling, Prefetching, and SIMD
- Benchmark Data
Most people first encounter hash tables implemented using separate chaining, a model simple to understand and analyze mathematically.
In separate chaining, a hash function is used to map each key to one of \(K\) buckets.
Each bucket holds a linked list, so to retrieve a key, one simply traverses its corresponding bucket.
Given a hash function drawn from a universal family, inserting \(N\) keys into a table with \(K\) buckets results in an expected bucket size of \(\frac{N}{K}\), a quantity also known as the table’s load factor.
Assuming (reasonably) that \(\frac{N}{K}\) is bounded by a constant, all operations on such a table can be performed in expected constant time.
Unfortunately, this basic analysis doesn’t consider the myriad factors that go into implementing an efficient hash table on a real computer.
In practice, hash tables based on open addressing can provide superior performance, and their limitations can be worked around in nearly all cases.
Open Addressing
In an open-addressed table, each bucket only contains a single key.
Collisions are handled by placing additional keys elsewhere in the table.
For example, in linear probing, a key is placed in the first open bucket starting from the index it hashes to.
Unlike in separate chaining, open-addressed tables may be represented in memory as a single flat array.
A priori, we should expect operations on flat arrays to offer higher performance than those on linked structures due to more coherent memory accesses.1
However, if the user wants to insert more than \(K\) keys into an open-addressed table with \(K\) buckets, the entire table must be resized.
Typically, a larger array is allocated and all current keys are moved to the new table.
The size is grown by a constant factor (e.g. 2x), so insertions occur in amortized constant time.
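To see why constant-factor growth implies amortized constant time, note that with a doubling table, the resizes triggered while inserting \(N\) keys move at most

\[
\frac{N}{2} + \frac{N}{4} + \frac{N}{8} + \cdots < N
\]

keys in total, so resizing contributes only a constant number of key moves per insertion on average.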
Why Not Separate Chaining?
In practice, the standard libraries of many languages provide separately chained hash tables, such as C++’s `std::unordered_map`.
At least in C++, this choice is now considered to have been a mistake—it violates C++’s principle of not paying for features you didn’t ask for.
Separate chaining still has some purported benefits, but they seem unconvincing:
- **Separately chained tables don’t require any linear-time operations.**
  First, this isn’t true. To maintain a constant load factor, separately chained hash tables also have to resize once a sufficient number of keys are inserted, though the limit can be greater than \(K\). Most separately chained tables, including `std::unordered_map`, only provide amortized constant time insertion.
  Second, open-addressed tables can be incrementally resized (see the sketch after this list). When the table grows, there’s no reason we have to move all existing keys right away. Instead, every subsequent operation can move a fixed number of old keys into the new table. Eventually, all keys will have been moved and the old table may be deallocated. This technique incurs overhead on every operation during a resize, but none will require linear time.
- **Entries can have stable addresses.**
  When each (key + value) entry is allocated separately, external code can safely store pointers to entries, and large entries never have to be copied. This property is easy to support by adding a single level of indirection: allocate each entry separately, but use an open-addressed table to map keys to pointers-to-entries. In this case, resizing the table only requires copying pointers, rather than entire entries.
- **Lower memory overhead.**
  It’s possible for a separately chained table to store the same data in less space than an open-addressed equivalent, since flat tables necessarily waste space storing empty buckets. However, matching high-quality open addressing schemes (especially when storing entries indirectly) requires a load factor greater than one, degrading query performance. Further, individually allocating nodes wastes memory due to heap allocator overhead and fragmentation.
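To make incremental resizing concrete, here is a minimal sketch (the structure and names are assumptions for illustration, not the benchmarked implementation). While a resize is in flight, the table keeps both arrays, every operation migrates a few old keys, and lookups must check both arrays until the old one is drained:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Sketch of incremental resizing: the old array drains into the new,
// larger one a few keys at a time, so no single operation is linear.
struct IncrementalTable {
    std::vector<std::optional<uint64_t>> slots;     // new array
    std::vector<std::optional<uint64_t>> old_slots; // drained during resize
    size_t cursor = 0;                              // next old slot to move

    // Plain linear-probing insert into the new array.
    void insert_new(uint64_t key) {
        size_t mask = slots.size() - 1;             // power-of-two capacity
        size_t i = key & mask;                      // hashing elided
        while (slots[i]) i = (i + 1) & mask;
        slots[i] = key;
    }

    // Called at the start of every operation while resizing.
    void migrate_some(size_t budget = 4) {
        while (budget-- > 0 && cursor < old_slots.size()) {
            if (old_slots[cursor]) insert_new(*old_slots[cursor]);
            cursor++;
        }
        if (cursor == old_slots.size()) {
            old_slots.clear();                      // resize complete
            cursor = 0;
        }
    }
};
```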
In fact, the only situation where open addressing truly doesn’t work is when the table cannot allocate and entries cannot be moved.
These requirements can arise when using intrusive linked lists, a common pattern in kernel data structures and embedded systems with no heap.
Benchmark Setup
For simplicity, let us only consider tables mapping 64-bit integer keys to 64-bit integer values.
Results will not take into account the effects of larger values or non-trivial key comparison, but should still be representative, as keys can often be represented in 64 bits and values are often pointers.
We will evaluate tables using the following metrics:
- Time to insert \(N\) keys into an empty table.
- Time to erase \(N\) existing keys.
- Time to look up \(N\) existing keys.
- Time to look up \(N\) missing keys2.
- Average probe length for existing & missing keys.
- Maximum probe length for existing & missing keys.
- Memory amplification in a full table (total size / size of data).
Each open-addressed table was benchmarked at 50%, 75%, and 90% load factors.
Every test targeted a table capacity of 8M entries, setting \(N = \lfloor 8388608 \times \text{Load Factor} \rfloor - 1\).
Fixing the table capacity allows us to benchmark behavior at exactly the specified load factor, avoiding misleading results from tables that have recently grown.
The large number of keys causes each test to unpredictably traverse a data set much larger than L3 cache (128MB+), so the CPU3 cannot effectively cache or prefetch memory accesses.
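The per-key timings can be collected with a loop along these lines (a hypothetical harness for illustration; the actual benchmark code lives in the linked repository):

```cpp
#include <chrono>
#include <cstdint>
#include <vector>

// Hypothetical timing harness: time all N operations in bulk and
// report the mean cost per key in nanoseconds.
template <typename Table>
double bench_insert(Table& table, const std::vector<uint64_t>& keys) {
    auto start = std::chrono::steady_clock::now();
    for (uint64_t k : keys) table.insert(k, k); // value = key, for simplicity
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / keys.size();       // ns per key
}
```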
Table Sizes
All benchmarked tables use exclusively power-of-two sizes.
This invariant allows us to translate hashes to array indices with a fast bitwise AND operation, since:
h % 2**n == h & (2**n - 1)
Power-of-two sizes also admit a neat virtual memory trick4, but that optimization is not included here.
However, power-of-two sizes can cause pathological behavior when paired with some hash functions, since indexing simply throws away the hash’s high bits.
An alternative strategy is to use prime-sized tables, which are significantly more robust to the choice of hash function.
For the purposes of this post, we have a fixed hash function, so prime sizes were not used5.
Hash Function
All benchmarked tables use the `squirrel3` hash function, a fast integer hash that (as far as I can tell) only exists in this GDC talk on noise-based RNG.
Here, it is adapted to 64-bit integers by choosing three large 64-bit primes:
```cpp
#include <cstdint>

uint64_t squirrel3(uint64_t at) {
    // Three large 64-bit prime constants used to mix the bits.
    constexpr uint64_t BIT_NOISE1 = 0x9E3779B185EBCA87ULL;
    constexpr uint64_t BIT_NOISE2 = 0xC2B2AE3D27D4EB4FULL;
    constexpr uint64_t BIT_NOISE3 = 0x27D4EB2F165667C5ULL;
    // Alternate multiply/add steps with xor-shifts to diffuse bits
    // in both directions.
    at *= BIT_NOISE1;
    at ^= (at >> 8);
    at += BIT_NOISE2;
    at ^= (at << 8);
    at *= BIT_NOISE3;
    at ^= (at >> 8);
    return at;
}
```
I make no claims regarding the quality or robustness of this hash function, but observe that it’s cheap, it produces the expected number of collisions in power-of-two tables, and it passes smhasher when applied bytewise.
Other Benchmarks
Several less interesting benchmarks were also run, all of which exhibited identical performance across equal-memory open-addressed tables and much worse performance for separate chaining.
- Time to look up each element by following a Sattolo cycle6 (2-4x).
- Time to clear the table (100-200x).
- Time to iterate all values (3-10x).
Separate Chaining
To get our bearings, let’s first implement a simple separate chaining table and benchmark it at various load factors.
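Concretely, a minimal version looks something like this (a sketch rather than the benchmarked implementation; erase and growth are omitted, and it reuses the `squirrel3` hash from above):

```cpp
#include <cstdint>
#include <forward_list>
#include <utility>
#include <vector>

uint64_t squirrel3(uint64_t at); // hash function defined above

// Minimal separately chained map for 64-bit keys and values.
struct ChainedTable {
    std::vector<std::forward_list<std::pair<uint64_t, uint64_t>>> buckets;

    explicit ChainedTable(size_t n) : buckets(n) {} // n: a power of two

    size_t index(uint64_t key) const {
        return squirrel3(key) & (buckets.size() - 1);
    }

    void insert(uint64_t key, uint64_t value) {
        buckets[index(key)].emplace_front(key, value);
    }

    uint64_t* find(uint64_t key) {
        for (auto& [k, v] : buckets[index(key)]) // traverse the chain
            if (k == key) return &v;
        return nullptr;                          // chain exhausted: missing
    }
};
```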
| | 50% Existing | 50% Missing | 100% Existing | 100% Missing | 200% Existing | 200% Missing | 500% Existing | 500% Missing |
|---|---|---|---|---|---|---|---|---|
| Insert (ns/key) | 120 | | 130 | | 130 | | 151 | |
| Erase (ns/key) | 112 | | 135 | | 179 | | 305 | |
| Lookup (ns/key) | 28 | 28 | 36 | 41 | 51 | 69 | 103 | 186 |
| Average Probe | 1.25 | 0.50 | 1.50 | 1.00 | 2.00 | 2.00 | 3.50 | 5.00 |
| Max Probe | 7 | 7 | 10 | 10 | 12 | 12 | 21 | 21 |
| Memory Amplification | 2.50 | | 2.00 | | 1.75 | | 1.60 | |
These are respectable results, especially lookups at 100%: the CPU is able to hide much of the memory latency.
Insertions and erasures, on the other hand, are pretty slow—they involve allocation.
Technically, the reported memory amplification is an underestimate, as it does not include heap allocator overhead.
However, we should not directly compare memory amplification against the coming open-addressed tables, as storing larger values would improve separate chaining’s ratio.
std::unordered_map
Running these benchmarks on `std::unordered_map` produces results roughly equivalent to the 100% column, just with marginally slower insertions and faster clears.
Using `squirrel3` with `std::unordered_map` additionally makes insertions slightly slower and lookups slightly faster.
Further comparisons will be relative to these results, since exact performance numbers for `std::unordered_map` depend on one’s standard library distribution (here, Microsoft’s).
Linear Probing
Next, let’s implement the simplest type of open addressing—linear probing.
In the following diagrams, assume that each letter hashes to its alphabetical index. The skull denotes a tombstone. A code sketch of all three operations follows the list.

- **Insert:** starting from the key’s index, place the key in the first empty or tombstone bucket.
- **Lookup:** starting from the key’s index, probe buckets in order until finding the key, reaching an empty (non-tombstone) bucket, or exhausting the table.
- **Erase:** look up the key and replace it with a tombstone.
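A minimal sketch of these operations, storing keys only (values and growth omitted), might look like the following; it assumes the table never completely fills and that inserted keys are not already present:

```cpp
#include <cstdint>
#include <vector>

uint64_t squirrel3(uint64_t at); // hash function from above

// Sketch of a linear probing table with tombstones.
enum class Slot : uint8_t { Empty, Full, Tombstone };

struct LinearTable {
    std::vector<uint64_t> keys;
    std::vector<Slot> state;

    explicit LinearTable(size_t capacity) // capacity: a power of two
        : keys(capacity), state(capacity, Slot::Empty) {}

    size_t mask() const { return keys.size() - 1; }

    // Insert: claim the first empty or tombstone bucket.
    void insert(uint64_t key) {
        size_t i = squirrel3(key) & mask();
        while (state[i] == Slot::Full) i = (i + 1) & mask();
        keys[i] = key;
        state[i] = Slot::Full;
    }

    // Lookup: probe until the key or an empty (non-tombstone) bucket.
    bool contains(uint64_t key) const {
        size_t i = squirrel3(key) & mask();
        while (state[i] != Slot::Empty) {
            if (state[i] == Slot::Full && keys[i] == key) return true;
            i = (i + 1) & mask();
        }
        return false;
    }

    // Erase: find the key and replace it with a tombstone.
    void erase(uint64_t key) {
        size_t i = squirrel3(key) & mask();
        while (state[i] != Slot::Empty) {
            if (state[i] == Slot::Full && keys[i] == key) {
                state[i] = Slot::Tombstone;
                return;
            }
            i = (i + 1) & mask();
        }
    }
};
```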
The results are surprisingly good, considering the simplicity:
| | 50% Existing | 50% Missing | 75% Existing | 75% Missing | 90% Existing | 90% Missing |
|---|---|---|---|---|---|---|
| Insert (ns/key) | 47 | | 60 | | 73 | |
| Erase (ns/key) | 21 | | 36 | | 47 | |
| Lookup (ns/key) | 21 | 86 | 35 | 122 | 47 | 291 |
| Average Probe | 0.50 | 6.04 | 1.49 | 29.49 | 4.46 | 222.04 |
| Max Probe | 43 | 85 | 181 | 438 | 1604 | 2816 |
| Memory Amplification | 2.00 | | 1.33 | | 1.11 | |
For existing keys, we’re already beating the separately chained tables, and with lower memory overhead!
Of course, there are also some obvious problems—looking for missing keys takes much longer, and the worst-case probe lengths are quite scary, even at 50% load factor.
But, we can do much better.
Erase: Rehashing
One simple optimization is automatically recreating the table once it accumulates too many tombstones.
Erase now occurs in amortized constant time, but we already tolerate that for insertion, so it shouldn’t be a big deal.
Let’s implement a table that re-creates itself after erase has been called \(\frac{N}{2}\) times.
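Reusing the sketched tombstone table from above, the trigger might look like this (the erase counter and threshold plumbing are illustrative assumptions):

```cpp
#include <cstdint>
#include <utility>

// Erase, then rebuild once erase has been called N/2 times: all live
// keys are re-inserted into a fresh array, discarding every tombstone.
void erase_and_maybe_rehash(LinearTable& t, uint64_t key,
                            size_t n_keys, size_t& erase_count) {
    t.erase(key);
    if (++erase_count >= n_keys / 2) {
        LinearTable fresh(t.keys.size());
        for (size_t i = 0; i < t.keys.size(); i++)
            if (t.state[i] == Slot::Full)
                fresh.insert(t.keys[i]); // live keys only
        t = std::move(fresh);
        erase_count = 0;
    }
}
```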
Compared to naive linear probing:
| | 50% Existing | 50% Missing | 75% Existing | 75% Missing | 90% Existing | 90% Missing |
|---|---|---|---|---|---|---|
| Insert (ns/key) | 46 | | 57 | | 70 | |
| Erase (ns/key) | 23 (+8.3%) | | 27 (-24.6%) | | 28 (-39.7%) | |
| Lookup (ns/key) | 20 | 84 | 34 | 89 (-26.8%) | 46 | 142 (-51.2%) |
| Average Probe | 0.50 | 6.04 | 1.49 | 9.34 (-68.3%) | 4.46 | 57.84 (-74.0%) |
| Max Probe | 43 | 85 | 181 | 245 (-44.1%) | 1604 | 1405 (-50.1%) |
| Memory Amplification | 2.00 | | 1.33 | | 1.11 | |
Clearing the tombstones dramatically improves missing key queries, but they are still pretty slow.
It also makes erase slower at 50% load—there aren’t enough tombstones to make up for having to recreate the table.
Rehashing also does nothing to mitigate long probe sequences for existing keys.
Erase: Backward Shift
Actually, who says we need tombstones? There’s a little-taught but very simple algorithm for erasing keys that doesn’t degrade performance or require recreating the table.
When we erase a key, we don’t know whether the lookup algorithm is relying on our bucket being filled to find keys later in the table.
Tombstones are one way to avoid this problem.
Instead, however, we can guarantee that traversal is never disrupted by simply removing and reinserting all following keys, up to the next empty bucket.
Note that despite the name, we are not simply shifting the following keys backward: any key already at its optimal index is left in place. A sketch of this erase procedure appears below.
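Here is a sketch of backward-shift erase over the same toy table (now tombstone-free); the in-place moves below are equivalent to removing and reinserting each affected key:

```cpp
// Probe distance from a key's home bucket to a given slot.
size_t probe_dist(size_t home, size_t slot, size_t mask) {
    return (slot - home) & mask;
}

// Backward-shift deletion: open a hole at the erased key, then walk
// forward to the next empty bucket, pulling each key into the hole
// whenever that moves it closer to (but never before) its home bucket.
void erase_backward_shift(LinearTable& t, uint64_t key) {
    size_t mask = t.mask();
    size_t i = squirrel3(key) & mask;
    while (t.state[i] == Slot::Full && t.keys[i] != key)
        i = (i + 1) & mask;
    if (t.state[i] != Slot::Full) return;  // hit an empty slot: key absent

    t.state[i] = Slot::Empty;              // open the hole
    for (size_t j = (i + 1) & mask; t.state[j] == Slot::Full;
         j = (j + 1) & mask) {
        size_t home = squirrel3(t.keys[j]) & mask;
        if (probe_dist(home, i, mask) < probe_dist(home, j, mask)) {
            t.keys[i] = t.keys[j];         // move key j into the hole
            t.state[i] = Slot::Full;
            t.state[j] = Slot::Empty;
            i = j;                         // the hole advances to j
        }
    }
}
```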
After implementing backward-shift deletion, we can compare it to linear probing with rehashing:
| | 50% Existing | 50% Missing | 75% Existing | 75% Missing | 90% Existing | 90% Missing |
|---|---|---|---|---|---|---|
| Insert (ns/key) | 48 | | 60 | | 73 | |
| Erase (ns/key) | 47 (+107%) | | 83 (+208%) | | 149 (+425%) | |
| Lookup (ns/key) | 20 | 38 (-55.1%) | 36 | 89 | 46 | 134 |
| Average Probe | 0.50 | 1.50 (-75.2%) | 1.49 | 7.46 (-20.1%) | 4.46 | 49.7 (-14.1%) |
| Max Probe | 43 | 50 (-41.2%) | 181 | 243 | 1604 | 1403 |
| Memory Amplification | 2.00 | | 1.33 | | 1.11 | |
Naturally, erasures become significantly slower, but queries on missing keys now have reasonable behavior, especially at 50% load.
In fact, at 50% load, this table already beats the performance of equal-memory separate chaining in all four metrics.
But, we can still do better: a 50% load factor is unimpressive, and max probe lengths are still quite scary.