Probabilistic data structures are one of the most overlooked topics in computer science: Bloom filters, HyperLogLog counters, approximate nearest neighbors, and random projections should be as familiar to programmers as sorting algorithms, hashtables, or heaps. It is a field that can really make a difference, in many cases, in terms of performance and resource savings.
Maybe the “probability” part frightens programmers, who are often good at logic but less comfortable with mathematics. You shouldn’t be afraid of these algorithms: they are not that hard to understand, and a bit of intuition about random number generation, hashing, and collisions goes a long way.
In this article I’ll write about how to count events of very many distinct kinds, i.e. how to store a lot of counts. I’ll start with the Count-Min Sketch, one of the early probabilistic structures, then move on to the logarithmic counting scheme known as Morris counters, and end with the other approaches that eventually led us to our proposal, the Count-Min Tree Sketch.
The straightforward way of counting
First, a little bit of popular computer science: what is the usual way to store counts? For instance, if you want to know how many times posts have been liked, or how many times hashtags have been tweeted during a day?
The structure one would normally use is a (key, value) collection, or associative array, where the key is the label of the thing being counted and the value is the count itself. To actually count, you need a collection that can be modified, with access to a value in constant or logarithmic time. In practice, you would use a hashtable or a tree map.
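To fix ideas, here is a minimal sketch of that exact approach in Python, counting hashtags in a dict (the names and data are purely illustrative):

```python
from collections import defaultdict

# Exact counting with a hashtable: key = the label being counted, value = its count.
counts = defaultdict(int)

for hashtag in ["#cat", "#dog", "#cat", "#cat"]:
    counts[hashtag] += 1   # update in (amortized) constant time

print(counts["#cat"])  # 3 -- exact, but every distinct key is stored in full
```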
Counting approximately with Count-Min Sketches
In 2005, a paper by Cormode & Muthukrishnan described the Count-Min Sketch (CMS), a probabilistic data structure, partly inspired by the Bloom filter (and its counting variant), that can very efficiently store the counts associated with items. The main difference between the Bloom approach and the CMS is that the CMS splits the memory into several layers and uses a different hash function for each layer, while the Bloom approach uses several hash functions over a single memory segment.
There are two operations associated with a CMS:
- Update(item,value) and
- Query(item).
Update() adds a value to the count stored for the item, while Query() returns an estimate of the count for the item.
The great idea behind the CMS is that, even though there are collisions (and there necessarily will be), one can minimize the estimation error by keeping only the MIN value stored in the cells corresponding to an item. Indeed, all the other cells necessarily overestimate the real count.

The illustration above shows a CMS with 3 layers of 20 cells. It has been updated by counting two items: X and Y. X has been seen 2 times, Y 3 times. Because X and Y hash to the same cell in layer 0, there is a collision, and that cell contains the sum of the X and Y counts. When querying the CMS for the count of X, the values extracted for X are (5, 2, 2). Thanks to the MIN rule, the CMS returns only the minimum value, and consequently 2 is correctly returned.
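To make the structure concrete, here is a minimal Python sketch of a CMS with its two operations. Per-layer salted hashing stands in for the pairwise-independent hash functions of the paper; this is an illustration, not the authors’ implementation, and the exact collision pattern of the figure of course depends on the hash functions used.

```python
import random

class CountMinSketch:
    """A minimal Count-Min Sketch: 'depth' layers of 'width' cells, one hash per layer."""

    def __init__(self, width=20, depth=3, seed=42):
        self.width, self.depth = width, depth
        rng = random.Random(seed)
        # One salt per layer, standing in for independent hash functions.
        self.salts = [rng.randrange(1 << 30) for _ in range(depth)]
        self.cells = [[0] * width for _ in range(depth)]

    def _index(self, item, layer):
        # Per-layer hash of the item, reduced to a cell index.
        return hash((self.salts[layer], item)) % self.width

    def update(self, item, value=1):
        # Add 'value' to the item's cell in every layer.
        for layer in range(self.depth):
            self.cells[layer][self._index(item, layer)] += value

    def query(self, item):
        # MIN rule: colliding cells can only overestimate, so return the minimum.
        return min(self.cells[layer][self._index(item, layer)]
                   for layer in range(self.depth))


cms = CountMinSketch(width=20, depth=3)
for _ in range(2):
    cms.update("X")
for _ in range(3):
    cms.update("Y")
print(cms.query("X"))  # 2, unless X and Y happen to collide in every layer
```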
The CMS supports both positive and negative updates. If we add the constraint of positive-only updates, we can use the Conservative Update variant: only the cells that are already at the minimum value get incremented (at least when we only do increments of one; otherwise the rule is slightly more complicated).
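A possible sketch of Conservative Update, written as a function over the CountMinSketch class above. For increments of one it reduces to the rule just described; for larger values it applies the common “raise every cell to the new minimum estimate” generalization.

```python
def conservative_update(cms, item, value=1):
    # Conservative Update (positive updates only) for the sketch above.
    idx = [cms._index(item, layer) for layer in range(cms.depth)]
    # Target estimate after the update: current minimum plus the added value.
    target = min(cms.cells[layer][i] for layer, i in enumerate(idx)) + value
    for layer, i in enumerate(idx):
        # For increments of one, this bumps only the cells already at the minimum;
        # larger cells are left as they are, so they accumulate less overestimation.
        cms.cells[layer][i] = max(cms.cells[layer][i], target)
```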
Evaluation
The following graphs show the estimation error for the original CMS and for the CMS with conservative update (CMS-CU), both with 3 layers, with sizes varying from 12.5% to 800% of the perfect size (the size of an array containing just the counts, which would be achievable with a perfect hash function).
The first graph shows the absolute error (RMSE) after counting a million words of a Wikipedia dump, plus bigrams (i.e., for the sentence “the little cat”, we actually count “the”, “little”, “cat”, “the little”, and “little cat”). These million words end up creating a hashtable with 523,563 entries (30 MiB based on our estimation). The perfect size is computed with 32-bit counters, totaling 2 MiB (523,563 × 4 bytes ≈ 2 MiB). Consequently, the storage required for this hashtable is 1500% of the perfect size.

In this graph, the middle vertical line shows the perfect size. Measuring absolute error is not a neutral choice, because most of the values when counting unigrams and bigrams are one (1), since a lot of bigrams are seen only once. When measuring absolute error, estimating 101 instead of 1 counts the same as estimating 100100 instead of 100000 (and you may agree that it’s actually not the same kind of error).
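To make the distinction explicit, here is a small sketch of how the absolute metric used above and the relative metric discussed next could be computed, given the exact counts and an estimator such as cms.query from the sketch above. These are plausible definitions for illustration, not necessarily the exact formulas behind the graphs.

```python
import math

def rmse(truth, estimate):
    # Absolute error: a +100 mistake on a count of 1 weighs exactly as much as
    # a +100 mistake on a count of 100000.
    return math.sqrt(sum((estimate(k) - v) ** 2 for k, v in truth.items()) / len(truth))

def average_relative_error(truth, estimate):
    # Relative error: estimating 2 instead of 1 is as bad as 200000 instead of 100000.
    return sum(abs(estimate(k) - v) / v for k, v in truth.items()) / len(truth)

# Usage: truth is a dict of exact counts, estimate is e.g. cms.query.
```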
So we also need to measure the average relative error:

It looks a lot like the first graph, but it doesn’t mean the same thing AT ALL! At the perfect size, a CMS has an average relative error of 1 (100%), which means that, on average, instead of 1 it will estimate 2, instead of 100 it will estimate 200, and so on. For a relative error of 1% (probably enough for any practical purpose), you need 800% of the perfect size. That’s half the memory requirement of a hashtable, which is great. But is it good enough?
In 2014, we started to work on a streamed version of our text mining tool, eXenGine. As a consequence, we got interested in the Count-Min Sketch, but we quickly became disappointed by the size gain: a Count-Min Sketch can approximate values with a relative error of 1% at half the size of a hashtable, which is interesting, but not a game changer.
Counting approximately approximately
It wasn’t good enough for us. Given the number of items that we wanted to count, halving the memory requirement was not an ambitious enough target. So we started to ponder better systems, with two objectives in mind:
- the error measurement should be the Average Relative Error, because we will either use the estimation for a cut-off (ignoring very rare events), or for computing the Pointwise Mutual Information or the TF-IDF, both of which take a logarithm of the counts at some point (see the small illustration after this list).
- the distribution of the values is highly skewed (text data usually follows Zipf’s law), so storing small values should be super-efficient.
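To illustrate the first point: a multiplicative error on a count becomes a constant additive error once a logarithm is taken, whatever the magnitude of the count, which is why bounding the relative error is what matters for log-based measures. A tiny example (not from the article):

```python
import math

# A 100% relative error (estimating 2x instead of x) shifts the log of the count
# by exactly one bit, for small and large counts alike. The same +100 absolute
# error would wreck log(1) but barely touch log(100000).
for true_count in (1, 100, 100_000):
    estimate = 2 * true_count
    print(true_count, math.log2(estimate) - math.log2(true_count))  # always 1.0
```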
At first (in 2014), we came up with the idea of using approximate