As an engineer who primarily works with data and databases, I spend a lot of time moving data around: hashing it, compressing it, decompressing it, and generally trying to shovel it between VMs and blob stores over TLS. I am
constantly surprised by how many systems only support slow, inefficient, and
expensive ways of doing these operations.
In my experience, these poor algorithm choices are orders of magnitude slower
than modern alternatives. Indeed, using a fast algorithm can often be the
difference between actually doing compression/hashing/encryption and “Eh, I’ll
skip it”. In the interest of using more of our CPU time to do useful work
instead of melting ice-caps and giving programmers excessively long coffee
breaks, we will explore some faster alternatives in this post.
TLDR
Disclaimer: There are lies, damn lies, and benchmarks from some random person on the internet. If you are considering taking some of the advice in this post, please remember to test your specific workloads, which might have different bottlenecks. Also, the implementation quality in your particular software stack for your particular hardware matters a lot.
For this post I’ll be playing with a ~6.5 GiB real-world JSON dataset on my laptop’s Intel Core i7-8565U pinned to 4 GHz. Since I want to benchmark the algorithms instead of disks, I’ll be pre-touching the file into RAM with vmtouch. Remember that on most modern cloud servers with fast NVMe storage (multiple GiB/s) and good page-caching algorithms, your disks are likely not your bottleneck.
$ du -shc yelp_academic_dataset_review.json
6.5G yelp_academic_dataset_review.json
$ vmtouch -t yelp_academic_dataset_review.json
# ...
$ vmtouch yelp_academic_dataset_review.json
Files: 1
Directories: 0
Resident Pages: 1693525/1693525 6G/6G 100%
Elapsed: 0.1256 seconds
Hashing
I would like to check that this blob of data over here is the same as that
data over there.
Trusted Data Hashes
These hash or checksum functions are used to ensure data integrity and are usually defending against bugs/bitrot/cosmic rays rather than malicious attackers. I typically see the following poor choices (a quick way to time them on your own data is sketched after this list):
- md5: This hash is both weak and slow. It does have the advantage of being one of the fastest slow choices in standard libraries and therefore it is somewhat common. Two of my favorite examples are slow Cassandra Quorum Reads and slow S3 upload/download (seriously, just try disabling md5 on parts and see how much faster S3 is).
- crc32 or adler32: I often hear "well crc32 is fast", but to be honest I haven’t in practice come across particularly fast implementations in real-world deployments. Sure, theoretically there are hardware implementations, but at least in most Java or Python programs I profile running on most servers I encounter, these checksums are not particularly fast and only generate 32 bits of output, where I absolutely have to care about collisions.
- sha2: An oddly common choice for trusted data hashes. Odd because it is slow and you don’t need a cryptographic hash for syncing a file between two hosts within an internal network with an authorized transfer protocol (e.g. rsync or via backups to S3 with proper AWS IAM policies). If attackers have access to your backups or artifact store you have a bigger problem than them forging a hash.
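If you want to see how slow these standard-library choices are on your own hardware, a rough timing harness is all it takes. Below is a minimal sketch (my own illustration, not the harness behind any numbers in this post) that streams the Yelp file through Python’s built-in md5, sha256, and crc32; run it with the file already in the page cache (see the vmtouch step above) so you measure the hash rather than the disk.

import hashlib
import time
import zlib

PATH = "yelp_academic_dataset_review.json"
CHUNK = 1 << 20  # 1 MiB reads so we measure hashing, not I/O bookkeeping

def timed(name, fn):
    start = time.perf_counter()
    digest = fn()
    print(f"{name:8s} {time.perf_counter() - start:6.2f}s  {digest}")

def stdlib_hash(factory):
    h = factory()
    with open(PATH, "rb") as f:
        while chunk := f.read(CHUNK):
            h.update(chunk)
    return h.hexdigest()

def crc32_file():
    value = 0
    with open(PATH, "rb") as f:
        while chunk := f.read(CHUNK):
            value = zlib.crc32(chunk, value)
    return f"{value:08x}"

timed("md5", lambda: stdlib_hash(hashlib.md5))
timed("sha256", lambda: stdlib_hash(hashlib.sha256))
timed("crc32", crc32_file)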
Faster Choice
Try xxHash. It is blazing fast, high quality, and the XXH64 variant is usually sufficient for most data integrity checks. It even performs well on small data inputs, where XXH3 is particularly impressive. If you find yourself needing 128 bits of entropy (e.g. if you’re hashing data for a DHT) you can either use the 128 bit variant or xxh3. If you only have an old version available that doesn’t support the new 128 bit variant, two XXH64 runs with different seeds is usually still faster than any other choice.
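For reference, hashing the dataset with xxHash from Python might look something like the sketch below; it assumes the third-party xxhash package (pip install xxhash), whose streaming API mirrors hashlib.

import xxhash  # third-party bindings; pip install xxhash

PATH = "yelp_academic_dataset_review.json"

h64 = xxhash.xxh64(seed=0)    # plenty for most integrity checks
h128 = xxhash.xxh3_128()      # if you want 128 bits of output

with open(PATH, "rb") as f:
    while chunk := f.read(1 << 20):  # 1 MiB chunks, as in the earlier sketch
        h64.update(chunk)
        h128.update(chunk)

print("XXH64   ", h64.hexdigest())
print("XXH3-128", h128.hexdigest())

The same package also ships xxh32 and xxh3_64 variants if 64 bits of output is enough but you want XXH3’s small-input performance.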
Untrusted Data Hashes
While most of the time the threat model for data transfer is “bugs / cosmic
rays”, some of the time people want to defend against bad actors. That’s
where cryptographic hashes come in, most commonly:
- md5: It’s slow and not resistant to collisions … everything I love in a hash function.
- sha1: This hash is at least somewhat performant and isn’t as completely broken as md5. We should probably still consider it on shaky ground though.
- sha2 (specifically sha256): A ubiquitous hash, and unlike md5 and sha1 it is still considered secure, so that is nice. Unfortunately this hash is noticeably slow when used in performance critical applications.
- sha3: The latest hash that NIST standardized (because they had concerns about SHA-2 that appear at this time somewhat unfounded). I was not able to find it packaged outside of openssl and it doesn’t appear to have great cli or language support at this time. It’s also still pretty slow.
Faster Choice
Try BLAKE3. Yes, it is a new (2020) algorithm and there are concerns about security margin, but I’m just so tired of waiting for sha256. In practice, this is probably a much better choice than the known-to-be-completely-broken md5, so if you’re reaching for md5 over xxHash because you need a cryptographic alternative, consider blake3 instead. Also, blake3 uses hash trees (Merkle trees), which are wonderful when implemented correctly, and I wish more systems used them.
The one major downside of blake3 in my opinion is that at this time (2021) I only know of really good cli, Rust, Go, C, and Python implementations, and I can’t personally vouch for the Java JNI bindings. I’ve only needed to use it so far for streaming pipeline verification or artifact verification, so the cli and python implementations are good enough for me.
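For the artifact-verification case, the Python bindings look almost exactly like hashlib. Here is a minimal sketch, assuming the blake3 package from PyPI (pip install blake3):

from blake3 import blake3  # third-party bindings; pip install blake3

PATH = "yelp_academic_dataset_review.json"

hasher = blake3()
with open(PATH, "rb") as f:
    while chunk := f.read(1 << 20):  # 1 MiB chunks
        hasher.update(chunk)

print("BLAKE3", hasher.hexdigest())

From the shell, the b3sum CLI gives you the same digest if you’d rather skip Python entirely.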
A quick reminder that there are lots of security-sensitive hashing situations
where you don’t want a fast hash. For example, one situation where you want an
intentionally