As an engineer who primarily works with data and databases, I spend a lot of time moving data around: hashing it, compressing it, decompressing it, and generally trying to shovel it between VMs and blob stores over TLS. I am
constantly surprised by how many systems only support slow, inefficient, and
expensive ways of doing these operations.
In my experience, these poor algorithm choices are orders of magnitude slower
than modern alternatives. Indeed, using a fast algorithm can often be the
difference between actually doing compression/hashing/encryption and “Eh, I’ll
skip it”. In the interest of using more of our CPU time to do useful work
instead of melting ice-caps and giving programmers excessively long coffee
breaks, we will explore some faster alternatives in this post.
TLDR
Disclaimer: There are lies, damn lies, and benchmarks from some random person on the internet. If you are considering taking some of the advice in this post, please remember to test your specific workloads, which might have different bottlenecks. Also, the implementation quality in your particular software stack for your particular hardware matters a lot.
For this post I’ll be playing with a ~6.5 GiB real-world JSON dataset on my laptop’s Intel Core i7-8565U pinned to 4 GHz. Since I want to benchmark the algorithms instead of disks, I’ll be pre-touching the file into RAM with vmtouch. Remember that on most modern cloud servers with fast NVMe storage (multiple GiB/s) and good page-caching algorithms, your disks are likely not your bottleneck.
$ du -shc yelp_academic_dataset_review.json
6.5G yelp_academic_dataset_review.json
$ vmtouch -t yelp_academic_dataset_review.json
# ...
$ vmtouch yelp_academic_dataset_review.json
Files: 1
Directories: 0
Resident Pages: 1693525/1693525 6G/6G 100%
Elapsed: 0.1256 seconds
Hashing
I would like to check that this blob of data over here is the same as that
data over there.
Trusted Data Hashes
These hash or checksum functions are used to ensure data integrity and are usually defending against bugs/bitrot/cosmic rays rather than malicious attackers. I typically see the following poor choices (a quick way to time them on your own data is sketched after this list):
- md5: This hash is both weak and slow. It does have the advantage of being one of the fastest slow choices in standard libraries and therefore it is somewhat common. Two of my favorite examples are slow Cassandra Quorum Reads and slow S3 upload/download (seriously, just try disabling md5 on parts and see how much faster S3 is).
- crc32 or adler32: I often hear "well crc32 is fast", but to be honest I haven’t in practice come across particularly fast implementations in real-world deployments. Sure, theoretically there are hardware implementations, but at least in most Java or Python programs I profile running on most servers I encounter, these checksums are not particularly fast and only generate 32 bits of output, where I absolutely have to care about collisions.
- sha2: An oddly common choice for trusted data hashes. Odd because it is slow and you don’t need a cryptographic hash for syncing a file between two hosts within an internal network with an authorized transfer protocol (e.g. rsync or via backups to S3 with proper AWS IAM policies). If attackers have access to your backups or artifact store you have a bigger problem than them forging a hash.
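If you want to see how slow these standard-library choices are on your own hardware, a rough timing harness is all it takes. Below is a minimal sketch (my own illustration, not the harness behind any numbers in this post) that streams the Yelp file through Python’s built-in md5, sha256, and crc32; run it with the file already in the page cache (see the vmtouch step above) so you measure the hash rather than the disk.

import hashlib
import time
import zlib

PATH = "yelp_academic_dataset_review.json"
CHUNK = 1 << 20  # 1 MiB reads so we measure hashing, not I/O bookkeeping

def timed(name, fn):
    start = time.perf_counter()
    digest = fn()
    print(f"{name:8s} {time.perf_counter() - start:6.2f}s  {digest}")

def stdlib_hash(factory):
    h = factory()
    with open(PATH, "rb") as f:
        while chunk := f.read(CHUNK):
            h.update(chunk)
    return h.hexdigest()

def crc32_file():
    value = 0
    with open(PATH, "rb") as f:
        while chunk := f.read(CHUNK):
            value = zlib.crc32(chunk, value)
    return f"{value:08x}"

timed("md5", lambda: stdlib_hash(hashlib.md5))
timed("sha256", lambda: stdlib_hash(hashlib.sha256))
timed("crc32", crc32_file)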
Faster Choice
Try xxHash. It is blazing fast, high quality, and the XXH64 variant is usually sufficient for most data integrity checks. It even performs well on small data inputs, where XXH3 is particularly impressive. If you find yourself needing 128 bits of entropy (e.g. if you’re hashing data for a DHT) you can either use the 128 bit variant or xxh3. If you only have an old version available that doesn’t support the new 128 bit variant, two XXH64 runs with different seeds is usually still faster than any other choice.
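For reference, hashing the dataset with xxHash from Python might look something like the sketch below; it assumes the third-party xxhash package (pip install xxhash), whose streaming API mirrors hashlib.

import xxhash  # third-party bindings; pip install xxhash

PATH = "yelp_academic_dataset_review.json"

h64 = xxhash.xxh64(seed=0)    # plenty for most integrity checks
h128 = xxhash.xxh3_128()      # if you want 128 bits of output

with open(PATH, "rb") as f:
    while chunk := f.read(1 << 20):  # 1 MiB chunks, as in the earlier sketch
        h64.update(chunk)
        h128.update(chunk)

print("XXH64   ", h64.hexdigest())
print("XXH3-128", h128.hexdigest())

The same package also ships xxh32 and xxh3_64 variants if 64 bits of output is enough but you want XXH3’s small-input performance.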
Untrusted Data Hashes
While most of the time the threat model for data transfer is “bugs / cosmic
rays”, some of the time people want to defend against bad actors. That’s
where cryptographic hashes come in, most commonly:
- md5: It’s slow and not resistant to collisions … everything I love in a hash function.
- sha1: This hash is at least somewhat performant and isn’t as completely broken as md5. We should probably still consider it on shaky ground though.
- sha2 (specifically sha256): A ubiquitous hash, and unlike md5 and sha1 it is still considered secure, so that is nice. Unfortunately this hash is noticeably slow when used in performance critical applications.
- sha3: The latest hash that NIST standardized (because they had concerns about SHA-2 that appear at this time somewhat unfounded). I was not able to find it packaged outside of openssl and it doesn’t appear to have great cli or language support at this time. It’s also still pretty slow.
Faster Choice
Try BLAKE3. Yes, it is a new (2020) algorithm and there are concerns about security margin, but I’m just so tired of waiting for sha256. In practice, this is probably a much better choice than the known-to-be-completely-broken md5, so if you’re reaching for md5 over xxHash because you need a cryptographic alternative, consider blake3 instead. Also, blake3 uses hash trees (Merkle trees), which are wonderful when implemented correctly, and I wish more systems used them.
The one major downside of blake3 in my opinion is that at this time (2021) I only know of really good cli, Rust, Go, C, and Python implementations, and I can’t personally vouch for the Java JNI bindings. I’ve only needed to use it so far for streaming pipeline verification or artifact verification, so the cli and python implementations are good enough for me.
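For the artifact-verification case, the Python bindings look almost exactly like hashlib. Here is a minimal sketch, assuming the blake3 package from PyPI (pip install blake3):

from blake3 import blake3  # third-party bindings; pip install blake3

PATH = "yelp_academic_dataset_review.json"

hasher = blake3()
with open(PATH, "rb") as f:
    while chunk := f.read(1 << 20):  # 1 MiB chunks
        hasher.update(chunk)

print("BLAKE3", hasher.hexdigest())

From the shell, the b3sum CLI gives you the same digest if you’d rather skip Python entirely.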
A quick reminder that there are lots of security-sensitive hashing situations
where you don’t want a fast hash. For example, one situation where you want an
intentionally