CRC32 is a checksum first proposed in 1961, and now used in a wide variety of performance-sensitive contexts, from file formats (zip, png, gzip) to filesystems (ext4, btrfs) and protocols (like Ethernet and SATA). So, naturally, a lot of effort has gone into optimising it over the years. However, I discovered a simple update to a widely used technique that makes it possible to run twice as fast as existing solutions on the Apple M1.
Searching for the state of the art, I found a lot of outdated posts, which is unsurprising for a sixty-year-old problem. Eventually I found a MySQL blog post from November 2021 that presents the following graph, including M1 figures, and gives us some idea that 30GB/s is considered fast:

In fact, in my own testing of the zlib crc32 function, I saw that it performs at around 30GB/s on the M1, so a little better than the graph, which is promising. Possibly that version has been optimised by Apple?
I wanted to try to implement my own version. So, I started at the obvious place, with a special ARM64 instruction designed for calculating CRC32 checksums: CRC32X. This instruction folds 8 bytes of input into the running checksum, with a latency of 3 cycles. So, theoretically, if every CRC32X has to wait for the result of the previous one, we could get 3.2GHz / 3 * 8B = 8.5GB/s. On the other hand, CRC32X has a throughput of one per cycle, so supposing we can avoid being latency bound (e.g. by calculating the CRCs of several chunks independently, and then combining them) we could get 3.2GHz / 1 * 8B = 25.6GB/s. That’s maybe a little better than the numbers in the MySQL chart, but this is a theoretical best case, not accounting for the overhead of combining the results.
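To make the "calculate in chunks, then combine" idea concrete, here is a minimal Python sketch. It is not the code from this post, and it says nothing about speed: it only shows that CRCs of independent chunks can be merged into the CRC of the whole buffer. The names `crc32_combine` and `crc32_chunked` and the `lanes` parameter are my own, chosen for illustration. zlib's C API has a real `crc32_combine`, but Python's `zlib` module does not expose it, so this version does the merge the slow way, by feeding zero bytes through `zlib.crc32`.

```python
import zlib


def crc32_combine(crc_a: int, crc_b: int, len_b: int) -> int:
    """Return crc32(A + B) given crc32(A), crc32(B) and len(B).

    Illustrative only: this costs O(len_b), whereas a real
    implementation performs the same shift in O(log len_b).
    """
    zeros = bytes(len_b)
    # Running len_b zero bytes through the CRC, seeded with crc32(A),
    # shifts A's contribution to where it would sit if B followed it.
    shifted_a = zlib.crc32(zeros, crc_a)
    # crc32(zeros) removes the initial-value/final-XOR terms that
    # would otherwise be counted twice.
    return shifted_a ^ crc_b ^ zlib.crc32(zeros)


def crc32_chunked(data: bytes, lanes: int = 3) -> int:
    """CRC `data` as up to `lanes` independent chunks, then merge them."""
    size = max(1, -(-len(data) // lanes))  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # These CRCs do not depend on each other, which is what would let
    # hardware overlap several 3-cycle CRC32X dependency chains.
    partials = [zlib.crc32(chunk) for chunk in chunks]
    crc = 0  # crc32 of the empty string
    for chunk, partial in zip(chunks, partials):
        crc = crc32_combine(crc, partial, len(chunk))
    return crc
```

For any input, `crc32_chunked(data)` should agree with `zlib.crc32(data)`. The combine step is exactly the overhead that the 25.6GB/s estimate leaves out.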
So, can we do better than CRC32X? The M1 can run eight instructions per cycle, and our best idea so far only runs at one instruction p