We’ve released version 0.4.2 of zlib-rs, featuring a number of substantial performance improvements. We are now (to our knowledge) the fastest API-compatible zlib implementation for decompression, and we beat the competition in the most important compression cases too.
We’ve built a dashboard that shows the performance of the current main branch compared to other implementations, and tracks our performance over time to catch any regressions and visualize our progress.
This post compares zlib-rs to the latest zlib-ng and, for decompression, also to zlib-chromium. These are the leading C zlib implementations that focus on performance. Here we only cover the most impactful changes, and briefly; a follow-up blog post will go into more technical detail.
Decompression
Last time, we benchmarked using the target-cpu=native flag. That gave the best results for our implementation, but was not entirely fair: our Rust implementation could assume that certain SIMD capabilities would be available, while zlib-ng had to check for them at runtime.
We have now made some changes so that we, too, can efficiently select the optimal implementation at runtime.
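As an illustration of what such a runtime check looks like (a minimal sketch, not code from zlib-rs), the Rust standard library exposes CPU feature detection on x86_64:

```rust
// Probe the running CPU for AVX2 support. The first call executes
// CPUID and the result is cached by the standard library, but the
// branch itself is still overhead we want to keep out of hot loops.
#[cfg(target_arch = "x86_64")]
fn describe_cpu() {
    if std::arch::is_x86_feature_detected!("avx2") {
        println!("AVX2 available: a SIMD implementation can be selected");
    } else {
        println!("no AVX2: fall back to the baseline implementation");
    }
}

fn main() {
    #[cfg(target_arch = "x86_64")]
    describe_cpu();
}
```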
Multiversioning
Picking the best version of a function is known as multiversioning. We have a baseline implementation that works on all CPUs, and then some number of specialized versions that use SIMD instructions or other features that may or may not be available on a particular CPU. The challenge is to always pick the optimal implementation, but with minimal runtime cost. That means we want to do the runtime check as few times as possible, and then perform a large chunk of work.
Today, multiversioning is not natively supported in Rust. There are proposals for adding it (which we’re very excited about!), but for now we have to implement it manually, which unfortunately involves some unsafe code. We’ll write more about this soon (for the impatient, the relevant code is here).
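To make the pattern concrete, here is a hedged sketch of manual multiversioning (not the zlib-rs code; the `sum*` functions are illustrative names): resolve the best version once, then make every subsequent call a plain indirect call so the feature check stays out of the hot path.

```rust
use std::sync::OnceLock;

// One signature shared by the baseline and the specialized versions.
type SumFn = fn(&[u8]) -> u64;

// Baseline: works on every CPU.
fn sum_baseline(data: &[u8]) -> u64 {
    data.iter().map(|&b| b as u64).sum()
}

// Specialized version: with AVX2 enabled, the compiler may vectorize this.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(data: &[u8]) -> u64 {
    data.iter().map(|&b| b as u64).sum()
}

// Safe wrapper so both versions share the `SumFn` signature.
#[cfg(target_arch = "x86_64")]
fn sum_avx2_checked(data: &[u8]) -> u64 {
    // SAFETY: only selected below, after the AVX2 runtime check passed.
    unsafe { sum_avx2(data) }
}

// Resolve the optimal version exactly once, then do a large chunk of
// work per call through the cached function pointer.
fn sum(data: &[u8]) -> u64 {
    static SELECTED: OnceLock<SumFn> = OnceLock::new();
    SELECTED.get_or_init(|| {
        #[cfg(target_arch = "x86_64")]
        {
            if std::arch::is_x86_feature_detected!("avx2") {
                return sum_avx2_checked;
            }
        }
        sum_baseline
    })(data)
}

fn main() {
    println!("{}", sum(b"hello multiversioning"));
}
```

The `unsafe` is confined to the wrapper: calling a `#[target_feature]` function is only sound after the corresponding runtime check, which is exactly why this currently can’t be expressed in fully safe Rust.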
DFA optimizations
The C code can use implicit switch fallthroughs to generate very efficient code. Rust has no equivalent of this mechanism, and that really slowed us down when data comes in small chunks.
Nikita Popov suggested we try the -Cllvm-args=-enable-dfa-jump-thread option, which recovers most of the performance here. It performs a kind of jump threading for deterministic finite automata, and our decompression logic matches this pattern.
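A toy example of the shape that benefits (a sketch of the pattern, not our actual decompression code): a loop whose next state is statically known on most edges, so the optimizer can thread a "state A goes to state B" jump directly instead of going back through the dispatch on every iteration.

```rust
#[derive(Clone, Copy)]
enum State {
    Header,
    Payload,
    Checksum,
}

fn run(input: &[u8]) -> u32 {
    let mut state = State::Header;
    let mut acc = 0u32;
    for &byte in input {
        state = match state {
            // On most edges the successor state is a constant, which is
            // the pattern DFA jump threading can exploit.
            State::Header => State::Payload,
            State::Payload => {
                acc = acc.wrapping_add(byte as u32);
                if byte == 0 { State::Checksum } else { State::Payload }
            }
            State::Checksum => State::Header,
        };
    }
    acc
}

fn main() {
    // Build with: RUSTFLAGS="-Cllvm-args=-enable-dfa-jump-thread" cargo build --release
    println!("{}", run(b"example input\0more"));
}
```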
LLVM does not currently enable this flag by default, but that is the plan eventually. We’re also looking into supporting this optimization in rustc itself, and making it more fine-grained than just blindly applying it to a whole project and hoping for the best.
These efforts are part of a proposed project goal and Trifecta Tech Foundation’s code generation initiative.
Benchmarks
As far as we know, we’re the fastest API-compatible zlib implementation for decompression today. Not only do we beat zlib-ng by a fair margin, we’re also faster than the implementation used in Chromium.
Like before, our benchmark decompresses a compressed version of silesia-small.tar, feeding the state machine the input in power-of-2 sized chunks. Small chunk sizes simulate the streaming use case; larger chunk sizes model cases where the full input is available in memory at once.
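For readers who want to reproduce the shape of this benchmark, here is a hedged sketch (not our actual harness) using the flate2 crate, which can be backed by zlib-rs via its zlib-rs feature; `decompress_chunked` and its parameters are illustrative names of our own:

```rust
use flate2::{Decompress, FlushDecompress, Status};

// Feed a zlib stream to the decompressor in fixed-size chunks, the way
// the benchmark drives the state machine. Small chunks simulate
// streaming; a chunk as large as the input models the in-memory case.
fn decompress_chunked(compressed: &[u8], chunk_size: usize) -> Vec<u8> {
    let mut decoder = Decompress::new(true); // expect a zlib header
    let mut output = vec![0u8; 64 * 1024];
    let mut result = Vec::new();

    for chunk in compressed.chunks(chunk_size) {
        let mut consumed = 0;
        while consumed < chunk.len() {
            let before_in = decoder.total_in();
            let before_out = decoder.total_out();
            let status = decoder
                .decompress(&chunk[consumed..], &mut output, FlushDecompress::None)
                .expect("invalid zlib stream");
            consumed += (decoder.total_in() - before_in) as usize;
            let produced = (decoder.total_out() - before_out) as usize;
            result.extend_from_slice(&output[..produced]);
            if matches!(status, Status::StreamEnd) {
                return result;
            }
            // BufError means no progress is possible with the data we
            // have; move on to the next input chunk.
            if matches!(status, Status::BufError) {
                break;
            }
        }
    }
    result
}
```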
9 Comments
IshKebab
It's barely faster. I would say it's more accurate to say it's as fast as C, which is still a great achievement.
johnisgood
"faster than C" almost always boils down to different designs, implementations, algorithms, etc.
Perhaps it is faster than already-existing implementations, sure, but not "faster than C", and it is odd to make such claims.
kahlonel
You mean the implementation is faster than the one in C. Because nothing is “faster than C”.
YZF
I found out I already know Rust.
Kidding aside, I thought the purpose of Rust was for safety but the keyword unsafe is sprinkled liberally throughout this library. At what point does it really stop mattering if this is C or Rust?
Presumably with inline assembly both languages can emit what is effectively the same machine code. Is the Rust compiler a better optimizing compiler than C compilers?
cb321
I think this may not be a very high bar. zippy in Nim claims to be about 1.5x to 2.0x faster than zlib: https://github.com/guzba/zippy I think there are also faster zlibs around in C than the standard install one, such as https://github.com/ebiggers/libdeflate (EDIT: also mentioned elsethread https://news.ycombinator.com/item?id=43381768 by mananaysiempre)
zlib itself seems pretty antiquated/outdated these days, but it does remain popular, even as a basis for newer parallel-friendly formats such as https://www.htslib.org/doc/bgzip.html
jrockway
Chromium is kind of stuck with zlib because it's the algorithm that's in the standards, but if you're making your own protocol, you can do even better than this by picking a better algorithm. Zstandard is faster and compresses better. LZ4 is much faster, but not quite as small.
Some reading: https://jolynch.github.io/posts/use_fast_data_algorithms/
(As an aside, at my last job container pushes / pulls were in the development critical path for a lot of workflows. It turns out that sha256 and gzip are responsible for a lot of the time spent during container startup. Fortunately, Zstandard is allowed, and blake3 digests will be allowed soon.)
amorio2341
Not surprised at all, Rust is the future.
akagusu
Bravo. Now Rust has its existence justified.