In this post, we will explore how Unix pipes are implemented in Linux by iteratively optimizing a test program that writes and reads data through a pipe.[^1]
We will begin with a simple program with a throughput of around 3.5GiB/s, and improve its performance twentyfold. The improvements will be informed by profiling the program using Linux's `perf` tooling.[^2] The code is available on GitHub.
The post was inspired by reading a highly optimized FizzBuzz program, which pushes output to a pipe at a rate of ~35GiB/s on my laptop.[^3] Our first goal will be to match that speed, explaining every step as we go along. We'll also add an additional performance-improving measure, which is not needed in FizzBuzz since the bottleneck is actually computing the output, not IO, at least on my machine.
We will proceed as follows:
- A first slow version of our pipe test bench;
- How pipes are implemented internally, and why writing and reading from them is slow;
- How the `vmsplice` and `splice` syscalls let us get around some (but not all!) of the slowness;
- A description of Linux paging, leading up to a faster version using huge pages;
- The final optimization, replacing polling with busy looping;
- Some closing thoughts.
Section 4 is the heaviest on Linux kernel internals, so it might be interesting even if you're already familiar with the other topics covered in the post. For readers new to these topics, only basic knowledge of C is assumed.
Let’s begin!
The challenge, and a slow first version #
First of all, let’s start with measuring the performance of the fabled FizzBuzz program, following the rules laid down by the StackOverflow post:
```
% ./fizzbuzz | pv >/dev/null
 422GiB 0:00:16 [36.2GiB/s]
```
`pv` is "pipe viewer", a handy utility to measure the throughput of data flowing through a pipe. So `fizzbuzz` is producing output at a rate of 36GiB/s.

`fizzbuzz` writes the output in blocks as big as the L2 cache, to strike a good balance between cheap access to memory and minimizing IO overhead.

On my machine, the L2 cache is 256KiB. Throughout this post, we'll also output blocks of 256KiB, but without "computing" anything. Essentially, we'll try to measure the upper bound for programs writing to a pipe with a reasonable buffer size.[^4]
While `fizzbuzz` uses `pv` to measure speed, our setup will be slightly different: we'll implement the programs on both ends of the pipe. This is so that we fully control the code involved in pushing and pulling data from the pipe.

The code is available in my `pipes-speed-test` repo. `write.cpp` implements the writing, and `read.cpp` the reading. `write` repeatedly writes the same 256KiB buffer forever. `read` reads through 10GiB of data and terminates, printing the throughput in GiB/s. Both executables accept a variety of command-line options to change their behavior.
The first attempt at reading and writing from pipes will use the `write` and `read` syscalls, with the same buffer size as `fizzbuzz`. Here's a view of the writing end:
int main() {
size_t buf_size = 1 << 18; // 256KiB