Hi, it’s us again. You might remember us from when we made significant performance-related changes to wireguard-go, the userspace WireGuard® implementation that Tailscale uses. We’re releasing a set of changes that further improves client throughput on Linux. We intend to upstream these changes to WireGuard as we did with the previous set of changes, which have since landed upstream.
With this new set of changes, Tailscale joins the 10Gb/s club on bare metal Linux, and wireguard-go pushes past (for now) the in-kernel WireGuard implementation on that hardware. How did we do it? Through UDP segmentation offload and checksum optimizations. You can experience these improvements in the current unstable Tailscale client release, and also in Tailscale v1.40, available in the coming days. Continue reading to learn more, or jump down to the Results section if you just want numbers.
Background
The data plane in Tailscale is built atop wireguard-go, a userspace WireGuard implementation written in Go. wireguard-go acts as a pipeline, receiving packets from the operating system via a TUN interface. It encrypts them, assuming a valid peer exists for their addressed destination, and sends them to a remote peer via a UDP socket. The flow in the opposite direction is similar. Packets from valid peers are decrypted after being read from a UDP socket, then are written back to the kernel’s TUN interface driver.
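To make the shape of that pipeline concrete, here is a minimal sketch of the transmit half in Go. It is not wireguard-go's actual code: the txLoop name, the single pre-established peer, and the bare AEAD encryption (with none of WireGuard's handshakes or message framing) are simplifications for illustration only.

```go
package sketch

import (
	"crypto/cipher"
	"encoding/binary"
	"io"
	"net"
)

// txLoop sketches the transmit half of the pipeline: read a plaintext IP
// packet from the TUN device, encrypt it for the (single, already
// established) peer, and send the ciphertext over UDP. Peer lookup,
// handshakes, and the real WireGuard message format are all omitted.
func txLoop(tun io.Reader, conn *net.UDPConn, peer *net.UDPAddr, aead cipher.AEAD) error {
	buf := make([]byte, 65535)
	nonce := make([]byte, aead.NonceSize()) // WireGuard uses a little-endian counter nonce
	var counter uint64
	for {
		n, err := tun.Read(buf) // one plaintext IP packet from the kernel
		if err != nil {
			return err
		}
		binary.LittleEndian.PutUint64(nonce[len(nonce)-8:], counter)
		counter++
		ct := aead.Seal(nil, nonce, buf[:n], nil) // encrypt the packet payload
		if _, err := conn.WriteToUDP(ct, peer); err != nil {
			return err
		}
	}
}
```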
The changes we made in v1.36 modified this pipeline, enabling packet vectors to flow end-to-end, rather than single packets. The techniques applied on both ends of the pipeline reduced the number of system calls per packet, and on the TUN side they reduced the cost of moving a packet through the kernel networking stack.
This greatly improved throughput, and we have continued to build upon it with the changes we describe in this post.
Baseline
Disclaimer about benchmarks: This post contains benchmarks! These benchmarks are reproducible at the time of writing, and we provide details about the environments we ran them in. But benchmark results tend to vary across environments, and they also tend to go stale as time progresses. Your mileage may vary.
Before getting into the details of what we changed, we need to record some baselines for later comparison. These benchmarks are conducted using iperf3, as single stream TCP tests, with cubic congestion control. All hosts are running Ubuntu 22.04 with the latest available Linux kernel for that distribution.
We baselined throughput for wireguard-go@052af4a and in-kernel WireGuard. These tests were conducted between two pairs of hosts:
- 2 x AWS c6i.8xlarge instance types
- 2 x “bare metal” servers powered by i5-12400 CPUs & Mellanox MCX512A-ACAT NICs
For consistency, the c6i.8xlarge instance type is the same one we used in the previous blog post. The instances are in the same region and availability zone:
ubuntu@c6i-8xlarge-1:~$ ec2metadata | grep -E 'instance-type:|availability-zone:'
availability-zone: us-east-2b
instance-type: c6i.8xlarge
ubuntu@c6i-8xlarge-2:~$ ec2metadata | grep -E 'instance-type:|availability-zone:'
availability-zone: us-east-2b
instance-type: c6i.8xlarge
ubuntu@c6i-8xlarge-1:~$ ping 172.31.23.111 -c 5 -q
PING 172.31.23.111 (172.31.23.111) 56(84) bytes of data.
--- 172.31.23.111 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4094ms
rtt min/avg/max/mdev = 0.109/0.126/0.168/0.022 ms
We’ve added the i5-12400 systems for a bare metal comparison with interfaces operating above 10Gb/s. The i5-12400 CPU is a modern (released Q1 2022) desktop-class chip, available for $183 USD at the time of writing. The Mellanox NICs are connected at 25Gb/s via a direct attach copper (DAC) cable:
jwhited@i5-12400-1:~$ lscpu | grep Model.name && cpupower frequency-info -d && cpupower frequency-info -p
Model name: 12th Gen Intel(R) Core(TM) i5-12400
analyzing CPU 0:
driver: intel_pstate
analyzing CPU 0:
current policy: frequency should be within 800 MHz and 5.60 GHz.
The governor "performance" may decide which speed to use
within this range.
jwhited@i5-12400-1:~$ sudo ethtool enp1s0f0np0 | grep Speed && sudo ethtool -i enp1s0f0np0 | egrep 'driver|^version'
Speed: 25000Mb/s
driver: mlx5_core
version: 5.15.0-69-generic
jwhited@i5-12400-2:~$ lscpu | grep Model.name && cpupower frequency-info -d && cpupower frequency-info -p
Model name: 12th Gen Intel(R) Core(TM) i5-12400
analyzing CPU 0:
driver: intel_pstate
analyzing CPU 0:
current policy: frequency should be within 800 MHz and 5.60 GHz.
The governor "performance" may decide which speed to use
within this range.
jwhited@i5-12400-2:~$ sudo ethtool enp1s0f0np0 | grep Speed && sudo ethtool -i enp1s0f0np0 | egrep 'driver|^version'
Speed: 25000Mb/s
driver: mlx5_core
version: 5.15.0-69-generic
jwhited@i5-12400-1:~$ ping 10.0.0.20 -c 5 -q
PING 10.0.0.20 (10.0.0.20) 56(84) bytes of data.
--- 10.0.0.20 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4078ms
rtt min/avg/max/mdev = 0.008/0.035/0.142/0.053 ms
Now for the iperf3 baseline tests.
c6i.8xlarge over in-kernel WireGuard:
ubuntu@c6i-8xlarge-1:~$ iperf3 -i 0 -c c6i-8xlarge-2-wg -t 10 -C cubic -V
iperf 3.9
Linux c6i-8xlarge-1 5.19.0-1022-aws #23~22.04.1-Ubuntu SMP Fri Mar 17 15:38:24 UTC 2023 x86_64
Control connection MSS 1368
Time: Wed, 12 Apr 2023 23:56:53 GMT
Connecting to host c6i-8xlarge-2-wg, port 5201
Cookie: 3jzl3sa34hkbpwbmg4dbfh6aovbknnw7x5hn
TCP MSS: 1368 (default)
[ 5] local 10.9.9.1 port 51194 connected to 10.9.9.2 port 5201
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 10 second test, tos 0
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-10.00 sec 3.11 GBytes 2.67 Gbits/sec 51 1.00 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 3.11 GBytes 2.67 Gbits/sec 51 sender
[ 5] 0.00-10.05 sec 3.11 GBytes 2.66 Gbits/sec receiver
CPU Utilization: local/sender 5.1% (0.3%u/4.8%s), remote/receiver 11.2% (0.2%u/11.0%s)
snd_tcp_congestion cubic
rcv_tcp_congestion cubic
c6i.8xlarge over wireguard-go@052af4a:
ubuntu@c6i-8xlarge-1:~$ iperf3 -i 0 -c c6i-8xlarge-2-wg -t 10 -C cubic -V
iperf 3.9
Linux c6i-8xlarge-1 5.19.0-1022-aws #23~22.04.1-Ubuntu SMP Fri Mar 17 15:38:24 UTC 2023 x86_64
Control connection MSS 1368
Time: Wed, 12 Apr 2023 23:55:42 GMT
Connecting to host c6i-8xlarge-2-wg, port 5201
Cookie: zlcrq3xqyr6cfmrtysrm42xcg3bbjzir3qob
TCP MSS: 1368 (default)
[ 5] local 10.9.9.1 port 54410 connected to 10.9.9.2 port 5201
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 10 second test, tos 0
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-10.00 sec 6.21 GBytes 5.34 Gbits/sec 0 3.15 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 6.21 GBytes 5.34 Gbits/sec 0 sender
[ 5] 0.00-10.04 sec 6.21 GBytes 5.31 Gbits/sec receiver
CPU Utilization: local/sender 8.6% (0.2%u/8.4%s), remote/receiver 11.8% (0.6%u/11.2%s)
snd_tcp_congestion cubic
rcv_tcp_congestion cubic
i5-12400 over in-kernel WireGuard:
jwhited@i5-12400-1:~$ iperf3 -i 0 -c i5-12400-2-wg -t 10 -C cubic -V
iperf 3.9
Linux i5-12400-1 5.15.0-69-generic #76-Ubuntu SMP Fri Mar 17 17:19:29 UTC 2023 x86_64
Control connection MSS 1368
Time: Wed, 12 Apr 2023 23:41:44 GMT
Connecting to host i5-12400-2-wg, port 5201
Cookie: hqkn7s3scipxku5rzpcgqt4rakutkpwybtvx
TCP MSS: 1368 (default)
[ 5] local 10.9.9.1 port 48564 connected to 10.9.9.2 port 5201
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 10 second test, tos 0
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-10.00 sec 13.7 GBytes 11.8 Gbits/sec 8725 753 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 13.7 GBytes 11.8 Gbits/sec 8725 sender
[ 5] 0.00-10.04 sec 13.7 GBytes 11.7 Gbits/sec receiver
CPU Utilization: local/sender 26.3% (0.1%u/26.2%s), remote/receiver 17.4% (0.5%u/16.9%s)
snd_tcp_congestion cubic
rcv_tcp_congestion cubic
i5-12400 over wireguard-go@052af4a:
jwhited@i5-12400-1:~$ iperf3 -i 0 -c i5-12400-2-wg -t 10 -C cubic -V
iperf 3.9
Linux i5-12400-1 5.15.0-69-generic #76-Ubuntu SMP Fri Mar 17 17:19:29 UTC 2023 x86_64
Control connection MSS 1368
Time: Wed, 12 Apr 2023 23:39:22 GMT
Connecting to host i5-12400-2-wg, port 5201
Cookie: ohzzlzkcvnk45ya32vm75ezir6njydqwipkl
TCP MSS: 1368 (default)
[ 5] local 10.9.9.1 port 52486 connected to 10.9.9.2 port 5201
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 10 second test, tos 0
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-10.00 sec 9.74 GBytes 8.36 Gbits/sec 507 3.01 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 9.74 GBytes 8.36 Gbits/sec 507 sender
[ 5] 0.00-10.05 sec 9.74 GBytes 8.32 Gbits/sec receiver
CPU Utilization: local/sender 11.7% (0.1%u/11.6%s), remote/receiver 6.5% (0.2%u/6.3%s)
snd_tcp_congestion cubic
rcv_tcp_congestion cubic
With the baselines captured, let’s look at some profiling data to understand where we may be bottlenecked.
Linux perf and flame graphs
The flame graphs below were rendered from perf data. They represent the amount of CPU time spent for a given function/stack. The wider the function, the more expensive it (and/or its children) are. These are interactive; you can click to zoom and hover to see percentages.
This first graph is from the iperf3 sender:
Notably, more time is being spent sending UDP packets than encrypting their payloads. Let’s take a look at the receiver:
The receiver looks fairly similar, with UDP reception being nearly equal in time spent relative to decryption.
We are using the {send,recv}mmsg() (two m's) system calls, which help to amortize the cost of making a syscall across a batch of packets. However, on the kernel side of the system call boundary, {send,recv}mmsg() simply iterates through the batch, calling into {send,recv}msg() (single m) for each message. This means that we still pay the cost of traversing the kernel networking stack for every single packet.
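For reference, here is a minimal sketch, not wireguard-go's actual conn package, of what this batching looks like from Go using golang.org/x/net/ipv4, whose WriteBatch maps to sendmmsg(2) on Linux. The sendBatch helper and its parameters are hypothetical names for illustration.

```go
package sketch

import (
	"net"

	"golang.org/x/net/ipv4"
)

// sendBatch hands a whole slice of packets to the kernel per system call via
// sendmmsg(2). The kernel still builds and routes each datagram individually.
func sendBatch(c *net.UDPConn, dst *net.UDPAddr, packets [][]byte) error {
	pc := ipv4.NewPacketConn(c)
	msgs := make([]ipv4.Message, len(packets))
	for i, p := range packets {
		msgs[i] = ipv4.Message{Buffers: [][]byte{p}, Addr: dst}
	}
	for len(msgs) > 0 {
		n, err := pc.WriteBatch(msgs, 0) // sendmmsg under the hood on Linux
		if err != nil {
			return err
		}
		msgs = msgs[n:] // WriteBatch may send fewer than len(msgs) messages
	}
	return nil
}
```

Even with the whole slice handed over in one system call, the kernel still traverses its networking stack once per datagram, which is exactly the per-packet cost the offloads below avoid.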
On the TUN side of wireguard-go, we make use of TCP segmentation offload (TSO) and generic receive offload (GRO), which enable multiple TCP segments to pass through the kernel stack as a single segment:
What we need is something similar, but for UDP. Enter UDP generic segmentation offload.
UDP generic segmentation offload (GSO)
UDP GSO enables the kernel to delay segmentation of a batch of UDP datagrams in a similar fashion to the TCP variant, reducing the CPU cycles per byte cost of traversing the networking stack. Linux support was authored by Willem de Bruijn and introduced into the kernel in v4.18. UDP GSO was propelled by the adoption of QUIC in the datacenter, but its benefits are not limited to QUIC. It is best described by part of its summary commit message:
Segmentation offload reduces cycles/byte for large packets by amortizing the cost of protocol stack traversal.

This patchset implements GSO for UDP. A process can concatenate and submit multiple datagrams to the same destination in one send call by setting socket option SOL_UDP/UDP_SEGMENT with the segment size, or passing an analogous cmsg at send time.

The stack will send the entire large (up to network layer max size) datagram through the protocol layer. At the GSO layer, it is broken up in individual segments. All receive the same network layer header and UDP src and dst port. All but the last segment have the same UDP header, but the last may differ in length and checksum.
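To illustrate the socket-option variant the commit message mentions, here is a minimal sketch in Go using golang.org/x/sys/unix. It is not wireguard-go's implementation; enableUDPGSO and segSize are hypothetical names, and error handling is reduced to the essentials.

```go
package sketch

import (
	"net"

	"golang.org/x/sys/unix"
)

// enableUDPGSO sets the UDP_SEGMENT socket option so that one large write is
// split by the kernel's GSO layer (or the NIC, if it supports UDP
// segmentation in hardware) into segSize-byte UDP datagrams. A per-message
// cmsg carrying UDP_SEGMENT is the alternative when the segment size varies
// per send.
func enableUDPGSO(c *net.UDPConn, segSize int) error {
	raw, err := c.SyscallConn()
	if err != nil {
		return err
	}
	var serr error
	if err := raw.Control(func(fd uintptr) {
		serr = unix.SetsockoptInt(int(fd), unix.IPPROTO_UDP, unix.UDP_SEGMENT, segSize)
	}); err != nil {
		return err
	}
	return serr
}
```

With the option set, a single write of up to the network-layer maximum size crosses the protocol stack once and is only broken into segSize-byte datagrams at the GSO layer, or in the NIC itself.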
After implementing UDP GSO on the UDP socket side of wireguard-go, the transmit direction now looks like this:
But what about the receive path? It would be ideal to optimize both directions. Paolo Abeni authored UDP generic receive offload (GRO) support, and it was introduced into the Linux kernel in v5.0. With UDP GRO the receive direction now looks like this:
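As a rough sketch of how the receive side is wired up, again not wireguard-go's implementation: enable the UDP_GRO socket option, then read with enough control-message space to learn the segment size of a coalesced datagram. enableUDPGRO and readCoalesced are hypothetical helpers, and the native-endian read assumes Go 1.21+.

```go
package sketch

import (
	"encoding/binary"
	"net"

	"golang.org/x/sys/unix"
)

// enableUDPGRO asks the kernel to coalesce consecutive datagrams from the
// same flow into a single large datagram before delivering them to userspace.
func enableUDPGRO(c *net.UDPConn) error {
	raw, err := c.SyscallConn()
	if err != nil {
		return err
	}
	var serr error
	if err := raw.Control(func(fd uintptr) {
		serr = unix.SetsockoptInt(int(fd), unix.IPPROTO_UDP, unix.UDP_GRO, 1)
	}); err != nil {
		return err
	}
	return serr
}

// readCoalesced receives one (possibly coalesced) datagram plus a UDP_GRO
// control message announcing the size of each original on-wire segment.
func readCoalesced(c *net.UDPConn, buf, oob []byte) (n, segSize int, err error) {
	n, oobn, _, _, err := c.ReadMsgUDP(buf, oob)
	if err != nil {
		return 0, 0, err
	}
	segSize = n // a single datagram unless the kernel coalesced
	cmsgs, err := unix.ParseSocketControlMessage(oob[:oobn])
	if err != nil {
		return 0, 0, err
	}
	for _, m := range cmsgs {
		// SOL_UDP == IPPROTO_UDP on Linux; the kernel writes a native-endian int.
		if m.Header.Level == unix.IPPROTO_UDP && m.Header.Type == unix.UDP_GRO && len(m.Data) >= 4 {
			segSize = int(binary.NativeEndian.Uint32(m.Data))
		}
	}
	return n, segSize, nil
}
```

The caller then slices buf into segSize-sized packets, so a whole batch re-enters the pipeline as individual packets after a single recvmsg(2).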
Updates to the UDP man page for these new features eventually arrived, in which an important requirement for UDP GSO is described:
Segmentation offload depends on checksum offload, as datagram checksums are computed after segmentation.
Checksum offload is widely supported across ethernet devices today. It also reduces the cost of the kernel networking stack, as e