Disk I/O bottlenecks are easy to overlook when analyzing CI pipeline performance, but tools like iostat and fio can help shed light on what might be slowing down your pipelines more than you realize.
GitHub offers different hosted runners with a range of specs, but for this test we are using the default ubuntu-22.04 runner in a private repository, which gives us an additional 2 vCPUs but does not alter the disk performance.
How to monitor disk performance
Getting a baseline benchmark from a tool like fio is useful for comparing the relative disk performance of different runners. However, to investigate whether you are hitting disk I/O bottlenecks in your CI pipeline, it is more useful to monitor disk performance during pipeline execution.
We can use a tool like iostat to monitor the disk while installing dependencies from the cache to see how much we are saturating it.
- name: Start IOPS Monitoring
  run: |
    echo "Starting IOPS monitoring"
    # Start iostat in the background, logging extended device stats every second to iostat.log
    nohup iostat -dx 1 > iostat.log 2>&1 &
    echo $! > iostat_pid.txt # Save the iostat process ID so we can stop it later

- uses: actions/cache@v4
  timeout-minutes: 5
  id: cache-pnpm-store
  with:
    path: ${{ steps.get-store-path.outputs.STORE_PATH }}
    key: pnpm-store-${{ hashFiles('pnpm-lock.yaml') }}
    # restore-keys are matched in order, so the most specific key goes first
    restore-keys: |
      pnpm-store-${{ hashFiles('pnpm-lock.yaml') }}
      pnpm-store-

- name: Stop IOPS Monitoring
  run: |
    echo "Stopping IOPS monitoring"
    kill $(cat iostat_pid.txt)

- name: Save IOPS Data
  uses: actions/upload-artifact@v4
  with:
    name: iops-log
    path: iostat.log
Monitoring disk during untar of Next.js dependencies
In the above test, we used iostat to monitor disk performance while the cache action downloaded and untarred the dependencies for vercel/next.js:
Received 96468992 of 343934082 (28.0%), 91.1 MBs/sec
Received 281018368 of 343934082 (81.7%), 133.1 MBs/sec
Cache Size: ~328 MB (343934082 B)
/usr/bin/tar -xf /home/<path>/cache.tzst -P -C /home/<path>/gha-disk-benchmark --use-compress-program unzstd
Received 343934082 of 343934082 (100.0%), 108.8 MBs/sec
Cache restored successfully
The full step took 12s to complete, and we can estimate the download took around 3s, leaving 9s for the untar operation.
The compressed tarball is only about 328MB, but after extraction the total amount of data written to disk is about 1.6GB. The smaller compressed size gets the cache across the network quickly, and most CPUs can decompress fast enough that higher compression is often favorable. Once download and decompression are no longer the bottleneck, what remains is writing to disk.
Reading from a tarball is a fairly efficient process since it is mostly sequential reads; however, we then need to write each file to disk. This is where we can hit disk I/O bottlenecks, especially with a large number of small files.
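If you want to see where a figure like that 1.6GB comes from on your own runner, a small step like the sketch below reports the extracted size and file count. It assumes the same STORE_PATH output referenced in the cache step above; a high file count with a modest total size is a good hint that small-file writes, not raw throughput, will dominate the restore time.

```yaml
# A minimal sketch: report how much data the restored store contains and how
# many individual files it holds. Assumes the same STORE_PATH output used by
# the cache step shown earlier.
- name: Inspect restored cache
  run: |
    STORE="${{ steps.get-store-path.outputs.STORE_PATH }}"
    echo "Total size on disk:"
    du -sh "$STORE"
    echo "Number of files:"
    find "$STORE" -type f | wc -l
```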
It's important to note that this is just a single run, not an average. Running multiple tests over time will give you a much clearer picture of the overall performance. Variance between runs can be quite high, so an individual bad run doesn't necessarily indicate a problem.
What this run suggests is a possible throughput bottleneck. We're seeing spikes in the maximum total throughput, with most hovering around ~220MB/s. This is likely the maximum throughput we can achieve to this disk; we'll verify that next. We should continue to monitor this and compare it against other runners to find an ideal runner for our workflow. We'll use fio to double-check whether we are hitting the disk's maximum throughput.
An interesting aside before we move on: we can see from this side-by-side how few read operations there are relative to writes. Since we're reading from a tarball, most reads are sequential, which tends to be more efficient. That read data is likely buffered before being written to disk in a more random pattern as each file is recreated, which is why we see higher write IOPS than read IOPS.
Maximum disk throughput
One of the first optimizations developers usually make to their CI pipelines is caching dependencies. Even though the cache still gets uploaded and downloaded with each run, it speeds things up by packaging all your dependencies into one compressed file. This skips the hassle of resolving dependencies, avoids multiple potentially slow downloads, and cuts down on network delays.
But as we saw above, network speed isn’t usually our bottleneck when downloading the cache.
| Test Type | Block Size | Bandwidth |
|---|---|---|
| Read Throughput | 1024KiB | ~209MB/s |
| Write Throughput | 1024KiB | ~209MB/s |
Using fio to test our throughput, notice that read and write throughput are both capped at the same value. This is a fairly telling sign that the limitation here is not the disk itself, but rather a bandwidth limit imposed by GitHub. This is standard practice to divide resources among multiple users who may be accessing the same physical disk from their virtual machines. It isn't always documented, but most providers will have higher bandwidth limits on higher-tier runners.
What we measured here aligns fairly closely with the 220MB/s we saw in the untar test, giving us another hint that we are likely being slowed down during our dependency installation, not by the network or CPU, but by the disk.
Regardless of how fast our download speed is, we won’t be able to write to disk any faster than our max throughput to the disk.
Estimated time to write to disk: select a cache payload size and a throughput speed (interactive calculator).
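As a rough worked example using the numbers from this run: writing ~1.6GB of extracted dependencies at a ~209MB/s throughput cap takes roughly 1.6GB / 209MB/s, or about 7.7 seconds, which lines up with the ~9s we estimated for the untar step once you allow for decompression and small-file overhead.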
Realistically, your disk performance will vary greatly depending on your specific cache size, the number of files, and general build-to-build variance. That's why it's a good idea to monitor your CI runners to establish a consistent baseline, and we'll talk about testing your workflow on multiple runners for comparison.
Maximum IOPS (Input/Output Operations Per Second)
Once the cache tarball has been downloaded, it needs to be extracted. Depending on the compression level, this can be a CPU-intensive operation, but that isn't usually a problem. When untarring the dependencies, we perform a lot of small read and write operations, which is where we can hit disk I/O bottlenecks.
| Test Type | Block Size | IOPS |
|---|---|---|
| Read IOPS | 4096B | ~51K |
| Write IOPS | 4096B | |
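To reproduce a random-IOPS number like the read figure above, a 4k random-I/O fio job along these lines can be used. As with the throughput sketch, the queue depth, size, and runtime are assumptions rather than the exact parameters behind the table, and fio is assumed to already be installed as in the previous step.

```yaml
# A sketch of a random-IOPS test with fio. Queue depth, size, and runtime are
# assumptions; adjust for your runner.
- name: Disk IOPS benchmark
  run: |
    fio --name=iops-test \
        --rw=randread \
        --bs=4k \
        --size=1G \
        --runtime=30 \
        --time_based \
        --direct=1 \
        --ioengine=libaio \
        --iodepth=64 \
        --filename=fio-testfile
    rm -f fio-testfile
```

Swap `--rw=randread` for `--rw=randwrite` to measure the write side.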
9 Comments
ValdikSS
`apt` installation could be easily sped-up with `eatmydata`: `dpkg` calls `fsync()` on all the unpacked files, which is very slow on HDDs, and `eatmydata` hacks it out.
suryao
TLDR: disk is often the bottleneck in builds. Use 'fio' to get performance of the disk.
If you want to truly speed up builds by optimizing disk performance, there are no shortcuts to physically attaching NVMe storage with high throughput and high IOPS to your compute directly.
That's what we do at WarpBuild[0] and we outperform Depot runners handily. This is because we do not use network attached disks which come with relatively higher latency. Our runners are also coupled with faster processors.
I love the Depot content team though, it does a lot of heavy lifting.
[0] https://www.warpbuild.com
miohtama
If you can afford, upgrade your CI runners on GitHub to paid offering. Highly recommend, less drinking coffee, more instant unit test results. Pay as you go.
jacobwg
A list of fun things we've done for CI runners to improve CI:
– Configured a block-level in-memory disk accelerator / cache (fs operations at the speed of RAM!)
– Benchmarked EC2 instance types (m7a is the best x86 today, m8g is the best arm64)
– "Warming" the root EBS volume by accessing a set of priority blocks before the job starts to give the job full disk performance [0]
– Launching each runner instance in a public subnet with a public IP – the runner gets full throughput from AWS to the public internet, and IP-based rate limits rarely apply (Docker Hub)
– Configuring Docker with containerd/estargz support
– Just generally turning kernel options and unit files off that aren't needed
[0] https://docs.aws.amazon.com/ebs/latest/userguide/ebs-initial…
larusso
So I had to read to the end to realize it’s a kinda infomercial. Ok fair enough. Didn’t know what depot was though.
crmd
This is exactly the kind of content marketing I want to see. The IO bottleneck data and the fio scripts are useful to all. Then at the end a link to their product which I’d never heard of, in case you’re dealing with the issue at hand.
nodesocket
I just migrated multiple ARM64 GitHub action Docker builds from my self hosted runner (Raspberry Pi in my homelab) to Blacksmith.io and I'm really impressed with the performance so far. Only downside is no Docker layer and image cache like I had on my self hosted runner, but can't complain on the free tier.
kayson
Bummer there's no free tier. I've been bashing my head against an intermittent CI failure problem on Github runners for probably a couple years now. I think it's related to the networking stack in their runner image and the fact that I'm using docker in docker to unit test a docker firewall. While I do appreciate that someone at Github did actually look at my issue, they totally missed the point. https://github.com/actions/runner-images/issues/11786
Are there any reasonable alternatives for a really tiny FOSS project?
crohr
I'm maintaining a benchmark of various GitHub Actions providers regarding I/O speed [1]. Depot is not present because my account was blocked but would love to compare! The disk accelerator looks like a nice feature.
[1]: https://runs-on.com/benchmarks/github-actions-disk-performan…