April 8, 2022
This article was contributed by Neil Brown
The readahead code in the Linux kernel is nominally responsible for
reading data that has not yet been explicitly requested from storage,
with the idea that it might be needed soon. The code is stable, functional, widely
used, and uncontroversial, so it is reasonable to expect the code to be of
high quality, and largely this is true. Recently, I found the need to
document this code, which naturally shone a rather different light on
it. This work revealed minor problems with functionality and significant
problems with naming.
My particular reason for wanting documentation probably colors my view
of the code so I’ll start there. Once upon a time, Linux had a strong
concept of “congestion” as it applied to I/O paths. If the queue of
requests to some device grew too large, the backing device would be
marked as “congested” and certain optional I/O requests would be skipped
or delayed, particularly writeback and readahead. As time has passed,
so too (apparently) has the need for congestion management. Maybe this
is because many I/O devices are now faster than our CPUs but, whatever
the reason, the block layer no longer tracks congestion and only a few
virtual “backing devices” continue this outdated practice.
In Linux 5.16, the only backing device that gets marked as “read
congested” is the virtual device
used for FUSE filesystems. As part of a project to remove all remnants
of congestion tracking, I proposed that there was really nothing special about
FUSE, and it should just accept all readahead requests just like
everyone else. Miklos Szeredi, the maintainer of FUSE, found my
reasoning to be unsatisfactory — and who could blame him? If
FUSE doesn’t want readahead requests, it shouldn’t have to accept them.
Trying to understand how FUSE could safely say “no” to readahead,
without having to maintain the congestion-tracking functionality in
common code, started me on the path to understanding readahead, once it
was explained to me that it wasn’t as simple as just changing the
“readahead” callback in FUSE to return zero.
The main part of the API exported by mm/readahead.c is two functions:
page_cache_sync_ra() and page_cache_async_ra(). This functionality is
also available with a slightly simpler interface as
page_cache_sync_readahead() and page_cache_async_readahead(), which are
nicely documented in the kernel documentation.
Sync and async
Unfortunately, that documentation is not explicit on how the “sync” or
“async” in the names are relevant. Clarifying this was among my
first tasks so, to help with that clarification, I’ll refer you to a
selection from my new documentation, which was merged for the 5.18
release. It starts:
Readahead is used to read content into the page cache before it is
explicitly requested by the application. Readahead only ever attempts
to read pages that are not yet in the page cache. If a page is present
but not up-to-date, readahead will not try to read it. In that case a
simple ->readpage() will be requested.
Readahead is triggered when an application read request (whether a
system call or a page fault) finds that the requested page is not in the
page cache, or that it is in the page cache and has the PG_readahead
flag set. This flag indicates that the page was loaded as part of a
previous readahead request and now that it has been accessed, it is
time for the next read-ahead.
Each readahead request is partly synchronous read, and partly async
readahead.
We stop here, in mid-paragraph, to focus on those two terms: sync and async.
Readahead is, by its nature, asynchronous — nothing is waiting for it.
An explicitly requested read, instead, will ultimately be synchronous, as the
operation cannot complete until the data arrives. These two modes are
clearly related and handling them both in the same code makes sense.
Describing them both as being “readahead” — a choice that was
effectively forced on me by the code — is not so defensible.
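The trigger conditions quoted from the documentation above can be sketched as a small user-space model. This is only an illustration, not kernel code: struct model_page, enum ra_action, and ra_trigger() are hypothetical names invented here, and the readahead_flag field merely stands in for PG_readahead.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model of a page-cache slot; not a kernel type. */
struct model_page {
    bool present;        /* page is in the (modelled) page cache */
    bool readahead_flag; /* stands in for the PG_readahead flag */
};

enum ra_action {
    RA_NONE,  /* page is cached and unmarked: nothing is triggered */
    RA_SYNC,  /* page is missing: the reader has to wait for it */
    RA_ASYNC, /* marker page was hit: start the next batch, nobody waits */
};

/* Decide what a read of `page` would trigger in this model. */
static enum ra_action ra_trigger(const struct model_page *page)
{
    if (page == NULL || !page->present)
        return RA_SYNC;
    if (page->readahead_flag)
        return RA_ASYNC;
    return RA_NONE;
}
```

In this model a missing page corresponds to the case that page_cache_sync_ra() handles, and a present page carrying the marker corresponds to the case that page_cache_async_ra() handles.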
Anyone who has been around computers long enough to know that a
“kilobyte” isn’t (necessarily) 1000 bytes will also know that we
technologists often follow the practice of Lewis Carroll’s “Humpty Dumpty” in
Through the Looking Glass:
“When I use a word,” Humpty Dumpty said in rather a scornful tone, “it
means just what I choose it to mean — neither more nor less.”
We seem to make that mistake rather more than is good for us, and the
readahead code is certainly not innocent.
Each filesystem can provide an address_space_operations method, named
readahead(), to initiate a read; it is on this basis that the term
“readahead request” is used in the documentation. There is also an
address-space operation called readpages(), though it was marked as
deprecated in the middle of 2020 and will be removed for 5.18. These two
functions
have much the same functionality (they both issue read requests for a
collection of pages). The newer readahead() has a much better interface
(the details are beyond the scope of this article), but readpages() has
undoubtedly the better name — because that is what they both do. They don’t
just “read ahead” but also issue reads that have explicitly been
requested.
Once one realizes that the functionality of readahead() is just to
submit read requests, some of which the caller will wait for (“sync”)
and some of which the caller won’t wait for (“async”), the intention of
the code starts to become a lot clearer. Names matter.
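To make the split concrete, here is a minimal user-space sketch of one such request. It assumes, for illustration only, that the pages nobody waits for form the tail of the request; struct model_request and both helper functions are hypothetical names, not kernel interfaces.

```c
#include <stddef.h>

/* Hypothetical model of a single batch of read requests. */
struct model_request {
    size_t nr_pages; /* total pages in the request */
    size_t nr_async; /* trailing pages that nobody will wait for (assumed layout) */
};

/* Pages at the front of the request that the caller will wait for. */
static size_t model_sync_pages(const struct model_request *req)
{
    if (req->nr_async >= req->nr_pages)
        return 0; /* the whole request is speculative */
    return req->nr_pages - req->nr_async;
}

/* Pages at the tail of the request that are read purely in advance. */
static size_t model_async_pages(const struct model_request *req)
{
    return req->nr_pages - model_sync_pages(req);
}
```

For a request of eight pages with six of them marked as async, the model reports two pages that will be waited for and six that will not.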
When readahead can be skipped
Returning to the original problem of giving FUSE the opportunity to skip
readahead, a way forward now appears. The readahead() function that
FUSE supplies must read all the pages that will be waited for, but
it doesn’t need to read the remainder. One of the improvements to the
interface that came with the introduction of the readahead() operation
is that more information is available to the filesystem. This
information includes a struct
f