Introduced in 2019 (kernel 5.1) by Jens Axboe, io_uring (henceforth uring) is a system for providing the kernel with a schedule of system calls, and receiving the results as they’re generated. Whereas epoll and kqueue support multiplexing, where you’re told when you can usefully perform a system call using some set of filters, uring allows you to specify the system calls themselves (and dependencies between them), and execute the schedule at the kernel “dataflow limit”. It combines asynchronous I/O, system call polybatching, and flexible buffer management, and is IMHO the most substantial development in the Linux I/O model since Berkeley sockets (yes, I’m aware Berkeley sockets preceded Linux. Let’s then say that it’s the most substantial development in the UNIX I/O model to originate in Linux):
- Asynchronous I/O without the large copy overheads and restrictions of POSIX AIO
- System call batching/linking across distinct system calls
- Provide a pool of buffers, and they’ll be used as needed
- Both polling- and interrupt-driven I/O on the kernel side
The core system calls of uring are wrapped by the C API of liburing. Windows added a very similar interface, IoRing, in 2020. In my opinion, uring ought largely to displace epoll in new Linux code. FreeBSD seems to be sticking with kqueue, meaning code using uring won’t run there, but neither did epoll (save through FreeBSD’s somewhat dubious Linux compatibility layer). Both the system calls and liburing have fairly comprehensive man page coverage, including the io_uring.7 top-level page.
Rings
Central to every uring are two ringbuffers, holding CQEs (Completion Queue Entries) and SQE (Submission Queue Entry) descriptors (as best I can tell, this terminology was borrowed from the NVMe specification). Note that SQEs are allocated externally to the SQ descriptor ring. An SQE roughly corresponds to a single system call: it is tagged with an operation type, and filled in with the values that would traditionally be supplied as arguments to the appropriate function. Userspace is provided references to SQEs on the SQE ring, which it fills in and submits. Submission operates up through a specified SQE, and thus all SQEs before it in the ring must also be ready to go (this is likely the main reason why the SQ holds descriptors to an external ring of SQEs: you can acquire SQEs, and then submit them out of order). The kernel places results in the CQE ring. These rings are shared between kernel- and userspace. They must be mapped separately unless the kernel advertises the IORING_FEAT_SINGLE_MMAP feature (see below).
uring does not generally make use of errno. Synchronous functions return the negative error code as their result. Completion queue entries have the negated error code placed in their res fields.
CQEs are usually 16 bytes, and SQEs are usually 64 bytes (but see IORING_SETUP_SQE128 and IORING_SETUP_CQE32 below). Either way, SQEs are allocated externally to the submission queue, which is merely a ring of 32-bit descriptors.
System calls
The liburing interface will be sufficient for most users, and it is possible to operate almost wholly without system calls when the system is busy. For the sake of completeness, here are the three system calls implementing the uring core (from the kernel’s io_uring/io_uring.c):
    int io_uring_setup(u32 entries, struct io_uring_params *p);
    int io_uring_enter(unsigned fd, u32 to_submit, u32 min_complete, u32 flags,
                       const void *argp, size_t argsz);
    int io_uring_register(unsigned fd, unsigned opcode, void *arg,
                          unsigned int nr_args);
Note that io_uring_enter(2) corresponds more closely to the io_uring_enter2(3) wrapper, and indeed io_uring_enter(3) is defined in terms of the latter (from liburing’s src/syscall.c):
    static inline int __sys_io_uring_enter2(unsigned int fd, unsigned int to_submit,
                                            unsigned int min_complete, unsigned int flags,
                                            sigset_t *sig, size_t sz){
        return (int) __do_syscall6(__NR_io_uring_enter, fd, to_submit,
                                   min_complete, flags, sig, sz);
    }

    static inline int __sys_io_uring_enter(unsigned int fd, unsigned int to_submit,
                                           unsigned int min_complete, unsigned int flags,
                                           sigset_t *sig){
        return __sys_io_uring_enter2(fd, to_submit, min_complete, flags, sig, _NSIG / 8);
    }
io_uring_enter(2) can both submit SQEs and wait until some number of CQEs are available. Its flags parameter is a bitmask over:
Flag | Description |
---|---|
IORING_ENTER_GETEVENTS | Wait until at least min_complete CQEs are ready before returning. |
IORING_ENTER_SQ_WAKEUP | Wake up the kernel thread created when using IORING_SETUP_SQPOLL. |
IORING_ENTER_SQ_WAIT | Wait until at least one entry is free in the submission ring before returning. |
IORING_ENTER_EXT_ARG | (Since Linux 5.11) Interpret sig as a pointer to a struct io_uring_getevents_arg rather than to a sigset_t. This structure can specify both a sigset_t and a timeout (see below). |
IORING_ENTER_REGISTERED_RING | ring_fd is an offset into the registered ring pool rather than a normal file descriptor. |

    struct io_uring_getevents_arg {
        __u64 sigmask;
        __u32 sigmask_sz;
        __u32 pad;
        __u64 ts;
    };

Is ts nanoseconds from now? From the Epoch? Nope! ts is actually a pointer to a struct __kernel_timespec, passed to u64_to_user_ptr() in the kernel. This is one of the uglier aspects of uring.
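glibc provides no wrapper, so a raw invocation goes through syscall(2). A sketch of submitting all pending SQEs and blocking until at least one completion arrives (ring_fd and to_submit are assumed from context):

    #include <linux/io_uring.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    // submit to_submit SQEs, and wait for at least one CQE; no sigmask, so
    // the final size argument is ignored
    int ret = syscall(__NR_io_uring_enter, ring_fd, to_submit, 1,
                      IORING_ENTER_GETEVENTS, NULL, 0);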
Setup
The io_uring_setup(2) system call returns a file descriptor, and accepts two parameters, u32 entries and struct io_uring_params *p:
    int io_uring_setup(u32 entries, struct io_uring_params *p);

    struct io_uring_params {
        __u32 sq_entries;     // number of SQEs, filled in by the kernel
        __u32 cq_entries;     // see IORING_SETUP_CQSIZE and IORING_SETUP_CLAMP
        __u32 flags;          // see "Flags" below
        __u32 sq_thread_cpu;  // see IORING_SETUP_SQ_AFF
        __u32 sq_thread_idle; // see IORING_SETUP_SQPOLL
        __u32 features;       // see "Kernel features" below, filled in by the kernel
        __u32 wq_fd;          // see IORING_SETUP_ATTACH_WQ
        __u32 resv[3];        // must be zero
        struct io_sqring_offsets sq_off; // see "Ring structure" below, filled in by the kernel
        struct io_cqring_offsets cq_off; // see "Ring structure" below, filled in by the kernel
    };
resv must be zeroed out. In the absence of flags, the uring uses interrupt-driven I/O. Calling close(2) on the returned descriptor frees all resources associated with the uring.
io_uring_setup(2) is wrapped by liburing’s io_uring_queue_init(3) and io_uring_queue_init_params(3). When using these wrappers, io_uring_queue_exit(3) should be used to clean up. These wrappers operate on a struct io_uring. io_uring_queue_init(3) takes an unsigned flags argument, which is passed as the flags field of io_uring_params. io_uring_queue_init_params(3) takes a struct io_uring_params* argument, which is passed through directly to io_uring_setup(2). It’s best to avoid mixing the low-level API and that provided by liburing.
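For most uses, initialization is a single call. A minimal sketch using the liburing wrappers:

    #include <liburing.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void){
        struct io_uring ring;
        // request a 64-entry submission queue; no flags means interrupt-driven I/O
        int err = io_uring_queue_init(64, &ring, 0);
        if(err < 0){
            fprintf(stderr, "couldn't create uring (%s)\n", strerror(-err));
            return EXIT_FAILURE;
        }
        // ... acquire SQEs, prep them, submit, reap CQEs ...
        io_uring_queue_exit(&ring); // unmaps the rings, closes the descriptor
        return EXIT_SUCCESS;
    }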
Ring structure
The details of ring structure are only relevant when using the low-level API, and they are not exposed via liburing. They’re primarily used to prepare the three (or two, see IORING_FEAT_SINGLE_MMAP) backing memory maps. You’ll need to set these up yourself if you want to use huge pages.
    struct io_sqring_offsets {
        __u32 head;
        __u32 tail;
        __u32 ring_mask;
        __u32 ring_entries;
        __u32 flags;
        __u32 dropped;
        __u32 array;
        __u32 resv[3];
    };

    struct io_cqring_offsets {
        __u32 head;
        __u32 tail;
        __u32 ring_mask;
        __u32 ring_entries;
        __u32 overflow;
        __u32 cqes;
        __u32 flags;
        __u32 resv[3];
    };
As explained in the io_uring_setup(2) man page, the submission queue can be mapped thusly:
    mmap(0, sq_off.array + sq_entries * sizeof(__u32),
         PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
         ring_fd, IORING_OFF_SQ_RING);
The submission queue contains the internal data structure followed by an array of SQE descriptors. These descriptors are 32 bits each no matter the architecture, implying that they are indices into the SQE map, not pointers. The SQEs are allocated:
    mmap(0, sq_entries * sizeof(struct io_uring_sqe),
         PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
         ring_fd, IORING_OFF_SQES);
and finally the completion queue:
    mmap(0, cq_off.cqes + cq_entries * sizeof(struct io_uring_cqe),
         PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
         ring_fd, IORING_OFF_CQ_RING);
Recall that when the kernel expresses IORING_FEAT_SINGLE_MMAP, the submission and completion queues can be allocated in one mmap(2) call.
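In that case, a single mapping sized to the larger of the two suffices. A sketch, assuming a struct io_uring_params p filled in by io_uring_setup(2) (this is what liburing does internally):

    size_t sqsz = p.sq_off.array + p.sq_entries * sizeof(__u32);
    size_t cqsz = p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe);
    size_t len = sqsz > cqsz ? sqsz : cqsz;
    // both rings live in this one mapping; their pointers are derived by
    // applying the sq_off/cq_off offsets to the same base address
    void *rings = mmap(0, len, PROT_READ|PROT_WRITE,
                       MAP_SHARED|MAP_POPULATE, ring_fd, IORING_OFF_SQ_RING);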
Flags
The flags field is set up by the caller, and is a bitmask over:
Flag | Kernel version | Description |
---|---|---|
IORING_SETUP_IOPOLL | 5.1 | Instruct the kernel to use polled (as opposed to interrupt-driven) I/O. This is intended for block devices, and requires that O_DIRECT was provided when the file descriptor was opened. |
IORING_SETUP_SQPOLL | 5.1 (5.11 for full features) | Create a kernel thread to poll on the submission queue. If the submission queue is kept busy, this thread will reap SQEs without the need for a system call. If enough time goes by without new submissions, the kernel thread goes to sleep, and io_uring_enter(2) must be called to wake it. |
IORING_SETUP_SQ_AFF | 5.1 | Only meaningful with IORING_SETUP_SQPOLL. The poll thread will be bound to the core specified in sq_thread_cpu. |
IORING_SETUP_CQSIZE | 5.5 | Create the completion queue with cq_entries entries. This value must be greater than entries, and might be rounded up to the next power of 2. |
IORING_SETUP_CLAMP | 5.6 | Clamp entries at IORING_MAX_ENTRIES and cq_entries at IORING_MAX_CQ_ENTRIES. |
IORING_SETUP_ATTACH_WQ | 5.6 | Specify a uring in wq_fd, and the new uring will share that uring’s worker thread backend. |
IORING_SETUP_R_DISABLED | 5.10 | Start the uring disabled, requiring that it be enabled with io_uring_register(2). |
IORING_SETUP_SUBMIT_ALL | 5.18 | Continue submitting SQEs from a batch even after one results in an error. |
IORING_SETUP_COOP_TASKRUN | 5.19 | Don’t interrupt userspace processes to indicate CQE availability; completions are instead processed whenever the process next transitions into the kernel. This is usually acceptable, in which case this flag can be provided to improve performance. |
IORING_SETUP_TASKRUN_FLAG | 5.19 | Requires IORING_SETUP_COOP_TASKRUN. When completions are pending processing, the IORING_SQ_TASKRUN flag will be set in the submission ring. This will be checked by io_uring_peek_cqe(), which will enter the kernel to process them. |
IORING_SETUP_SQE128 | 5.19 | Use 128-byte SQEs, necessary for NVMe passthrough using IORING_OP_URING_CMD. |
IORING_SETUP_CQE32 | 5.19 | Use 32-byte CQEs, necessary for NVMe passthrough using IORING_OP_URING_CMD. |
IORING_SETUP_SINGLE_ISSUER | 6.0 | Hint to the kernel that only a single thread will submit requests, allowing for optimizations. This thread must either be the thread which created the ring, or (iff IORING_SETUP_R_DISABLED is used) the thread which enables the ring. |
IORING_SETUP_DEFER_TASKRUN | 6.1 | Requires IORING_SETUP_SINGLE_ISSUER. Don’t process completions at arbitrary kernel/scheduler transitions, but only when io_uring_enter(2) is called with IORING_ENTER_GETEVENTS by the thread that submitted the SQEs. |
When IORING_SETUP_R_DISABLED is used, the ring must be enabled before submissions can take place. If using the liburing API, this is done via io_uring_enable_rings(3):
    int io_uring_enable_rings(struct io_uring *ring); // liburing 2.4
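A sketch of the disabled-start flow, assuming liburing 2.4 or later:

    struct io_uring ring;
    struct io_uring_params p = {0};
    // create the ring disabled; only the enabling thread may then submit
    p.flags = IORING_SETUP_R_DISABLED | IORING_SETUP_SINGLE_ISSUER;
    int err = io_uring_queue_init_params(64, &ring, &p);
    // ... later, on the thread which will issue all requests ...
    err = io_uring_enable_rings(&ring);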
Kernel features
Various functionality was added to the kernel following the initial release of uring, and is thus not necessarily available on all kernels supporting the basic system calls. The __u32 features field of the io_uring_params parameter to io_uring_setup(2) is filled in with feature flags by the kernel, a bitmask over:
Feature | Kernel version | Description |
---|---|---|
IORING_FEAT_SINGLE_MMAP | 5.4 | A single mmap(2) can be used for both the submission and completion rings. |
IORING_FEAT_NODROP | 5.5 (5.19 for full features) | Completion queue events are not dropped. Instead, submitting results in -EBUSY until completion reaping yields sufficient room for the overflows. As of 5.19, io_uring_enter(2) furthermore returns -EBADR rather than waiting for completions. |
IORING_FEAT_SUBMIT_STABLE | 5.5 | Data submitted for async operations can be mutated following submission, rather than only following completion. |
IORING_FEAT_RW_CUR_POS | 5.6 | Reading and writing can provide -1 for offset to indicate the current file position. |
IORING_FEAT_CUR_PERSONALITY | 5.6 | Assume the credentials of the thread calling io_uring_enter(2), rather than the thread which created the uring. Registered personalities can always be used. |
IORING_FEAT_FAST_POLL | 5.7 | Internal polling for data/space readiness is supported. |
IORING_FEAT_POLL_32BITS | 5.9 | IORING_OP_POLL_ADD accepts all epoll flags, including EPOLLEXCLUSIVE. |
IORING_FEAT_SQPOLL_NONFIXED | 5.11 | IORING_SETUP_SQPOLL doesn’t require registered files. |
IORING_FEAT_ENTER_EXT_ARG | 5.11 | io_uring_enter(2) supports struct io_uring_getevents_arg. |
IORING_FEAT_NATIVE_WORKERS | 5.12 | Async helpers use native workers rather than kernel threads. |
IORING_FEAT_RSRC_TAGS | 5.13 | Registered buffers can be updated incrementally, rather than only in toto. |
IORING_FEAT_CQE_SKIP | 5.17 | IOSQE_CQE_SKIP_SUCCESS can be used to inhibit CQE generation on success. |
IORING_FEAT_LINKED_FILE | 5.17 | Defer file assignment until execution of a given request begins. |
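Programs targeting a range of kernels should check this field before relying on newer behavior. A minimal sketch:

    struct io_uring ring;
    struct io_uring_params p = {0};
    if(io_uring_queue_init_params(64, &ring, &p) < 0){
        // handle the error
    }
    if(!(p.features & IORING_FEAT_NODROP)){
        // completions can be dropped on CQ overflow; reap aggressively
    }
    if(!(p.features & IORING_FEAT_FAST_POLL)){
        // internal readiness polling is unavailable on this kernel
    }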
Registered resources
Buffers
Since Linux 5.7, user-allocated memory can be provided to uring in groups of buffers (each group having an ID), in which each buffer has its own ID. This was originally done with the io_uring_prep_provide_buffers(3) call, operating on an SQE. Since 5.19, the “ring-mapped buffers” technique (io_uring_register_buf_ring(3)) allows these buffers to be used much more effectively (see the sketch following the table).
Flag | Kernel | Description |
---|---|---|
IORING_REGISTER_BUFFERS | 5.1 | |
IORING_UNREGISTER_BUFFERS | 5.1 | |
IORING_REGISTER_BUFFERS2 | 5.13 | Takes a struct io_uring_rsrc_register (see below). |
IORING_REGISTER_BUFFERS_UPDATE | 5.13 | Takes a struct io_uring_rsrc_update2 (see below). |
IORING_REGISTER_PBUF_RING | 5.19 | Takes a struct io_uring_buf_reg (see below). |
IORING_UNREGISTER_PBUF_RING | 5.19 | |

    struct io_uring_rsrc_register {
        __u32 nr;
        __u32 resv;
        __u64 resv2;
        __aligned_u64 data;
        __aligned_u64 tags;
    };

    struct io_uring_rsrc_update2 {
        __u32 offset;
        __u32 resv;
        __aligned_u64 data;
        __aligned_u64 tags;
        __u32 nr;
        __u32 resv2;
    };

    struct io_uring_buf_reg {
        __u64 ring_addr;
        __u32 ring_entries;
        __u16 bgid;
        __u16 pad;
        __u64 resv[3];
    };
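liburing 2.4 added convenience wrappers for ring-mapped buffers. A sketch of providing a group of buffers and consuming one (pool, its sizing, and the group ID bgid are assumptions):

    #define BUFS  64
    #define BUFSZ 4096

    // register a 64-entry buffer ring as group bgid, and provide all buffers
    int err;
    struct io_uring_buf_ring *br = io_uring_setup_buf_ring(&ring, BUFS, bgid, 0, &err);
    for(unsigned i = 0 ; i < BUFS ; ++i){
        io_uring_buf_ring_add(br, pool + i * BUFSZ, BUFSZ, i,
                              io_uring_buf_ring_mask(BUFS), i);
    }
    io_uring_buf_ring_advance(br, BUFS); // publish them to the kernel

    // a read which selects its buffer from the group at execution time
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, NULL, BUFSZ, 0);
    sqe->flags |= IOSQE_BUFFER_SELECT;
    sqe->buf_group = bgid;
    // on completion, the chosen buffer ID is cqe->flags >> IORING_CQE_BUFFER_SHIFT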
Registered files
Registered (sometimes “direct”) descriptors are integers corresponding to private file handle structures internal to the uring, and can be used anywhere uring wants a file descriptor through the IOSQE_FIXED_FILE flag. They have less overhead than true file descriptors, which use structures shared among threads. Note that registered files are required for submission queue polling unless the IORING_FEAT_SQPOLL_NONFIXED feature flag was returned.
Flag | Kernel | Description |
---|---|---|
IORING_REGISTER_FILES | 5.1 | |
IORING_UNREGISTER_FILES | 5.1 | |
IORING_REGISTER_FILES2 | 5.13 | |
IORING_REGISTER_FILES_UPDATE | 5.5 (5.12 for all features) | |
IORING_REGISTER_FILES_UPDATE2 | 5.13 | |
IORING_REGISTER_FILE_ALLOC_RANGE | 6.0 | Takes a struct io_uring_file_index_range (see below). |

    struct io_uring_file_index_range {
        __u32 off;
        __u32 len;
        __u64 resv;
    };
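Registration and use via liburing might look like this sketch (sock0 and sock1 are hypothetical open descriptors):

    int fds[2] = { sock0, sock1 };
    if(io_uring_register_files(&ring, fds, 2) < 0){
        // handle the error
    }
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    // the 1 indexes the registered array (sock1); it is not file descriptor 1
    io_uring_prep_read(sqe, 1, buf, sizeof(buf), 0);
    sqe->flags |= IOSQE_FIXED_FILE;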
Personalities
Flag | Kernel | Description |
---|---|---|
IORING_REGISTER_PERSONALITY | 5.6 | Register the credentials of the calling thread, returning an ID which can be placed in an SQE’s personality field. |
IORING_UNREGISTER_PERSONALITY | 5.6 | Release a previously registered personality. |
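liburing wraps these with io_uring_register_personality(3) and io_uring_unregister_personality(3). A sketch (the open target is purely illustrative):

    // returns a personality ID on success, a negated error code otherwise
    int id = io_uring_register_personality(&ring);
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_openat(sqe, AT_FDCWD, "/some/path", O_RDONLY, 0);
    sqe->personality = id; // execute with the registered credentials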
Submitting work
Submitting work consists of four steps:
- Acquiring free SQEs
- Filling in those SQEs
- Placing those SQEs at the tail of the submission queue
- Submitting the work, possibly using a system call
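With liburing, these four steps collapse into a handful of calls. A minimal sketch:

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring); // acquire (NULL if the ring is full)
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);   // fill in
    io_uring_sqe_set_data64(sqe, 42);                   // tag it for the resulting CQE
    io_uring_submit(&ring); // place at the tail and submit, entering the kernel if necessary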
The SQE structure
struct io_uring_sqe has several large unions which I won’t reproduce in full here; consult liburing.h if you want the details. The instructive elements include:
    struct io_uring_sqe {
        __u8  opcode;  /* type of operation for this sqe */
        __u8  flags;   /* IOSQE_ flags */
        __u16 ioprio;  /* ioprio for the request */
        __s32 fd;      /* file descriptor to do IO on */
        /* ... various unions for representing the request details ... */
    };
Flags can be set on a per-SQE basis using io_uring_sqe_set_flags(3), or by writing to the flags field directly:
    static inline void io_uring_sqe_set_flags(struct io_uring_sqe *sqe,
                                              unsigned flags){
        sqe->flags = (__u8)flags;
    }
The flags are a bitmask over:

SQE flag | Description |
---|---|
IOSQE_FIXED_FILE | References a registered descriptor. |
IOSQE_IO_DRAIN | Issue only after all in-flight I/O has completed. |
IOSQE_IO_LINK | Links the next SQE: it won’t start until this one completes, and a failure severs the chain. |
IOSQE_IO_HARDLINK | Same as IOSQE_IO_LINK, but a failure does not sever the chain. |
IOSQE_ASYNC | Always operate asynchronously. |
IOSQE_BUFFER_SELECT | Select from a group of provided buffers (see buf_group). |
IOSQE_CQE_SKIP_SUCCESS | Don’t post a CQE on success. |
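Linking makes ordering dependencies explicit. A sketch of a write followed by an fsync(2) which runs only if the write succeeds:

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_write(sqe, fd, buf, len, 0);
    sqe->flags |= IOSQE_IO_LINK;      // the next SQE waits on this one
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_fsync(sqe, fd, 0);  // completes with -ECANCELED if the write fails
    io_uring_submit(&ring);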
Prepping SQEs
Each SQE must be seeded with the object upon which it acts (usually a file descriptor) and any necessary arguments. You’ll usually also use the user data area.
User data
Each SQE provides 64 bits of user-controlled data which will be copied through to any generated CQEs. Since CQEs don’t include the relevant file descriptor, you’ll almost always be encoding some kind of lookup information into this area.
    void io_uring_sqe_set_data(struct io_uring_sqe *sqe, void *user_data);
    void io_uring_sqe_set_data64(struct io_uring_sqe *sqe, __u64 data);
    void *io_uring_cqe_get_data(struct io_uring_cqe *cqe);
    __u64 io_uring_cqe_get_data64(struct io_uring_cqe *cqe);
Here’s an example C++ data type that encodes eight bits as an operation type, eight bits as an index, and forty-eight bits as other data. I typically use something like this to reflect the operation which was used, the index into some relevant data structure, and other information about the operation (perhaps an offset or a length):
    union URingCtx {
        struct rep {
            rep(uint8_t op, unsigned ix, uint64_t d):
             type(static_cast<URingCtx::rep::optype>(op)),
             idx(ix),
             data(d){
                if(type >= MAXOP){
                    throw std::invalid_argument("bad uringctx op");
                }
                if(ix > MAXIDX){
                    throw std::invalid_argument("bad uringctx index");
                }
                if(d > 0xffffffffffffull){
                    throw std::invalid_argument("bad uringctx data");
                }
            }
            enum optype: uint8_t {
                // ...app-specific types...
                MAXOP // shouldn't be used
            } type: 8;
            uint8_t idx: 8;
            uint64_t data: 48;
        } r;
        uint64_t val;
        static constexpr auto MAXIDX = 255u;
        URingCtx(uint8_t op, unsigned idx, uint64_t d): r(op, idx, d){}
        URingCtx(uint64_t v): URingCtx(v & 0xffu, (v >> 8) & 0xffu, v >> 16){}
    };
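Hypothetical usage, assuming an app-specific optype READ among the elided enumerators:

    // encode the context into the SQE...
    io_uring_sqe_set_data64(sqe, URingCtx(URingCtx::rep::READ, connidx, offset).val);
    // ...and later recover it from the resulting CQE
    URingCtx ctx(io_uring_cqe_get_data64(cqe));
    switch(ctx.r.type){
        // ...dispatch on the operation type...
    }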
The majority of I/O-related system calls now have a uring equivalent (the one major exception of which I’m aware is directory listing: there seems to be no readdir(3)/getdents(2) operation). What follows is an incomplete list.
Opening and closing file descriptors
    void io_uring_prep_openat(struct io_uring_sqe *sqe, int dfd, const char *path,
                              int flags, mode_t mode);
    void io_uring_prep_openat_direct(struct io_uring_sqe *sqe, int dfd, const char *path,
                                     int flags, mode_t mode, unsigned file_index);
    void io_uring_prep_openat2(struct io_uring_sqe *sqe, int dfd, const char *path,
                               int flags, struct open_how *how);
    void io_uring_prep_openat2_direct(struct io_uring_sqe *sqe, int dfd, const char *path,
                                      int flags, struct open_how *how, unsigned file_index);
    void io_uring_prep_accept(struct io_uring_sqe *sqe, int sockfd, struct sockaddr *addr,
                              socklen_t *addrlen, int flags);
    void io_uring_prep_accept_direct(struct io_uring_sqe *sqe, int sockfd, struct sockaddr *addr,
                                     socklen_t *addrlen, int flags, unsigned int file_index);
    void io_uring_prep_multishot_accept(struct io_uring_sqe *sqe, int sockfd, struct sockaddr *addr,
                                        socklen_t *addrlen, int flags);
    void io_uring_prep_multishot_accept_direct(struct io_uring_sqe *sqe, int sockfd, struct sockaddr *addr,
                                               socklen_t *addrlen, int flags);
    void io_uring_prep_close(struct io_uring_sqe *sqe, int fd);
    void io_uring_prep_close_direct(struct io_uring_sqe *sqe, unsigned file_index);
    void io_uring_prep_socket(struct io_uring_sqe *sqe, int domain, int type, int protocol,
                              unsigned int flags);
    void io_uring_prep_socket_direct(struct io_uring_sqe *sqe, int domain, int type, int protocol,
                                     unsigned int file_index, unsigned int flags);
    void io_uring_prep_socket_direct_alloc(struct io_uring_sqe *sqe, int domain, int type,
                                           int protocol, unsigned int flags);