An introduction to a KVM-based single-process sandbox
Hey All. In between working on my PhD, libriscv and an untitled game (it’s too much, I know), I have also been working on a KVM sandbox for single programs. A so-called userspace emulator. I wanted to make the world’s fastest sandbox using hardware virtualization, or at least I had an idea of what I wanted to do.
I wrote a blog post about sandboxing each and every request in Varnish back in 2021, titled Virtual Machines for Multi-tenancy in Varnish. In it, I wrote that I would look into using KVM for sandboxing instead of using a RISC-V emulator. And so… I went ahead and wrote TinyKVM.
So, what is TinyKVM and what does it bring to the table?
TinyKVM executes regular Linux programs with the same results as native execution.
TinyKVM can be used to sandbox regular Linux programs or programs with specialized APIs embedded into your servers.
TinyKVM’s design
In order to explain just what TinyKVM is, I’m just going to list explicit features that are currently implemented and working as intended:
TinyKVM runs static Linux ELF programs. It can also be extended with an API made by you to give it access to e.g. an outer HTTP server or cache. I’ll also be adding dynamic executable support eventually. ⏳ It currently runs on AMD64 (x86_64), and I will port it to AArch64 (64-bit ARM) at some later point in time.
TinyKVM creates hugepages where possible for guest pages. It can additionally use hugepages on the host. The result is often (if not always) higher performance than a vanilla native program. Just to hammer this in a bit: https://easyperf.net/blog/2022/09/01/Utilizing-Huge-Pages-For-Code found that just allocating 2MB pages for the execute segment gave a 5% compilation boost for the LLVM codebase.
I quickly allocated some hugepages and ran TinyKVM w/STREAM, and yes it’s quite a bit faster.
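To make the host-side part concrete, here is a minimal sketch of asking Linux to back a memory region with huge pages. It uses transparent huge pages via madvise(MADV_HUGEPAGE), which is a swapped-in technique for illustration and not necessarily how TinyKVM allocates its hugepages. The function name and the 2MB constant are assumptions, and the advice is only a hint that the kernel may ignore.

```python
import mmap

# Assumption: 2MB huge pages, the size mentioned above for x86_64.
HUGE_PAGE_SIZE = 2 * 1024 * 1024


def allocate_with_thp_hint(num_huge_pages=4):
    """Map anonymous memory and hint that it should use huge pages.

    Returns (mapping, hinted). hinted is True only if the kernel
    accepted the MADV_HUGEPAGE advice; it does not prove that huge
    pages were actually used.
    """
    size = num_huge_pages * HUGE_PAGE_SIZE
    # Private anonymous mapping, similar to what a large malloc would get.
    mem = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)

    advice = getattr(mmap, "MADV_HUGEPAGE", None)  # Linux only
    hinted = False
    if advice is not None and hasattr(mem, "madvise"):
        try:
            mem.madvise(advice)
            hinted = True
        except OSError:
            # Kernel built without THP, or THP disabled for this mapping.
            hinted = False

    # Touch every 4K page so the kernel actually faults the memory in.
    for offset in range(0, size, mmap.PAGESIZE):
        mem[offset] = 1
    return mem, hinted
```

Whether the region really ended up on huge pages can be checked in the AnonHugePages field of /proc/self/smaps while the mapping is alive.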
TinyKVM has only 2us overhead when calling a function in the guest. This may seem like a lot compared to my RISC-V emulator’s 3ns; however, we are entering another process, and we get to use all of our CPU features.
TinyKVM can halt execution after a given time without any thread or signal setup during the call. This is unavailable to regular Linux programs. With no execution timeout the call overhead is 1.2us, as we don’t need a timer anymore.
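For contrast, this is roughly what a regular Linux program has to do to put a time limit on a call: install a signal handler and arm a timer around it. This is a sketch of the conventional approach, not TinyKVM's API, and the names call_with_timeout, CallTimeout and spin_forever are made up for the example. It only works from the main thread of a process.

```python
import signal


class CallTimeout(Exception):
    """Raised when the wrapped call runs past its time budget."""


def _on_alarm(signum, frame):
    # Runs in the main thread when the timer fires and interrupts the call.
    raise CallTimeout()


def call_with_timeout(func, seconds):
    """Run func() and raise CallTimeout if it exceeds `seconds`."""
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)  # signal setup
    signal.setitimer(signal.ITIMER_REAL, seconds)            # timer setup
    try:
        return func()
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)       # disarm the timer
        signal.signal(signal.SIGALRM, old_handler)    # restore old handler


def spin_forever():
    """Stand-in for a guest function that never returns."""
    while True:
        pass
```

Every call pays for the handler and timer setup and teardown, and the handler is process-wide state, which is the kind of bookkeeping the paragraph above says TinyKVM avoids during the call.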
TinyKVM can be remotely debugged with GDB. The program can be resumed after debugging, and I’ve actually used that to live-debug a request in Varnish and see it complete normally afterwards. Cool stuff, if I may say so.
TinyKVM can fork itself into copies that use copy-on-write to allow for huge workloads like LLMs to share most memory. As an example, 6GB weights required only 260MB working memory per instance, making it highly scalable.
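The copy-on-write idea can be illustrated with plain Linux private mappings. This sketch is an analogy, not TinyKVM's fork mechanism: two "instances" map the same file (standing in for model weights) with MAP_PRIVATE, so they share the underlying pages until one of them writes, and a write only costs the writer its own copy of the touched page. The function name and sizes are assumptions.

```python
import mmap
import os
import tempfile


def demo_private_mappings(size=4 * mmap.PAGESIZE):
    """Show that a write through one private mapping stays private."""
    with tempfile.TemporaryFile() as weights:
        weights.write(b"W" * size)  # stand-in for shared model weights
        weights.flush()
        fd = weights.fileno()
        prot = mmap.PROT_READ | mmap.PROT_WRITE

        # Two instances mapping the same data copy-on-write.
        inst_a = mmap.mmap(fd, size, flags=mmap.MAP_PRIVATE, prot=prot)
        inst_b = mmap.mmap(fd, size, flags=mmap.MAP_PRIVATE, prot=prot)

        # Instance A dirties one byte; the kernel copies just that page.
        inst_a[0:1] = b"X"

        result = (inst_a[0:1], inst_b[0:1], os.pread(fd, 1, 0))
        inst_a.close()
        inst_b.close()
        return result


print(demo_private_mappings())  # (b'X', b'W', b'W')
```

Only instance A sees its own write; instance B and the backing file are untouched. The per-instance cost is the set of pages that instance has written to, which is the same shape as the 260MB of working memory against 6GB of shared weights quoted above.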
TinyKVM forks can reset themselves in record time to a previous state using mechanisms unavailable to regular Linux programs. If security is important, VMs can be made ephemeral by resetting them after every request. This removes the possibility of seeing traces of previous requests, and many classes of attacks become impossible because nothing can persist between requests. A TinyKVM instance can also be reset to another VM it was not forked from at a
17 Comments
wmf
Fascinating but I'm having trouble understanding the big picture. This runs a user process in a VM with no kernel? Does every system call become a VM exit and get proxied to the host? Or are there no system calls?
chatmasta
I love this. Please never stop doing what you’re doing.
edit: Of course you’re the top contributor to IncludeOS. That was the first project I thought of while reading this blog post. I’ve been obsessed with the idea of Network Function Virtualization for a long time. It’s the most natural boundary for separating units of work in a distributed system and produces such clean abstractions and efficient scaling mechanisms.
(I’m also a very happy user of Varnish in production btw. It’s by far the most reliable part of the stack, even more than nginx. Usually I forget it’s even there. It’s never been the cause of a bug, once I got it configured properly.)
dangoodmanUT
quick someone make rust bindings
nine_k
Oh. It's like Firecracker, only much faster 8-)
What I like most is the ability to instantly reset the state of the VM to a known predefined state. It's like restarting the VM without any actual restart. It looks like an ideal course of action for network-facing services that are constantly under attack: even if an attack succeeds, the result is erased on the next request.
Easy COW page sharing for programs that are not written with that in mind, like ML model runners, is also pretty nice.
ruben_varnish
Original post: https://fwsgonzo.medium.com/tinykvm-the-fastest-sandbox-564a…
You can find a bunch of posts related to this topic there as well.
gunian
man see virtualization man happy man see it no crossplatform man sad
jensneuse
Is this a modern version of CGI with process isolation?
notpushkin
This is so cool.
I’m exploring micro-VMs for my self-hosted PaaS, https://lunni.dev/ – and something with such little overhead seems like a really interesting option!
winternewt
I'm curious: would it be a good idea to switch my desktop Linux pc to using huge pages across the board?
tuananh
this is really cool if it works for your use cases.
Some notes from the post
> I found that TinyKVM ran at 99.7% native speed
> As long as they are static and don’t need file or network access, they might just run out-of-the box.
> The TinyKVM guest has a tiny kernel which cannot be modified
conradev
Could this be used to migrate execution of a single program between two different machines?
Tepix
Interesting to see the performance gain.
But without file i/o and network access, what are the use cases?
laurencerowe
This is really exciting. The 2.5us snapshot restore performance is on a par with Wasmtime but with the huge advantage of being able to run native code, albeit with the disadvantage of much slower but still microsecond interop.
I see there is a QuickJS demo in the tinykvm_examples repo already but it'd be great to see if it's possible to get a JIT capable JavaScript runtime working as that will be an order of magnitude faster. From my experiments with server rendering a React app native QuickJS was about 12-20ms while v8 was 2-4ms after jit warmup.
I need to study this some more but I'd love to get to the point where there was a single Deno-like executable that ran inside the sandbox and made all HTTP requests through Varnish itself. A snapshot would be taken after importing the specified JS URL and then each request would run in an isolated snapshot.
Probably needs a mechanism to reset the random seed per request.
oulipo
I'm new to this area, can someone ELI5 this? What's the difference/advantages/disadvantages compared to other process isolation like containers?
Would I use this to run a distributed infra on a server a bit like docker-compose? or it's not related?
jedisct1
Quickly someone make Zig bindings.
rwmj
Isn't this basically libkrun? https://github.com/containers/libkrun