A few years ago I was really bothered by an academic paper.
Some researchers in France put together a comparison showing lots of ways you could implement realtime collaborative editing (like Google Docs). They implemented lots of algorithms – CRDTs and OT algorithms and stuff. And they benchmarked them all to see how they perform. (Cool!!) Some algorithms worked reasonably well. But others took upwards of 3 seconds to process simple paste operations from their editing sessions. Yikes!
Which algorithm was that? Well, this is awkward but… it was mine. I mean, I didn’t invent it – but it was the algorithm I was using for ShareJS. The algorithm we used for Google Wave. The algorithm which – hang on – I knew for a fact didn’t take 3 seconds to process large paste events. What’s going on here?
I took a closer look at the paper. In their implementation, when a user pasted a big chunk of text (like 1000 characters), instead of creating 1 operation with 1000 characters, their code split the insert into 1000 individual operations. And each of those operations needed to be processed separately. D’oh – of course it’ll be slow if you do that! This isn’t a problem with the operational transformation algorithm. This is just a problem with their particular implementation.
The infuriating part was that several people sent me links to the paper and (pointedly) asked me what I think about it. Written up as a Published Science Paper, these speed comparisons seemed like a Fact About The Universe. And not what they really were – implementation details of some java code, written by a probably overstretched grad student. One of a whole bunch of implementations that they needed to code up.
“Nooo! The peer reviewed science isn’t right everybody! Please believe me!”. But I didn’t have a published paper justifying my claims. I had working code but it felt like none of the smart computer science people cared about that. Who was I? I was nobody.
Even talking about this stuff we have a language problem. We describe each system as an “algorithm”. Jupiter is an Algorithm. RGA is an Algorithm. But really there are two very separate aspects:
- The black-box behaviour of concurrent edits. When two clients edit the same region of text at the same time, what happens? Are they merged, and if so in what order? What are the rules?
- The white-box implementation of the system. What programming language are we using? What data structures? How well optimized is the code?
If some academic’s code runs slowly, what does that actually teach us? Maybe it’s like tests. A passing test suite suggests, but can never prove that there are no bugs. Likewise a slow implementation suggests, but can never prove that every implementation of the system will be slow. If you wait long enough, somebody will find more bugs. And, maybe, someone out there can design a faster implementation.
Years ago I translated my old text OT code into C, Javascript, Go, Rust and Swift. Each implementation has the same behaviour, and the same algorithm. But the performance is not even close. In javascript my transform function ran about 100 000 times per second. Not bad! But the same function in C does 20M iterations per second. That’s 200x faster. Wow!
Were the academics testing a slow version or the fast version of this code? Maybe, without noticing, they had fast versions of some algorithms and slow versions of others. It’s impossible to tell from the paper!
Making CRDTs fast
So as you may know, I’ve been getting interested in CRDTs lately. For the uninitiated, CRDTs (Conflict-Free Replicated Data types) are fancy programming tools which let multiple users edit the same data at the same time. They let you work locally with no lag. (You don’t even have to be online). And when you do sync up with other users & devices, everything just magically syncs up and becomes eventually consistent. The best part of CRDTs is that they can do all that without even needing a centralized computer in the cloud to monitor and control everything.
I want Google Docs without google. I want my apps to seamlessly share data between all my devices, without me needing to rely on some flakey startup’s servers to still be around in another decade. I think they’re the future of collaborative editing. And maybe the future of all software – but I’m not ready to talk about that yet.
But most CRDTs you read about in academic papers are crazy slow. A decade ago I decided CRDTs had some inherent problem and dismissed them. A GUID for every character? Nought but madness comes from those strange lands! But – and this is awkward to admit – I think I’ve been making the same mistake as those researchers. I was reading papers which described the behaviour of different systems. And I assumed that meant we knew the best way to implement those systems. And wow, I was super wrong.
How wrong? Well. Running this editing trace, Automerge (a popular CRDT, written by a popular researcher) takes nearly 5 minutes to run. I have a new implementation that can process the same editing trace in 56 milliseconds. That’s 0.056 seconds, which is over 5000x faster. It’s the largest speed up I’ve ever gotten from optimization work – and I’m utterly delighted by it.
Let’s talk about why automerge is currently slow, and I’ll take you through all the steps toward making it super fast.
Wait, no. First we need to start with:
What is automerge?
Automerge is a library to help you do collaborative editing. It’s written by Martin Kleppmann, who’s a little bit famous from his book and excellent talks. Automerge is based on an algorithm called RGA, which you can read about in an academic paper if you’re into that sort of thing.
Martin explains automerge far better than I will in this talk from 2020:
Automerge (and Yjs and other CRDTs) think of a shared document as a list of characters. Each character in the document gets a unique ID, and whenever you insert into the document, you name what you’re inserting after.
Imagine I type “abc” into an empty document. Automerge creates 3 items:
- Insert ‘a’, id: (seph, 0), after: ROOT
- Insert ‘b’, id: (seph, 1), after: (seph, 0)
- Insert ‘c’, id: (seph, 2), after: (seph, 1)
We can draw this as a tree!
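In code, you could represent each of those items with a little record like this. (This is just a sketch with field names of my own invention – not automerge’s actual internal state):

```typescript
// A hypothetical item shape - the field names here are mine,
// not automerge's real internals.
type Id = [agent: string, seq: number];

interface Item {
  content: string;   // the inserted character
  id: Id;            // globally unique (agent, seq) pair
  parent: Id | null; // the item we inserted after; null means ROOT
}

// Typing "abc" produces three items, each naming its predecessor:
const items: Item[] = [
  { content: 'a', id: ['seph', 0], parent: null },
  { content: 'b', id: ['seph', 1], parent: ['seph', 0] },
  { content: 'c', id: ['seph', 2], parent: ['seph', 1] },
];
```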
Let’s say Mike inserts an ‘X’ between a and b, so we get “aXbc”. Then we have:
- Insert ‘a’, id: (seph, 0), after: ROOT
- Insert ‘X’, id: (mike, 0), after: (seph, 0)
- Insert ‘b’, id: (seph, 1), after: (seph, 0)
- Insert ‘c’, id: (seph, 2), after: (seph, 1)
Note the ‘X’ and ‘b’ both share the same parent. This will happen when users type concurrently in the same location in the document. But how do we figure out which character goes first? We could just sort using their agent IDs or something. But argh, if we do that the document could end up as abcX, even though Mike inserted X before the b. That would be really confusing.
Automerge (RGA) solves this with a neat hack. It adds an extra integer to each item called a sequence number. Whenever you insert something, you set the new item’s sequence number to be 1 bigger than the biggest sequence number you’ve ever seen:
- Insert ‘a’, id: (seph, 0), after: ROOT, seq: 0
- Insert ‘X’, id: (mike, 0), after: (seph, 0), seq: 3
- Insert ‘b’, id: (seph, 1), after: (seph, 0), seq: 1
- Insert ‘c’, id: (seph, 2), after: (seph, 1), seq: 2
This is the algorithmic version of “Wow I saw a sequence number, and it was this big!” “Yeah? Mine is even bigger!“
The rule is that children are sorted first based on their sequence numbers (bigger sequence number first). If the sequence numbers match, the changes must be concurrent. In that case we can sort them arbitrarily based on their agent IDs. (We do it this way so all peers end up with the same resulting document.)
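In code, that sibling ordering rule might look something like this. (A sketch of the rule as described above – automerge’s real comparison logic lives inside its tree code and looks nothing like this):

```typescript
interface Sibling {
  agent: string;
  seq: number; // RGA sequence number
}

// Sort siblings: bigger seq first. Equal seqs mean the edits were
// concurrent, so break the tie by agent ID - any consistent order
// works, as long as every peer applies the same one.
function compareSiblings(a: Sibling, b: Sibling): number {
  if (a.seq !== b.seq) return b.seq - a.seq;
  return a.agent < b.agent ? -1 : a.agent > b.agent ? 1 : 0;
}

// The 'X' vs 'b' conflict from the example above:
const siblings: Sibling[] = [
  { agent: 'seph', seq: 1 }, // 'b'
  { agent: 'mike', seq: 3 }, // 'X'
];
siblings.sort(compareSiblings);
// 'X' (seq 3) sorts before 'b' (seq 1), so the document is "aXbc".
```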
Yjs – which we’ll see more of later – implements a CRDT called YATA. YATA is identical to RGA, except that it solves this problem with a slightly different hack. But the difference isn’t really important here.
Automerge (RGA)’s behaviour is defined by this algorithm:
- Build the tree, connecting each item to its parent
- When an item has multiple children, sort them by sequence number then by their ID.
- The resulting list (or text document) can be made by flattening the tree with a depth-first traversal.
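Those three steps fit in a few lines of code. Here’s a toy sketch (with made-up field names – string IDs standing in for (agent, seq) pairs):

```typescript
type Id = string; // e.g. 'seph-0', standing in for an (agent, seq) pair

interface TreeNode {
  id: Id;
  content: string;
  seq: number;
  children: TreeNode[];
}

// Sort each node's children by seq (bigger first), then by ID, and
// flatten the tree depth-first to recover the document text.
function flatten(node: TreeNode, out: string[] = []): string[] {
  if (node.content) out.push(node.content);
  node.children
    .sort((a, b) => b.seq - a.seq || a.id.localeCompare(b.id))
    .forEach(child => flatten(child, out));
  return out;
}

// The "aXbc" example: 'X' (seq 3) and 'b' (seq 1) share parent 'a'.
const c: TreeNode = { id: 'seph-2', content: 'c', seq: 2, children: [] };
const b: TreeNode = { id: 'seph-1', content: 'b', seq: 1, children: [c] };
const x: TreeNode = { id: 'mike-0', content: 'X', seq: 3, children: [] };
const a: TreeNode = { id: 'seph-0', content: 'a', seq: 0, children: [x, b] };
const root: TreeNode = { id: 'ROOT', content: '', seq: -1, children: [a] };

flatten(root).join(''); // → "aXbc"
```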
So how should you implement automerge? The automerge library does it in the obvious way, which is to store all the data as a tree. (At least I think so – after typing “abc” this is automerge’s internal state. Uh, uhm, I have no idea what’s going on here. And what are all those Uint8Arrays doing all over the place? Whatever.) The automerge library works by building a tree of items.
For a simple benchmark, I’m going to test automerge using an editing trace Martin himself made. This is a character by character recording of Martin typing up an academic paper. There aren’t any concurrent edits in this trace, but users almost never actually put their cursors at exactly the same place and type anyway, so I’m not too worried about that. I’m also only counting the time taken to apply this trace locally, which isn’t ideal but it’ll do. Kevin Jahns (Yjs’s author) has a much more extensive benchmarking suite here if you’re into that sort of thing. All the benchmarks here are done on my chonky ryzen 5800x workstation, with Nodejs v16.1 and rust 1.52 when that becomes appropriate. (Spoilers!)
The editing trace has 260 000 edits, and the final document size is about 100 000 characters.
As I said above, automerge takes a little under 5 minutes to process this trace. That’s just shy of 900 edits per second, which is probably fine. But by the time it’s done, automerge is using 880 MB of RAM – over 3 KB of RAM per key press. And at its peak, automerge was using 2.6 GB of RAM!
To get a sense of how much overhead there is, I’ll compare this to a baseline benchmark where we just splice all the edits directly into a javascript string. This throws away all the information we need to do collaborative editing, but it gives us a sense of how fast javascript is capable of going. It turns out javascript running on V8 is fast:
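The baseline looks something like this. (A sketch – the edit tuples here are invented for illustration, but each edit in the real trace similarly boils down to a position, a delete count, and some inserted text):

```typescript
// Baseline: splice every edit directly into a javascript string.
// Each edit is [position, numDeleted, insertedText]. This throws
// away all the CRDT metadata - it's purely a speed-of-light test.
type Edit = [number, number, string];

function applyEdits(edits: Edit[]): string {
  let doc = '';
  for (const [pos, del, ins] of edits) {
    doc = doc.slice(0, pos) + ins + doc.slice(pos + del);
  }
  return doc;
}

// A tiny made-up trace:
const edits: Edit[] = [
  [0, 0, 'hello'],
  [5, 0, ' world'],
  [0, 1, 'H'], // replace 'h' with 'H'
];
applyEdits(edits); // → "Hello world"
```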
| Test | Time taken | RAM usage |
|---|---|---|
| automerge (v1.0.0-preview2) | 291s | 880 MB |
| Plain string edits in JS | 0.61s | 0.1 MB |
This is a chart showing the time taken to process each operation throughout the test, averaged in groups of 1000 operations. I think those spikes are V8’s garbage collector trying to free up memory.
In the slowest spike near the end, a single edit took 1.8 seconds to process. Oof. In a real application, the whole app (or browser tab) would freeze up for a couple of seconds sometimes while you’re in the middle of typing.
The chart is easier to read when we average everything out a bit and zoom the Y axis. We can see the average performance gets gradually (roughly linearly) worse over time.
Why is automerge slow though?
Automerge is slow for a whole slew of reasons:
- Automerge’s core tree based data structure gets big and slow as the document grows.
- Automerge makes heavy use of Immutablejs. Immutablejs is a library which gives you clojure-like copy-on-write semantics for javascript objects. This is a cool set of functionality, but the V8 optimizer & GC struggles to optimize code that uses immutablejs. As a result, it increases memory usage and decreases performance.
- Automerge treats each inserted character as a separate item. Remember that paper I talked about earlier, where copy+paste operations are slow? Automerge does that too!
Automerge was just never written with performance in mind. Their team is working on a replacement rust implementation of the algorithm to run through wasm, but at the time of writing it hasn’t landed yet. I got the master branch working, but they have some kinks to work out before it’s ready. Switching to the automerge-rs backend doesn’t make average performance in this test any faster. (Although it does halve memory usage and smooth out performance.)
There’s an old saying with performance tuning:
You can’t make the computer faster. You can only make it do less work.
How do we make the computer do less work here? There’s lots of performance wins to be had from going through the code and improving lots of small things. But the automerge team has the right approach. It’s always best to start with macro optimizations. Fix the core algorithm and data structures before moving to optimizing individual methods. There’s no point optimizing a function when you’re about to throw it away in a rewrite.
By far, Automerge’s biggest problem is its complex tree based data structure. And we can replace it with something faster.
Improving the data structure
Luckily, there’s a better way to implement CRDTs, pioneered in Yjs. Yjs is another (competing) opensource CRDT implementation made by Kevin Jahns. It’s fast, well documented and well made. If I were going to build software which supports collaborative editing today, I’d use Yjs.
Yjs doesn’t need a whole blog post talking about how to make it fast because it’s already pretty fast, as we’ll see soon. It got there by using a clever, obvious data structure “trick” that I don’t think anyone else in the field has noticed. Instead of implementing the CRDT as a tree like automerge does:
Yjs just puts all the items in a single flat list:
That looks simple, but how do you insert a new item into a list? With automerge it’s easy:
- Find the parent item
- Insert the new item into the right location in the parent’s list of children
But with this list approach it’s more complicated:
- Find the parent item
- Starting right after the parent item, iterate through the list until we find the location where the new item should be inserted (somehow?)
- Insert it there, splicing into the array
Essentially, this approach is just a fancy insertion sort. We’re implementing a list CRDT with a list. Genius!
This sounds complicated – how do you figure out where the new item should go? But it’s complicated in the same way math is complicated. It’s hard to understand, but once you understand it, you can implement the whole insert function in about 20 lines of code:
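Here’s a compressed sketch of that insert function, using a simplified item shape of my own (the real versions in my reference-crdts repo handle more cases, but this captures the scanning loop):

```typescript
// A sketch of RGA-style integration into a flat list - simplified
// field names, not the exact reference-crdts code.
type Id = [agent: string, seq: number];

interface Item {
  content: string;
  id: Id;
  parent: Id | null; // the item we were inserted after; null = start
  seq: number;       // RGA sequence number
}

const idEq = (a: Id | null, b: Id | null): boolean =>
  a === b || (a != null && b != null && a[0] === b[0] && a[1] === b[1]);

const findIdx = (list: Item[], id: Id | null): number =>
  id == null ? -1 : list.findIndex(i => idEq(i.id, id));

function integrate(list: Item[], item: Item) {
  const parentIdx = findIdx(list, item.parent);
  let dest = parentIdx + 1;
  for (; dest < list.length; dest++) {
    const o = list[dest];
    // A bigger seq always sorts earlier among siblings - and any
    // descendant of a smaller-seq sibling also has a smaller seq.
    if (item.seq > o.seq) break;
    const oParentIdx = findIdx(list, o.parent);
    if (oParentIdx < parentIdx) break; // we've left our parent's subtree
    if (oParentIdx === parentIdx && item.seq === o.seq
        && item.id[0] < o.id[0]) break; // concurrent: tie-break by agent
    // Otherwise skip o (and implicitly its subtree) and keep scanning.
  }
  list.splice(dest, 0, item);
}

// Replaying the "aXbc" example:
const doc: Item[] = [];
integrate(doc, { content: 'a', id: ['seph', 0], parent: null, seq: 0 });
integrate(doc, { content: 'b', id: ['seph', 1], parent: ['seph', 0], seq: 1 });
integrate(doc, { content: 'c', id: ['seph', 2], parent: ['seph', 1], seq: 2 });
integrate(doc, { content: 'X', id: ['mike', 0], parent: ['seph', 0], seq: 3 });
doc.map(i => i.content).join(''); // → "aXbc"
```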
(But don’t be alarmed if this looks confusing – we could probably fit everyone on the planet who understands this code today into a small meeting room.)
I implemented both Yjs’s CRDT (YATA) and Automerge using this approach in my experimental reference-crdts codebase. Here’s the insert function, with a few more comments. The Yjs version of this function is in the same file, if you want to have a look. Despite the two papers describing quite different systems, the logic for inserting is almost identical. And even though my code looks very different, this approach is semantically identical to the actual automerge, Yjs and sync9 codebases. (Fuzzer verified™.)
If you’re interested in going deeper on this, I gave a talk about this approach at a braid meeting a few weeks ago.
The important point is this approach is better:
- We can use a flat array to store everything, rather than an unbalanced tree. This makes everything smaller and faster for the computer to process.
- The code is really s