When I was on the Google Docs team, we did a weekly bug triage where we’d look for new issues and randomly assign them to teammates to investigate. One week, we had a new top error by a wide margin.
It was a fatal error: it prevented the user from editing until they reloaded the page. It didn’t correspond to a Google Docs release. The stack trace added very little information. There wasn’t an associated spike in user complaints, so we weren’t even sure it was really happening — but if it was, it would be really bad. It was Chrome-only, starting at a specific Chrome release. That was less helpful than it sounds, since we often wrote browser-specific Docs bugs that affected only one of Internet Explorer, Firefox, Safari, or Chrome.
I tried to repro in dev. This was important for two reasons:
- It would rule out Closure Compiler, Docs’ JavaScript compiler at the time.
- Debugging unoptimized code is always easier than the alternative.
Okay, how do I begin? I crawled through our logs for internal users who had suffered from the problem. I hoped that somebody could tell me, “oh yeah, every time I try to do $foo it breaks.” But no internal users had been affected. Back to the drawing board.
I did a bunch of wild edits for a while. I added as many esoteric features as I could, copy/pasted a bunch of stuff into Docs from news websites to try to trigger the issue, played around with tables for a while. No dice.
What next? At the time, Docs had a basic scripting tool that could perform repetitive actions. It was mostly useful for performance benchmarking, but because it provided consistent behavior I tried it out. I made a 50-page doc filled with lorem ipsum and had the script bold and unbold the entire document 100 times. Somewhere around the 20th cycle it crashed. I checked my console and it was the error in question!
I do it a few more times. It’s not always the 20th iteration, but it usually happens sometime between the 10th and 40th. Sometimes it never happened at all. Okay, the bug is nondeterministic. We’re off to a bad start.
I think about the repro case. Is there anything interesting about bolding and unbolding large bodies of text? Yes, actually. In many fonts and for many text samples, bolded text is wider than unbolded text. This was true for the font I was using. So it could have something to do with wrapping lots of lines of text.
I set a breakpoint and started investigating. The crash looked like it was caused by some bad bookkeeping in the view: the code was reading a garbage cached value and crashing when it tried to operate on it.
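The failure mode here — a view that caches layout data but misses an invalidation on some edit path — can be sketched in miniature. Everything below is invented for illustration (the class, method names, and the fake width measurement are all assumptions, not Docs code):

```javascript
// Hypothetical sketch of bad view bookkeeping: a per-line layout cache
// that is not invalidated when the underlying line changes.
class LineLayoutCache {
  constructor() {
    this.widths = new Map(); // line index -> cached pixel width
  }
  measure(lineIndex, text) {
    if (!this.widths.has(lineIndex)) {
      // Stand-in for a real text-measurement call.
      this.widths.set(lineIndex, text.length * 7);
    }
    return this.widths.get(lineIndex);
  }
  // The edit path is supposed to call this — the bug is a path that doesn't.
  invalidate(lineIndex) {
    this.widths.delete(lineIndex);
  }
}

const cache = new LineLayoutCache();
let line = "hello";
console.log(cache.measure(0, line)); // 35

line = "hello world"; // the model changes...
// ...but no invalidation happens, so the view keeps the stale width:
console.log(cache.measure(0, line)); // still 35, not 77
```

Once the cached value no longer matches reality, any code that trusts it — say, line-wrapping math during a bold/unbold pass — is operating on garbage.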
At the time, Go
8 Comments
protocolture
I had something like this once.
Vendor provided an outlook plugin (ew) that linked storage directly in outlook (double ew) and contained a built in pdf viewer (disgusting) for law firms to manage their cases.
One user, regardless of PC, user account or any other isolation factor, would reliably crash the program and outlook with it.
She could work for 40 minutes on another users logged in account on another PC and reproduce the issue.
Turns out it was a memory allocation issue. When you opened a file saved in the add-on's storage via the built-in pdf viewer, it would allocate memory for it. However, when you closed the pdf file, it would not deallocate that memory. After debugging her usage for some time, I noted that there was a memory deallocation, but it was performed at intervals.
If there were 20 or so pdf allocations and then she switched customer case file before a deallocation, regardless of available memory, the memory allocation system in the addon would shit the bed and crash.
This one user, an absolute powerhouse of a woman I must say, could type 300 wpm and would rapidly read -> close -> assign -> allocate -> write notes faster than anyone I have ever seen before. We legitimately got her to rate limit herself to 2 files per 10 minutes as an initial workaround while waiting for a patch from the vendor.
I had to write one hell of a bug report to the vendor before they would even look at it. Naturally they could not reproduce the error through their normal tests and tried closing the bug on me several times. The first update they rolled out upped it to something like 40 pdfs viewed every 15 minutes. But she still managed to touch the new ceiling on occasion (I imagine billing each of those customers 7 minutes a pop or whatever law firms do) and ultimately they had to rewrite the entire memory system.
cellular
If this was a regression, could a binary search be done on check-ins? Or is the code too distributed?
BobbyTables2
Interesting writeup, but 2 days to debug “the hardest bug ever”, while accurate, seems a bit overdone.
Though abs() returning negative numbers is hilarious.. “You had one job…”
To me, the hardest bugs are nearly irreproducible “Heisenbugs” that vanish when instrumentation is added.
I’m not just talking about concurrency issues either…
The kind of bug where a reproduction attempt takes a week, not parallelizable due to HW constraints, and logging instrumentation makes it go away or fail differently.
2 days is cute though.
MrMcCall
The early-to-mid-90s "High C/C++" compiler had a bug in its floating point library for basic math functions. It ended up being a bit of a Heisenbug to track down, and I didn't initially believe it wasn't my code, but it actually ended up being in their supplied library.
It took me maybe three days to track down, from first clues to final resolution, on a 486/50 luggable with the orange on black monochrome built-in screen.
jonnycoder
I’m not even close to being on par with other faang engineers but this is far from being a very difficult bug in my experience. The hardest bugs are the ones where the repro takes days. But nonetheless the op’s tenacity is all that matters and I would trust them to solve any of the hard problems I’ve faced in the past.
nneonneo
FWIW: this type of bug in Chrome is exploitable to create out-of-bounds array accesses in JIT-compiled JavaScript code.
The JIT compiler contains passes that will eliminate unnecessary bounds checks. For example, if you write “var x = Math.abs(y); if(x >= 0) arr[x] = 0xdeadbeef;”, the JIT compiler will probably delete the if statement and the internal nonnegative array index check inside the [] operator, as it can assume that x is nonnegative.
However, if Math.abs is then “optimized” such that it can produce a negative number, then the lack of bounds checks means that the code will immediately access a negative array index – which can be abused to rewrite the array’s length and enable further shenanigans.
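The "abs produces a negative number" half of that story can be shown in a self-contained way. This is an illustration only, not Chrome's actual code: `abs32` below mimics a JIT's int32 fast path, where negating the most negative 32-bit integer overflows back to itself in two's-complement arithmetic:

```javascript
// Illustration: how an "absolute value" routine can return a negative
// number under 32-bit integer semantics. `| 0` truncates to int32.
function abs32(x) {
  x = x | 0;
  return x < 0 ? (-x | 0) : x;
}

console.log(abs32(-42));         // 42
console.log(abs32(-2147483648)); // -2147483648: "abs" returns a negative!
```

If a compiler pass has already deleted the `x >= 0` check on the strength of "abs is nonnegative," the resulting negative index flows straight into the array access with no bounds check left to stop it.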
Further reading about a Chrome CVE pretty much exactly in this mold: https://shxdow.me/cve-2020-9802/
danielodievich
When I was 12 I was just learning stuff and wrote something in C, which crashed at unpredictable intervals and I could not explain it. I took it to my 14 year old uncle who was better than me at coding for help. Now mind you this is ~ 40 years ago but I seem to remember that Borland Turbo C (I still love that IDE blue color) had debugging with breakpoints (mind blowing!) which eventually led to "duh you didn't dispose of your pointer and are reusing it and the memory there is now garbage" or something like that. I vaguely recall * or * being somewhere nearby. This was my first intro to RTFM and debugging and what a powerful intro.
hatmanstack
I'll clear my schedule.
the best line of the piece.