21 March 2025
I’ve modified the awesome rr debugger so that it can run without
needing access to CPU Hardware Performance counters. This allows rr to
be used in many more environments like cloud VMs and containers where
access to CPU HW performance counters is usually disabled. Upstream rr
requires access to CPU HW performance counters to function, so this is
a new feature and opens many additional possibilities to debug your
software with rr.
I call this variant Software Counters mode rr. Record and Replay
systems like rr deserve to be able to run everywhere and this is my
attempt at making this possible!
Running rr record/replay without access to CPU HW performance
counters is accomplished using lightweight dynamic (and static)
instrumentation. The Software Counters mode rr wiki has more details
in case you’re curious about some more of the internals.
To build, install and run Software Counters mode rr please visit
https://github.com/sidkshatriya/rr.soft
Continue reading to understand some of the basic concepts
behind Record and Replay and why it is such a powerful technique
when debugging programs.
When you watch a YouTube video, there is the unspoken expectation
that when you rewind or go forward an arbitrary number of seconds the
video is exactly the same as it had been when it was recorded!
Note: Once uploaded, the Gangnam Style video doesn’t change!
You don’t see something new in the video that was never there! The
video and audio are exactly the same. In fact, if the video was
slightly different when you replayed it again and again
it would be extremely strange. You would probably think you were
going mad if the video were slightly different on each replay :-)
Allowing you to go forwards and backwards in time when viewing a
video is critical. Once a video is uploaded to the YouTube website,
it is essentially “frozen”. This allows anybody to view the same
video any number of times and concentrate or learn from parts of
the video that are most interesting.
Let us shift focus from YouTube videos to computer programs.
Let’s say you’re attempting to run a program again and again because
you are facing some errors. When you try to debug an error, a good
strategy is to give the program the same input because you want
to understand why it is failing for your specific input.
But sadly, every single time you run a program things are slightly
different:
- Your user input like keystrokes may be the same but the timing of these
keystrokes can be subtly different. This could result in slightly different
internal state of a TUI (Text User Interface) program from run to run even though
the raw text you may have entered is the same
- If the program has a graphical IDE, even though you may click on the same buttons, the
mouse speed or mouse path is slightly different from the last run, or exact mouse click
locations differ by a few pixels
- If the program makes network calls, remote servers might respond to requests
slightly differently from run to run: there might be network failures, or
the remote servers might respond differently because their state has changed
- If the program depends on time or random numbers, the program might run a bit differently
as the time has changed or different random numbers are generated on every run
- If the program depends on files on disk, the files may have gotten modified the last time the program
ran and as a result the error may not appear again. It may be difficult to restore the
files to the state in which the same error in the program happens again.
- If the program is multithreaded, the different threads might interleave in slightly
different ways from run to run so the results might be different
I hope you get the picture: every run of a sufficiently complex program is a special “Snowflake”
even though you may try to give it the same input.
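To make the "Snowflake" point concrete, here is a minimal sketch in Python. The function name `flaky_step` is made up for illustration and is not part of rr. It takes no user input at all, yet two calls in a row give different results because it reads the clock and a random number generator:

```python
import random
import time


def flaky_step():
    """Return a value built from two nondeterministic sources.

    Hypothetical helper for illustration only: it has no explicit
    input, but its result still changes from call to call.
    """
    timestamp = time.perf_counter_ns()  # depends on when the call happens
    noise = random.random()             # depends on the RNG state
    return (timestamp, noise)


run_a = flaky_step()
run_b = flaky_step()

# Same code, same (empty) input, yet the two runs almost certainly differ.
print("runs identical?", run_a == run_b)
```

A real program has many more such sources (scheduling, network, files), which is why reproducing a bug by hand can be so hard.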
So when a program goes wrong you have not one but two simultaneous
problems:
- Find where the program bug is in the code
- Try to set the conditions of your system and replicate
user input (and often subtle things like thread interleavings) in such a way that the program
when run again shows the same bug.
But as discussed above, so many things can change from run to run!
In fact, this is the age-old problem of engineers: “But it works
for me!” or alternatively “There was a horrible error, likely to
appear in production… but I don’t know how to get the error again!”
What if, while running the program, you were simultaneously recording
it using a record/replay facility? Think of a video camera of
sorts, but for programs. So that when you replay it back, it runs
in exactly the same way: the CPU instructions executed are exactly
the same, when files are “read” or “written” or when network calls
are made, the results are exactly the same and so on and so forth.
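The idea can be sketched with a toy in Python. This is only an analogy and not how rr is implemented: rr works on whole, unmodified programs, whereas this toy needs the program to route its nondeterministic calls through an `env` object. The names `Recorder`, `Replayer` and `program` are hypothetical and exist only for this illustration. During recording, every nondeterministic result is logged; during replay, the logged results are handed back in the same order instead of asking the clock or the random number generator again:

```python
import random
import time


class Recorder:
    """Calls the real nondeterministic source and logs each result."""

    def __init__(self):
        self.log = []

    def call(self, name, fn):
        value = fn()
        self.log.append((name, value))
        return value


class Replayer:
    """Hands back logged results in order; never calls the real source."""

    def __init__(self, log):
        self._entries = iter(log)

    def call(self, name, fn):
        logged_name, value = next(self._entries)
        if logged_name != name:
            raise RuntimeError(
                f"replay diverged: expected {logged_name!r}, got {name!r}"
            )
        return value


def program(env):
    """A tiny 'program' whose result depends on time and randomness."""
    start = env.call("clock", time.time_ns)
    roll = env.call("random", lambda: random.randint(1, 6))
    return f"started at {start}, rolled {roll}"


recorder = Recorder()
recorded_output = program(recorder)

# Replaying any number of times gives exactly the recorded behaviour.
replay_1 = program(Replayer(recorder.log))
replay_2 = program(Replayer(recorder.log))
print(recorded_output == replay_1 == replay_2)
```

The design choice worth noticing is that only the results of nondeterministic operations are stored; everything else is recomputed by running the same code again.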
As another example, say you recorded a vim/emacs session using this
record/replay facility. When replaying the recording, the program
code thinks that you’ve pressed exactly the same keys in the same
exact time sequence. The internal state of vim/emacs will be exactly
the same! Now if there was a crash in vim/emacs that occurred
during the record phase you’d be able to review it again (by replaying
again) and drill down
4 Comments
yjftsjthsd-h
Well! That eliminates one of if not the biggest problems with rr:) Is there some catch or tradeoff? Performance, maybe?
IshKebab
I had to scroll a very long way to get to the most important bit:
> Running rr record/replay without access to CPU HW performance counters is accomplished using lightweight dynamic (and static) instrumentation. The Software Counters mode rr wiki has more details in case you're curious about some more of the internals.
You should move that to the top.
db48x
Very cool. It’s difficult to praise rr too much, and it just keeps getting better. If you’re not using it, you’re missing out on a superpower.
stuaxo
Very nice.
Has anyone got rr working with python?