We’re in the process of porting a significant portion of the network I/O code
in EdgeDB from Python to Rust, and we’ve been learning some very interesting
lessons along the way.
We’ve been working on a new HTTP fetch feature for EdgeDB, using reqwest
as our HTTP client library. Everything was going smoothly: the feature
worked locally, passed tests on x86_64 CI runners, and seemed stable.
But then we noticed something strange: the tests started failing
intermittently on our ARM64 CI runners.
At first glance, it looked like a deadlock. The test runner would start and
then hang indefinitely. The logs showed no errors, just a test spinning
forever, until after a few hours the CI job would finally fail with a timeout
error.
Here’s what the CI output looked like:
Current runner version: '2.321.0'
Runner name: ''
Runner group name: 'Default'
(... 6 hrs of logs ...)
still running:
test_immediate_connection_drop_streaming (pid=451) for 19874.78s
still running:
test_immediate_connection_drop_streaming (pid=451) for 19875.78s
still running:
test_immediate_connection_drop_streaming (pid=451) for 19876.78s
still running:
test_immediate_connection_drop_streaming (pid=451) for 19877.78s
Shutting down test cluster...
Not much to go on here. At first, it looked to us like a deadlock causing an
async task to block improperly. It turns out we were wrong.
Why just ARM64? This didn’t make a lot of sense to us in the beginning.
Our first theories centered on the difference in memory models between Intel
and ARM64. Intel has a fairly strict memory model: while some unusual
behaviors can happen, memory writes have a total order that all processors
agree on ([1], [2], [3]).
ARM has a much more weakly-ordered memory model [4], where
(among other things) writes may appear in different orders to different
threads.
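To make the difference concrete, here’s a minimal Rust sketch (illustrative
only, not EdgeDB code) of the classic message-passing pattern. With Relaxed
atomics, the reader below is allowed to observe the flag before the data;
ARM64 hardware can actually do this, while x86-64’s total store order tends to
hide the problem (though the compiler may still reorder):
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::thread;

static DATA: AtomicU32 = AtomicU32::new(0);
static FLAG: AtomicBool = AtomicBool::new(false);

fn main() {
    let writer = thread::spawn(|| {
        DATA.store(42, Ordering::Relaxed);
        // Should be Ordering::Release to publish DATA before the flag.
        FLAG.store(true, Ordering::Relaxed);
    });
    let reader = thread::spawn(|| {
        // Should be Ordering::Acquire to synchronize with the writer.
        while !FLAG.load(Ordering::Relaxed) {}
        // On ARM64 this may print 0: the flag can become visible before
        // the data. On x86-64 the stores become visible in program order,
        // which masks the bug.
        println!("data = {}", DATA.load(Ordering::Relaxed));
    });
    writer.join().unwrap();
    reader.join().unwrap();
}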
Since Sully’s Ph.D. thesis was on this stuff [5],
this is when he got pulled in to take a look.
Our nightly CI machines run on Amazon AWS, which has the advantage of giving
us a real, uncontainerized root user. While you can connect to GitHub runners
via SSH [6], it’s nice to have the ability to connect as the true root user to
get access to dmesg and other system logs.
To figure out what was going on, we (Sully and Matt) decided to connect
directly to the ARM64 runner and see what was happening under the hood.
First, we SSH’d into the CI machine to try to find the hung process and
connect to it:
$ aws ssm start-session --region us-west-2 --target i-
$ ps aux | grep "451"
Oh, that’s right! We run the build in a Docker container and it has its own
process namespace:
$ sudo docker exec -it /bin/sh
# ps aux | grep "451"
Wait, hold on. The hung process isn’t there either.
This wasn’t a deadlock — the process had crashed.
It turns out our test runner failed to detect this—but that’s fine, and a fix
for another day. We can see if the process left a coredump. Since a Docker
container is just a process namespace, the core dump gets passed to the Docker
host itself. We can try to find that from outside the container with journalctl:
$ sudo journalctl
systemd-coredump: Process 59530 (python3) of user 1000 dumped core.
Stack trace of thread :
...
Aha! We found it. And the core for that process lives in
/var/lib/systemd/coredump/ as expected. Note that we see a different pid here
because of process namespaces: the pid outside of the container (59530) is
different from the one inside the container (451).
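As an aside, while a process is still alive you can read the mapping between
host and container pids from /proc/<pid>/status on the host: the NSpid line
lists the pid as seen from each nested namespace, outermost first. A quick
Rust sketch (illustrative only; 59530 is the host-side pid from above):
use std::fs;

fn main() -> std::io::Result<()> {
    // Host-side pid of the process (only readable while it is alive).
    let host_pid = 59530;
    let status = fs::read_to_string(format!("/proc/{}/status", host_pid))?;
    for line in status.lines() {
        // e.g. "NSpid:  59530   451" -- 59530 on the host, 451 inside the
        // container's pid namespace.
        if line.starts_with("NSpid:") {
            println!("{}", line);
        }
    }
    Ok(())
}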
We loaded the core dump into gdb to see what happened. Unfortunately, we were
greeted with a number of errors:
$ gdb
(gdb) core-file core.python3.1000.<...>.59530.<...>
warning: Can't open file /lib64/libnss_files-2.17.so during file-backed mapping note processing
warning: Can't open file /lib64/librt-2.17.so during file-backed mapping note processing
warning: Can't open file /lib64/libc-2.17.so during file-backed mapping note processing
warning: Can't open file /lib64/libm-2.17.so during file-backed mapping note processing
warning: Can't open file /lib64/libutil-2.17.so during file-backed mapping note processing
... etc ...
(gdb) bt
#0 0x0000ffff805a3e90 in ?? ()
#1 0x0000ffff806a7000 in ?? ()
Backtrace stopped: not enough registers or memory available to unwind further
Ack. That’s not useful. We don’t have the necessary files outside of the
container, and our containers are quite minimal and don’t allow us to easily
install gdb.
Instead, we need to copy the relevant libraries out of the container and tell
gdb where the .so files live:
# mkdir /container
# docker cp :/lib /container
# docker cp :/usr /container
... etc ...
$ gdb
(gdb) set solib-absolute-prefix /container
(gdb) file /container/edgedb/bin/python3
Reading symbols from /container/edgedb/bin/python3...
(No debugging symbols found in /container/edgedb/bin/python3)
(gdb) core-file core.python3.1000.<...>.59530.<...>
(gdb) bt
#0 0x0000ffff805a3e90 in getenv () from /container/lib64/libc.so.6
#1 0x0000ffff8059c174 in __dcigettext () from /container/lib64/libc.so.6
Much better!
But rather than a crash in our new HTTP code, the backtrace revealed
something unexpected:
(gdb) bt
#0 0x0000ffff805a3e90 in getenv () from /container/lib64/libc.so.6
#1 0x0000ffff8059c174 in __dcigettext () from /container/lib64/libc.so.6
#2 0x0000ffff805f263c in strerror_r () from /container/lib64/libc.so.6
#3 0x0000ffff805f254c in strerror () from /container/lib64/libc.so.6
#4 0x00000000005bb76c in PyErr_SetFromErrnoWithFilenameObjects ()
#5 0x00000000004e4c14 in ?? ()
#6 0x000000000049f66c in PyObject_VectorcallMethod ()
#7 0x00000000005d21e4 in ?? ()
#8 0x00000000005d213c in ?? ()
#9 0x00000000005d1ed4 in ?? ()
#10 0x00000000004985ec in _PyObject_MakeTpCall ()
#11 0x00000000004a7734 in _PyEval_EvalFrameDefault ()
#12 0x000000000049ccb4 in _PyObject_FastCallDictTstate ()
#13 0x00000000004ebce8 in ?? ()
#14 0x00000000004985ec in _PyObject_MakeTpCall ()
#15 0x00000000004a7734 in _PyEval_EvalFrameDefault ()
#16 0x00000000005bee10 in ?? ()
#17 0x0000ffff7ee1f5dc in ?? () from /container/.../_asyncio.cpython-312-aarch64-linux-gnu.so
#18 0x0000ffff7ee1fd94 in ?? () from /container/.../_asyncio.cpython-312-aarch64-linux-gnu.so
We disassembled the crashing getenv function. Knowing that we build our
containers using GLIBC 2.17, we also located the relevant source for getenv to
follow along [7]:
char *
getenv (const char *name)
{
  size_t len = strlen (name);
  char **ep;
  uint16_t name_start;

  if (__environ == NULL || name[0] == '\0')
    return NULL;

  if (name[1] == '\0')
    {
      /* One-character name: each matching entry starts with that character
         followed by '='.  */
      name_start = ('=' << 8) | *(const unsigned char *) name;
      for (ep = __environ; *ep != NULL; ++ep)