2025-04-02
7 min read
So the story begins with a pair-programming session I had with my colleague, which I desperately needed: my Node skill tree is still at level 1, and I needed to get started with React because I’ll be working on our internal Backstage instance.
We worked together on a small feature, tested it locally, and it worked. Great. Now it’s time to make My Very First React Commit. So I ran the usual git add and git commit, which hooked into yarn test to automatically run the unit tests for Backstage, and that’s when everything got derailed. For all the React tutorials I have followed, I had never actually run yarn test on my machine. And the first time I tried it, the command hung and, after a long time, eventually failed:
Determining test suites to run...
● Test suite failed to run
thrown: [Error]
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
🌈 backstage ⚡
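(For the curious: the reason a plain git commit runs tests at all is a pre-commit hook. I don’t actually know how our repo wires this up, maybe husky, maybe a hand-rolled script, but conceptually it’s just something like this sitting in .git/hooks/pre-commit:)
#!/bin/sh
# hypothetical pre-commit hook: run the test suite, abort the commit if it fails
yarn test || exit 1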
I could tell it was obviously unhappy about something, and then it threw some [Error]. I have very little actual JavaScript experience, but this looks suspiciously like someone had neglected to write a proper toString() or whatever, and thus we’re stuck with the monumentally unhelpful [Error]. Searching the web yielded an entire ocean of false positives due to how vague the error message is. What a train wreck!
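For a quick illustration of why JavaScript errors so often end up looking like this (my guess at the general mechanism, not the actual Jest code path): an Error’s message and stack are non-enumerable properties, and an Error with an empty message stringifies to just “Error”, so naive formatting collapses it into exactly this kind of thing:
node -e 'console.log(`thrown: [${new Error()}]`)'         # prints: thrown: [Error]
node -e 'console.log(JSON.stringify(new Error("boom")))'  # prints: {}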
Fine, let’s put on our troubleshooting hats. My memory is not perfect, but thankfully shell history is. Let’s see all the (ultimately useless) things that were tried (with commentary):
2025-03-19 14:18 yarn test --help
2025-03-19 14:20 yarn test --verbose
2025-03-19 14:21 git diff --staged
2025-03-19 14:25 vim README.md # Did I miss some setup?
2025-03-19 14:28 i3lock -c 336699 # "I need a drink"
2025-03-19 14:34 yarn test --debug # Debug, verbose, what's the diff
2025-03-19 14:35 yarn backstage-cli repo test # Maybe if I invoke it directly ...
2025-03-19 14:36 yarn backstage-cli --version # Nope, same as mengnan's
2025-03-19 14:36 yarn backstage-cli repo --help
2025-03-19 14:36 yarn backstage-cli repo test --since HEAD~1 # Minimal changes?
2025-03-19 14:36 yarn backstage-cli repo test --since HEAD # Uhh idk no changes???
2025-03-19 14:38 yarn backstage-cli repo test plugins # The first breakthrough. More on this later
2025-03-19 14:39 n all tests.n › Press f to run only failed tests.n › Press o to only run tests related to changed files.n › Pres
filter by a filename regex pattern.n › Press t to filter by a test name regex pattern.n › Press q to quit watch mode.n › Press Ent
rigger a test run all tests.n › Press f to run only failed tests.n › Press o to only run tests related to changed files.n › Press
lter by a filename regex pattern.n › Press t to filter by a test name regex pattern.n › Press q to quit watch mode.n › Press Enter
gger a test ru # Got too excited and pasted rubbish
2025-03-19 14:44 ls -a | fgrep log
2025-03-19 14:44 find | fgrep log # Maybe it leaves a log file?
2025-03-19 14:46 yarn backstage-cli repo test --verbose --debug --no-cache plugins # "clear cache"
2025-03-19 14:52 yarn backstage-cli repo test --no-cache --runInBand . # No parallel
2025-03-19 15:00 yarn backstage-cli repo test --jest-help
2025-03-19 15:03 yarn backstage-cli repo test --resetMocks --resetModules plugins # I have no idea what I'm resetting
The first real breakthrough was test plugins, which runs only tests matching “plugins”. This effectively bypassed the “determining suites to run…” logic, which was the thing that was hanging. So, I am now able to get tests to run. However, these too eventually crash with the same cryptic [Error]:
PASS @cloudflare/backstage-components plugins/backstage-components/src/components/Cards/TeamMembersListCard/TeamMembersListCard.test.tsx (6.787 s)
PASS @cloudflare/backstage-components plugins/backstage-components/src/components/Cards/ClusterDependencyCard/ClusterDependencyCard.test.tsx
PASS @internal/plugin-software-excellence-dashboard plugins/software-excellence-dashboard/src/components/AppDetail/AppDetail.test.tsx
PASS @cloudflare/backstage-entities plugins/backstage-entities/src/AccessLinkPolicy.test.ts
● Test suite failed to run
thrown: [Error]
Re-running it or matching different tests will give slightly different run logs, but they always end with the same error.
By now, I’ve figured out that yarn test is actually backed by Jest, a JavaScript testing framework, so my next strategy is simply trying different Jest flags to see what sticks. Invariably, none do:
2025-03-19 15:16 time yarn test --detectOpenHandles plugins
2025-03-19 15:18 time yarn test --runInBand .
2025-03-19 15:19 time yarn test --detectLeaks .
2025-03-19 15:20 yarn test --debug aetsnuheosnuhoe
2025-03-19 15:21 yarn test --debug --no-watchman nonexisis
2025-03-19 15:21 yarn test --jest-help
2025-03-19 15:22 yarn test --debug --no-watch ooooooo > ~/jest.config
A pattern finally emerges
Eventually, after re-running it so many times, I started to notice a pattern. By default, after a test run Jest drops you into an interactive menu where you can (Q)uit, run (A)ll tests, and so on, and I realized that Jest would eventually crash even if it was just idling in that menu. I started timing the runs, which led me to the second breakthrough:
› Press q to quit watch mode.
› Press Enter to trigger a test run.
● Test suite failed to run
thrown: [Error]
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
yarn test . 109.96s user 14.21s system 459% cpu 27.030 total
RUNS @cloudflare/backstage-components plugins/backstage-components/src/components/Cards/TeamRoles/CustomerSuccessCard.test.tsx
RUNS @cloudflare/backstage-app packages/app/src/components/catalog/EntityFipsPicker/EntityFipsPicker.test.tsx
Test Suites: 2 failed, 23 passed, 25 of 65 total
Tests: 217 passed, 217 total
Snapshots: 0 total
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
yarn test . 110.85s user 14.04s system 463% cpu 26.974 total
No matter what Jest was doing, it always crashed after almost exactly 27 wall-clock seconds. It literally didn’t matter which tests I selected or re-ran. Even the original problem, a bare yarn test (no tests selected, it just hangs), crashed after 27 seconds:
Determining test suites to run...
● Test suite failed to run
thrown: [Error]
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
yarn test 2.05s user 0.71s system 10% cpu 27.094 total
Obviously, some sort of timeout. 27 seconds is kind of a weird number (unlike, say, 5 seconds or 60 seconds), but let’s try:
2025-03-19 15:09 find | fgrep 27
2025-03-19 15:09 git grep '\b27\b'
No decent hits.
How about something like 20+7 or even 20+5+2? Nope.
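One more variant that, in hindsight, might have been worth grepping for: JavaScript timers take milliseconds, so a 27-second timeout would more likely live in the code as 27000 (or 27 * 1000). Purely illustrative, and no guarantee it turns anything up:
git grep -n '27000'
grep -rn '27000' node_modules/jest-* node_modules/@jest 2>/dev/null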
Googling/GPT-4oing for “jest timeout 27 seconds” again yielded nothing useful. Far more people were having problems with testing asynchronously, or getting their tests to time out, than with Jest proper.
At this point, my colleague came back from his call, and with his help we determined a few other things:
- his system (MacOS) is not affected at all, versus mine (Linux)
- nvm use v20 didn’t fix it
- I can reproduce it on a clean clone of github.com/backstage/backstage (see the sketch after this list). The tests seem to progress further, to about 50+ seconds, which lends credence to a running theory that the filesystem crawler/watcher is the thing crashing: backstage/backstage is a bigger repo than our internal Cloudflare instance, so it takes longer.
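If you want to try the clean-clone reproduction yourself, it was roughly this (commands reconstructed from memory; your timings will vary):
git clone https://github.com/backstage/backstage
cd backstage
yarn install
time yarn test    # same thrown: [Error], just after ~50 seconds instead of ~27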
I next went on a little detour to grab another colleague who I know has been working on a Next.js project. He’s one of the few other people nearby who knows anything about Node.js. In my experience with troubleshooting it’s helpful to get multiple perspectives, so we can cover each other’s blind spots and avoid tunnel vision.
I then tried invoking many yarn test runs in parallel, and I did manage to stretch the crash time out to 28 or 29 seconds when the system was under heavy load. So this tells me it might not be a hard timeout but rather something processing-driven. A series of sleeps chugging along, perhaps?
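The load experiment itself was nothing clever, roughly along these lines (a sketch, not the exact commands I ran):
# start a few runs in the background to load the machine ...
for i in 1 2 3 4; do yarn test plugins >/dev/null 2>&1 & done
# ... then time one more in the foreground and watch the crash drift past 27s
time yarn test plugins
wait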
By now there was a veritable crowd of curious onlookers gathered in front of my terminal, marveling at the consistent 27-second crash and trading theories. At some point someone asked if I had tried rebooting yet, and I had to sheepishly reply that I hadn’t, but “I’m absolutely sure it wouldn’t help whatsoever”.
And the astute reader can already guess that rebooting did nothing at all, or else this wouldn’t even be a story worth telling. Besides, didn’t I tease some crazy Steam Locomotive from 1993 in the clickbaity title?
My colleague then put us back on track and suggested strace, and I decided to trace the simpler case of the idling menu (rather than trace running tests, which generated far more syscalls).
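The invocation was something along these lines (not the exact command; the important bits are -f, because yarn spawns the actual node process as a child, and a test pattern that matches nothing, so Jest just sits in its watch menu):
# -f follows child processes, -tt adds microsecond timestamps
strace -f -tt yarn test some-pattern-that-matches-nothing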
Watch Usage
› Press a to run all tests.
› Press f to run only failed tests.
› Press o to only run tests related to changed files.
› Press p to filter by a filename regex pattern.
› Press t to filter by a test name regex pattern.
› Press q to quit watch mode.
› Press Enter to trigger a test run.
epoll_wait(13, [], 1024, 1000) = 0
openat(AT_FDCWD, "/proc/self/stat", O_RDONLY) = 21
read(21, "42375 (node) R 42372 42372 11692"..., 1023) = 301
close(21) = 0
epoll_wait(13, [], 1024, 0) = 0
epoll_wait(13, [], 1024, 999) = 0
openat(AT_FDCWD, "/proc/self/stat", O_RDONLY) = 21
7 Comments
pjc50
Plus points for using strace. It's one of those debugging tools everyone knows about for emergencies that can't be solved at a higher level, and a great convenience of using Linux. The Windows ETW system is much harder to use, and I'm not sure if it's even possible at all under OSX security.
GTP
Just from the title, I suspected that Steam Locomotive had something to do with it. So I quickly glanced through the article up to the point where the locomotive shows up. Sometimes I have this idea hanging in my mind of making a version called Slow Locomotive, where the train slows down every time you press ctrl-c.
bouke
So the real problem is that Jest just executes whatever `sl` resolves to. The fix they intend to release doesn't address that; it just tries to recognise the train steaming through. How is this acceptable behaviour from a test runner? It looks like a disaster waiting to happen. What if I have `alias sl=rm -rf /`, as one typically wants to have such a command close at hand?
fifticon
As a systems programmer employed for 30+ years, when I read a story like this I get angry at the highly piled, brittle system, not at the guy having sl installed. I am aware there exists a third option of not getting angry in the first place, but I hate opaque, non-robust crap. This smells like everything I hate about front-end tooling: ignorance and arrogance in perfect balance.
snovymgodym
The real story here is that the author and his coworker wasted a bunch of time tracking down this bug because their dev environment was badly set up.
> his system (MacOS) is not affected at all versus mine (Linux)
> nvm use v20 didn't fix it
If you are writing something in NodeJS, 99% of the time it will only ever be deployed server-side on Linux, most likely in a container.
As such, your dev environment should include a dev dockerfile and all of your work should be done from that container. This also has the added benefit of marginally sandboxing the thousands of mystery-meat NPM packages that you will no doubt be downloading from the rest of your machine.
There is zero reason to even mess with a "works on my machine" or a "try a different node version" situation on this kind of NodeJS project. Figure out your dependencies, codify them in your container definition, and move on. Oh, your tests work on MacOS? Great, it could not matter less because you're not deploying there.
Honestly, kind of shocking that a company like Cloudflare wouldn't have more standard development practices in place.
Kwpolska
Naming your source control tool after a common mistyping of ls is such a Facebook move.
mzs
this is good: https://github.com/rwos/gti