This blog post has been residing in “draft” mode for quite a while now. I’ve finally decided to just publish it. As such, a few things might be a bit off, but… that’s life, isn’t it? I’m eager to hear what you think.
Back in 2020, my team at Wix launched a new internal product called CDN Stats. As the name suggests, CDN Stats is a platform that displays and aggregates data and statistics derived from Wix’s CDN.
That same year, I delivered a presentation about an experiment I had conducted: rewriting a single module as a native Node.js add-on in Rust, which resulted in a staggering 25x performance improvement.
CDN stats screenshot from the talk.
This platform empowers front-end developers at Wix by providing real-time data on the usage of their compiled assets by our customers, including:
- Platforms
- Downloads (per platform)
- Response size (transfer size)
These metrics allow front-end developers to identify which bundles need optimization and how urgent those optimizations are. They also helped us rectify a critical issue we might otherwise have overlooked: due to human error, one of our primary JavaScript assets had ballooned to 33MB(!) because it inlined uncompressed SVGs instead of externalizing them or even importing them dynamically.
This project seems fairly straightforward, doesn’t it? All we’re doing is counting and summing up some values! Toss in some database indices, present the results in a data table, and voilà. But is it really that simple?
How does that work?
In order to populate the data table, we need to sift through the CDN logs. These are tab-separated values (TSV) files stored in Amazon S3. We’re talking about ~290k files per day, which can amount to 200GB.
So, every day, we download the TSV files from the previous day, parse them into meaningful information, and store this data in our database. We accomplish this by enqueuing each log file into a job queue. This strategy allows us to parallelize the process and utilize 25 instances to parse the log files within approximately 3 hours each day.
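In broad strokes, the enqueue step looks something like the sketch below. It assumes a BullMQ-style job queue and the AWS SDK v3; the bucket, prefix, and queue/job names are placeholders rather than our actual setup.

```js
// A sketch of the daily enqueue step, assuming a BullMQ-style queue and the
// AWS SDK v3. Bucket, prefix, and queue/job names are illustrative.
import { S3Client, paginateListObjectsV2 } from '@aws-sdk/client-s3';
import { Queue } from 'bullmq';

const s3 = new S3Client({});
const logQueue = new Queue('cdn-log-files');

async function enqueueYesterdaysLogs(bucket, prefix) {
  // Each S3 key becomes one job; the 25 worker instances pull from this queue.
  for await (const page of paginateListObjectsV2({ client: s3 }, { Bucket: bucket, Prefix: prefix })) {
    for (const object of page.Contents ?? []) {
      await logQueue.add('parse-log-file', { key: object.Key });
    }
  }
}
```

Each worker then downloads its assigned file, parses it, and writes the resulting records to the database.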
Then, every few hours, we generate an aggregation for the previous week. This process takes around an hour and is executed primarily in the database using a MongoDB aggregation.
The MongoDB query was pretty intense.
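To give a flavor of what such an aggregation involves, here’s a minimal sketch of a weekly roll-up in mongo-shell terms; the collection and field names are illustrative, not our actual schema.

```js
// Roughly the shape of a weekly roll-up: match the time window, then group and
// sum per asset/platform. Collection and field names are illustrative.
db.cdnRequests.aggregate([
  { $match: { timestamp: { $gte: oneWeekAgo } } },
  {
    $group: {
      _id: { pathname: '$pathname', platform: '$platform' },
      downloads: { $sum: 1 },
      transferSize: { $sum: '$responseSize' },
    },
  },
  { $sort: { downloads: -1 } },
]);
```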
You may be thinking, “Isn’t that a problem AWS Athena can solve?” and you’d be absolutely right. However, let’s set that aside for now. It simply wasn’t an option we were able to utilize at the time.
So What’s the Problem?
Parsing gigabytes of TSV files might seem like a tedious task. Investing so much time and so many resources in it… well, it’s not exactly a joy. To be brutally honest: having 25 Node.js instances running for three hours to parse 200GB of data seems like an indication that something isn’t quite right.
The solution that eventually worked was not our first attempt. Initially, we tried using the internal Wix Serverless framework. We processed all log files concurrently (and aggregated them using `Promise.all`, isn’t that delightful?). However, to our surprise, we soon ran into out-of-memory issues, even when using utilities like `p-limit` to limit ourselves to just two jobs in parallel. So, it was back to the drawing board for us.
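For reference, the concurrency-limited version looked roughly like the sketch below; `processLogFile` stands in for our actual parsing code, and the rest is the standard `p-limit` pattern.

```js
// A sketch of the serverless attempt: cap concurrency with p-limit and await
// everything with Promise.all. processLogFile is a stand-in for our parser.
import pLimit from 'p-limit';

const limit = pLimit(2); // at most two files in flight at once

async function processAllFiles(fileKeys) {
  return Promise.all(fileKeys.map((key) => limit(() => processLogFile(key))));
}
```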
Our second approach involved migrating the same code, verbatim, to the Wix Node.js platform, which runs on Docker containers in a Kubernetes cluster. We still encountered out-of-memory issues, leading us to reduce the number of files processed in parallel. Eventually, we got it to work—however, it was disappointingly slow. Processing a single day’s worth of data took more than a day. Clearly, this was not a scalable solution!
So, we decided to try the job queue pattern. By scaling our servers to 25 containers, we managed to achieve a more reasonable processing time. But, can we truly consider running 25 instances to be reasonable?
Perhaps it’s time to consider that JavaScript may be the problem. There might be certain issues that JavaScript simply can’t solve efficiently, especially when the work is primarily CPU and memory bound—areas where Node.js doesn’t excel.
But why is that so?
Node.js is a Garbage Collected VM
Node.js is a remarkable piece of technology. It fosters innovation and enables countless developers to accomplish tasks and deliver products. Unlike languages such as C and C++, JavaScript does not require explicit memory management. Everything is handled by the runtime—in this case, a VM known as V8.
V8 determines when to free up memory. Many languages are designed this way to optimize for developer experience. For certain applications, however, explicit control over memory is unavoidable.
Let’s analyze what (approximately) transpires when we execute our simplified TSV parsing code:
```js
import fs from 'node:fs';
import readline from 'node:readline';

// logFilePath is the path to one downloaded TSV log file.
const readlineStream = readline.createInterface({ input: fs.createReadStream(logFilePath) });
const records = [];

for await (const line of readlineStream) {
  const fields = line.split('\t');
  const httpStatus = Number(fields[5]);
  if (httpStatus < 200 || httpStatus > 299) continue;
  records.push({
    pathname: fields[7],
    referrer: fields[8],
    // ...
  });
}
```
This code is quite straightforward. By using `readline`, we iterate through the lines of the file as a stream to circumvent the memory bottleneck of reading the entire file into memory. For every line, we split it by the tab character and push an item to an array of results. But what’s happening behind the scenes in terms of memory?
When `line.split('\t')` is invoked, we get a newly allocated array containing multiple items. Each item is a newly allocated string that occupies memory. This is the way Arr