Imagine you wrote a program for a pleasingly parallel problem,
where each thread does its own independent piece of work,
and the threads don’t need to coordinate except for joining the results at the end.
Obviously, you’d expect it to get faster the more cores it runs on.
You benchmark it on a laptop first, and indeed it scales
nearly perfectly across all 4 available cores. Then you run it on a big, fancy multiprocessor
machine, expecting even better performance, only to find it actually runs slower
than the laptop, no matter how many cores you give it. Uh. That is exactly what happened to me recently.
I’ve been working on Latte, a Cassandra benchmarking tool
that is probably the most efficient one you can get, both in terms of CPU and memory use.
The whole idea is very simple: a small piece of code generates data and executes a bunch of
asynchronous CQL statements against Cassandra.
Latte calls that code in a loop and records how long each iteration took.
Finally, it performs a statistical analysis of the recorded times and displays the results in various forms.
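Just to illustrate the measuring part, here is a minimal Rust sketch of timing a call and feeding the latency into an HDR histogram, which is the kind of structure Latte uses for its statistics. The workload_cycle function and the overall setup are made up for this example and are not Latte’s actual internals:

use std::time::Instant;
use hdrhistogram::Histogram;

// Hypothetical stand-in for one iteration of the workload code.
fn workload_cycle(i: u64) -> u64 {
    i.wrapping_mul(0x9E3779B97F4A7C15)
}

fn main() {
    // 3 significant digits of precision; latencies recorded in nanoseconds.
    let mut hist = Histogram::<u64>::new(3).expect("failed to create histogram");

    for i in 0..1_000_000u64 {
        let start = Instant::now();
        std::hint::black_box(workload_cycle(i));
        hist.record(start.elapsed().as_nanos() as u64)
            .expect("value out of histogram range");
    }

    println!("mean : {:.1} ns", hist.mean());
    println!("p99  : {} ns", hist.value_at_quantile(0.99));
    println!("p99.9: {} ns", hist.value_at_quantile(0.999));
}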
Benchmarking seems to be a very pleasant problem to parallelize.
As long as the code under benchmark is stateless, it can be fairly trivially called
from multiple threads. I’ve already blogged about how to achieve that in Rust
here and
here.
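As a quick reminder of that idea, here is a minimal sketch of a stateless workload shared between Tokio tasks through an Arc; the names and structure are illustrative only, not Latte’s actual code:

use std::sync::Arc;

// A stateless workload: nothing is mutated, so it can be called from many
// tasks and threads concurrently without any coordination.
struct Workload {
    row_count: u64,
}

impl Workload {
    async fn run(&self, i: u64) {
        // generate data / execute queries here
        let _key = i % self.row_count;
    }
}

#[tokio::main]
async fn main() {
    let workload = Arc::new(Workload { row_count: 100_000 });

    // Spawn one task per "thread" of work; each task gets a cheap Arc clone.
    let handles: Vec<_> = (0..8u64)
        .map(|t| {
            let w = Arc::clone(&workload);
            tokio::spawn(async move {
                for i in (t..1_000_000).step_by(8) {
                    w.run(i).await;
                }
            })
        })
        .collect();

    // Join the results at the end.
    for h in handles {
        h.await.unwrap();
    }
}

Because the workload holds no mutable state, the Arc is all the sharing that is needed; there are no locks on the hot path.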
However, at the time I wrote those earlier blog posts, Latte’s workload definition capabilities were quite limited.
It came with only two predefined, hardcoded workloads, one for reading and another one for writing.
There were a few things you could parameterize, e.g. the number and sizes of table columns, but nothing really fancy.
No secondary indexes. No custom filtering clauses. No control over the CQL text. Really nothing.
So, overall, Latte at that time was more of a proof-of-concept than a universal tool for doing real work.
Sure, you could fork it and write a new workload in Rust, then compile everything from source. But who wants to waste time
learning the internals of a niche benchmarking tool?
Rune scripting
So last year, in order to be able to measure the performance of storage-attached indexes in Cassandra,
I decided to integrate Latte with a scripting engine that would allow me to easily define workloads without recompiling
the whole program. After playing a bit with embedding CQL statements in TOML config files (which turned out to be both messy and limited),
then having some fun with embedding Lua (which is probably great in the C world, but didn’t play as nicely with Rust as I expected, although it kinda worked),
I eventually ended up with a design similar to that of sysbench but
with an embedded Rune interpreter instead of Lua.
The main selling points of Rune that convinced me were painless Rust integration and support for async code.
Thanks to async support, the user can execute CQL statements directly in the workload scripts, leveraging the asynchronous nature
of the Cassandra driver. Additionally, the Rune team is amazingly helpful and removed anything that blocked me in virtually no time.
Here is an example of a complete workload that measures the performance of selecting rows by random keys:
const ROW_COUNT = latte::param!("rows", 100000);
const KEYSPACE = "latte";
const TABLE = "basic";

pub async fn schema(ctx) {
    ctx.execute(`CREATE KEYSPACE IF NOT EXISTS ${KEYSPACE}
                 WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }`).await?;
    ctx.execute(`CREATE TABLE IF NOT EXISTS ${KEYSPACE}.${TABLE}(id bigint PRIMARY KEY)`).await?;
}

pub async fn erase(ctx) {
    ctx.execute(`TRUNCATE TABLE ${KEYSPACE}.${TABLE}`).await?;
}

pub async fn prepare(ctx) {
    ctx.load_cycle_count = ROW_COUNT;
    ctx.prepare("insert", `INSERT INTO ${KEYSPACE}.${TABLE}(id) VALUES (:id)`).await?;
    ctx.prepare("select", `SELECT * FROM ${KEYSPACE}.${TABLE} WHERE id = :id`).await?;
}

pub async fn load(ctx, i) {
    ctx.execute_prepared("insert", [i]).await?;
}

pub async fn run(ctx, i) {
    ctx.execute_prepared("select", [latte::hash(i) % ROW_COUNT]).await?;
}
You can find more info on how to write those scripts in the README.
Benchmarking the benchmarking program
Although the scripts are not JIT-compiled to native code yet, they are acceptably fast, and thanks to the limited amount of code they
typically contain, they don’t show up at the top of the profile. I’ve empirically found that the overhead of Rust-Rune FFI was lower than that of
Rust-Lua provided by mlua, probably due to the safety checks employed by mlua.
Initially, to assess the performance of the benchmarking loop, I created an empty script:
pub async fn run(ctx, i) {
}
Even though there is no function body there, the benchmarking program needs to do some work to actually run it (a rough sketch of this loop follows the list):
- schedule N parallel asynchronous invocations using
buffer_unordered
- set up fresh local state (e.g. a stack) for the Rune VM
- invoke the Rune function, passing the parameters from the Rust side
- measure the time it took to complete each returned future
- collect logs, update HDR histograms and compute other statistics
- and run all of that on M threads using the Tokio threaded scheduler
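Here is a minimal sketch of how such a loop can be assembled with buffer_unordered on a multi-threaded Tokio runtime. The run_cycle function and the parameter names are my own stand-ins, not Latte’s actual code, and the sketch leaves out the Rune VM setup, log collection and histogram updates:

use std::time::{Duration, Instant};
use futures::stream::{self, StreamExt};

// Hypothetical stand-in for invoking the (empty) Rune `run` function.
async fn run_cycle(_i: u64) {}

async fn run_benchmark(cycles: u64, parallelism: usize) -> Vec<Duration> {
    stream::iter(0..cycles)
        .map(|i| async move {
            let start = Instant::now();
            run_cycle(i).await;            // call the workload function
            start.elapsed()                // per-cycle latency for the statistics
        })
        .buffer_unordered(parallelism)     // keep N invocations in flight at once
        .collect()
        .await
}

fn main() {
    // M worker threads (the thread count is configurable in Latte).
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(8)
        .build()
        .unwrap();

    let latencies = rt.block_on(run_benchmark(1_000_000, 128));
    println!("completed {} cycles", latencies.len());
}

Latte’s real loop additionally prepares the Rune VM state and feeds each measured duration into the statistics, as listed above, but the scheduling skeleton follows the same pattern.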
The results on my old 4-core laptop with Intel Xeon E3-1505M v6 locked at 3 GHz looked very promising:
Because there are 4 cores, the throughput increases linearly up to 4 threads. Then it increases slightly more
up to 8 threads, thanks to hyper-threading, which squeezes a bit more performance out of each core. Obviously, there is no
performance improvement beyond 8 threads, because all CPU resources are saturated at that point.
I was also satisfied with the absolute numbers I got. A few million empty calls per second on a laptop sounds like
the benchmarking loop is lightweight enough not to cause significant overhead in real measurements. A local Cassandra server
launched on the same laptop can only do about 200k requests per second when fully loaded, and only if those requests are
stupidly simple and all the data fits in memory.
By the way, after adding some real data-generation code to the body, but still with no calls to the database, everything got
proportionally slower, as expected, but not more than 2x slower, so it stayed in the “millions of ops per second” range.
That was easy. I could have stopped here and announced victory. However, I was curious how fast it could go on a bigger machine with more cores.
Running an empty loop on 24 cores
A server with two Intel Xeon E5-2650L v3 processors, each with 12 cores running at 1.8 GHz, should obviously be a lot faster than an old 4-core laptop, shouldn’t it?
Well, maybe with 1 thread it would be slower because of the lower CPU frequency (3 GHz vs 1.8 GHz), but it should make up for that with many more cores.
Let the numbers speak for themselves:
You’ll agree there is something wrong here. Two threads are better than one… and that’s basically it.
I couldn’t squeeze more throughput than about 2 million calls per second,
which was about 4x worse than the throughput I got on the laptop.
Either the server was a lemon or my program had a serious scalability issue.