In this episode of jank’s development updates, we follow an exciting few weekends in which I dug deep into Clojure’s sequence implementation, built jank’s equivalent, and then benchmarked and profiled in a dizzying race to the bottom.
Introduction
Not expecting a rabbit hole, I was initially just surprised by how many allocations are involved in a normal sequence iteration in Clojure, and I set out to optimize that in jank. In fact, Clojure allocates a new sequence for every element over which it iterates!
Clojure’s interface for sequences looks like this (link):
public interface ISeq extends IPersistentCollection
{
/* Returns the current front element of the sequence. */
Object first();
/* Returns a *new* sequence where the first is the next element, or nil. */
ISeq next();
/* Returns a *new* sequence where the first is the next element, or (). */
ISeq more();
/* Returns a *new* sequence where o is placed before the current first. */
ISeq cons(Object o);
}
In particular, we’re interested in `next`. To see how this is implemented, let’s look at an example from Clojure’s `APersistentVector` sequence (link):
public ISeq next()
{
/* This seq has a member for the vector itself, `v`, and the current offset
into it, `i`. Each iteration makes a new sequence with `i` incremented. */
if(i + 1 < v.count())
{ return new APersistentVector.Seq(v, i + 1); }
return null;
}
This really surprised me, and I figured there must be a lot of cases where a sequence is only referenced in one place, so it can be changed in place in order to avoid allocations. This could potentially save millions of allocations in a typical program. For example, with something like:
(apply str [1 2 3 4 5 6 7 8 9 10])
The exact `APersistentVector.Seq` from above will be used here, resulting in 10 allocations as `apply` iterates through the sequence to build the arguments for `str`. So I built something like that in jank’s sequence API. It looks like this:
struct sequence : virtual object, seqable
{
using sequence_ptr = detail::box_type<sequence>;
virtual object_ptr first() const = 0;
virtual sequence_ptr next() const = 0;
/* Each call to next() allocates a new sequence_ptr, since it's polymorphic. When iterating
* over a large sequence, this can mean a _lot_ of allocations. However, if you own the
* sequence_ptr you have, typically meaning it wasn't a parameter, then you can mutate it
* in place using this function. No allocations will happen.
*
* If you don't own your sequence_ptr, you can call next() on it once, to get one you
* do own, and then next_in_place() on that to your heart's content. */
virtual sequence_ptr next_in_place() = 0;
/* Note, no cons here, since that's not implemented yet. */
};
The use of `next_in_place` for all sequence traversals in jank meant that, at most, one allocation was needed for an iteration of any length. In jank’s case, the same `(apply str [1 2 3 4 5 6 7 8 9 10])` went from 32 sequence allocations to only 3.
That’s a huge win. Right?
The rabbit hole
So then I benchmarked. How long does jank take to `apply` that same vector of numbers to `str`? How much did I save?
jank
Note, this benchmark fn in jank is using nanobench. Since jank doesn’t have working macros yet, the benchmark also includes invoking the function, which is not the case for Clojure.
(benchmark "apply"
(fn* []
(apply str [1 2 3 4 5 6 7 8 9 10])))
Before the `next_in_place` change (`ns/op` is the primary value of interest):
ns/op | op/s | err% | ins/op | branch/op | miss% | total | benchmark |
---|---|---|---|---|---|---|---|
7,215.69 | 138,586.80 | 0.3% | 30,505.02 | 8,329.00 | 0.4% | 0.04 | apply |
After the `next_in_place` change:
ns/op | op/s | err% | ins/op | branch/op | miss% | total | benchmark |
---|---|---|---|---|---|---|---|
6,191.03 | 161,523.93 | 0.2% | 25,375.02 | 7,220.00 | 0.4% | 0.04 | apply |
Nice! That’s about 1,000 ns we trimmed off there, by removing the extra allocations. I’m curious, though: how long does Clojure take to do the same thing?
Clojure
user=> (quick-bench (apply str [1 2 3 4 5 6 7 8 9 10]))
; Evaluation count : 629958 in 6 samples of 104993 calls.
; Execution time mean : 938.749444 ns
; Execution time std-deviation : 27.891701 ns
; Execution time lower quantile : 923.094673 ns ( 2.5%)
; Execution time upper quantile : 987.172459 ns (97.5%)
; Overhead used : 14.193132 ns
Oh no. Clojure takes about 939 ns, while jank, even with the optimized interface, takes 6,191 ns. We’re not even close!
Profile, change, benchmark, repeat
Firstly, let’s compare the actual code being benchmarked here.
Generated code
Clojure
There is an excellent tool, which has proved useful so many times during jank’s development, called clojure-goes-fast/clj-java-decompiler. With just the following:
user=> (require '[clj-java-decompiler.core :refer [decompile]])
user=> (decompile (apply str [1 2 3 4 5 6 7 8 9 10]))
We get:
public class cjd__init
{
public static final Var const__0;
public static final Var const__1;
public static final AFn const__12;
public static void load()
{
((IFn)cjd__init.const__0.getRawRoot()).invoke
(
cjd__init.const__1.getRawRoot(),
cjd__init.const__12
);
}
public static void __init0()
{
const__0 = RT.var("clojure.core", "apply");
const__1 = RT.var("clojure.core", "str");
const__12 = (AFn)RT.vector(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L);
}
static
{
__init0();
// A bit more redacted here ...
  }
}
So, to understand this, note that our expression `(apply str [1 2 3 4 5 6 7 8 9 10])` was turned into a Java class. The constants for `apply` and `str`, which are vars, were lifted, and our vector constant was also lifted. Those are the three `const__` members of the class, which are statically initialized. The actual code which does our `apply` is in `load`. We can see it basically does the following, if we sanitize the lifted constants:
apply.getRawRoot().invoke(str.getRawRoot(), vec);
Clojure’s generated code seems optimal. The vars and vector are both lifted and the load
function only gets their roots and invokes (roots can’t reasonably be cached, especially during interactive programming, since vars can be redefined at any time, including from other threads). Let’s see what jank is generating for this, to ensure it’s equally optimized.
jank
struct gen166 : jank::runtime::object,
jank::runtime::pool_item_base<gen166>,
jank::runtime::behavior::callable,
jank::runtime::behavior::metadatable {
// Some bits redacted ...
jank::runtime::context &__rt_ctx;
jank::runtime::var_ptr const str155;
jank::runtime::var_ptr const apply154;
jank::runtime::object_ptr const const165;
jank::runtime::object_ptr const const164;
jank::runtime::object_ptr const const163;
jank::runtime::object_ptr const const162;
jank::runtime::object_ptr const const161;
jank::runtime::object_ptr const const160;
jank::runtime::object_ptr const const159;
jank::runtime::object_ptr const const158;
jank::runtime::object_ptr const const157;
jank::runtime::object_ptr const const156;
gen166(jank::runtime::context &__rt_ctx)
: __rt_ctx{__rt_ctx},
str155{__rt_ctx.intern_var("clojure.core", "str").expect_ok()},
apply154{__rt_ctx.intern_var("clojure.core", "apply").expect_ok()},
const165{jank::runtime::make_box<jank::runtime::obj::integer>(10)},
const164{jank::runtime::make_box<jank::runtime::obj::integer>(9)},
const163{jank::runtime::make_box<jank::runtime::obj::integer>(8)},
const162{jank::runtime::make_box<jank::runtime::obj::integer>(7)},
const161{jank::runtime::make_box<jank::runtime::obj::integer>(6)},
const160{jank::runtime::make_box<jank::runtime::obj::integer>(5)},
const159{jank::runtime::make_box<jank::runtime::obj::integer>(4)},
const158{jank::runtime::make_box<jank::runtime::obj::integer>(3)},
const157{jank::runtime::make_box<jank::runtime::obj::integer>(2)},
const156{jank::runtime::make_box<jank::runtime::obj::integer>(1)}
{ }
jank::runtime::object_ptr call() const override
{
using namespace jank;
using namespace jank::runtime;
object_ptr call167;
{
auto const &vec168(jank::runtime::make_box<jank::runtime::obj::vector>(
const156, const157, const158, const159, const160, const161, const162,
const163, const164, const165));
call167 = jank::runtime::dynamic_call(apply154->get_root(), str155->get_root(), vec168);
}
return call167;
}
};
The outline here is similar. jank generates a struct from the expression. We have constants lifted to members, which are initialized in the struct’s constructor. Then we have a `call` function which does our work. But, looking at that `call` function, we can see it’s creating our vector, too; jank only lifted the numbers, not the whole vector! Let’s change that.
The changes: 2a8014dfae6e57273983cee8f2c7f78a2be7fe73
ns/op | op/s | err% | ins/op | branch/op | miss% | total | benchmark |
---|---|---|---|---|---|---|---|
4,671.71 | 214,054.61 | 0.2% | 23,798.02 | 6,721.00 | 0.3% | 0.03 | apply |
Nice! We’ve gone from 6,191 ns to 4,671 ns by ensuring we lift the vector out. Our generated `call` function now looks like this:
jank::runtime::object_ptr call() const override
{
using namespace jank;
using namespace jank::runtime;
object_ptr call169 = jank::runtime::dynamic_call
(
apply154->get_root(),
str155->get_root(),
const166
);
return call169;
}
Very similar to Clojure’s generated `load` function! But still over 4x slower. We know the generated code is good, so let’s dig deeper into what’s going on when we call these functions.
Sequence lengths
If we follow how `apply` works on the C++ side, it looks like this:
object_ptr apply_to(object_ptr const &source, object_ptr const &args)
{
auto const &s(args->as_seqable()->seq());
auto const length(detail::sequence_length(s, max_params + 1));
switch(length)
{
case 0:
return dynamic_call(source);
case 1:
return dynamic_call(source, s->first());
case 2:
return dynamic_call(source, s->first(), s->next_in_place()->first());
// more redacted ...
}
}
We need to know how many arguments we’re calling the function with, by getting the sequence length, and then we build out the correct call accordingly. Clojure does the same thing here. Right now, `detail::sequence_length` is O(n), but our sequences know their length. Let’s use that and add a `Countable` behavior to get an O(1) length check here. The new function looks like:
size_t sequence_length(behavior::sequence_ptr const &s, size_t const max)
{
if(s == nullptr)
{ return 0; }
  /* This allows us to be O(1). */
else if(auto const * const c = s->as_countable())
{ return c->count(); }
size_t length{ 1 };
for(auto i(s->next()); i != nullptr && length < max; i = i->next_in_place())
{ ++length; }
return length;
}
The changes: 0ec065d8ed6a986690c1055ab29d91cc50680921
ns/op | op/s | err% | ins/op | branch/op | miss% | total | benchmark |
---|---|---|---|---|---|---|---|
4,320.42 | 231,459.17 | 0.4% | 21,956.02 | 6,150.00 | 0.4% | 0.03 | apply |
From 4,671 ns to 4,320 ns. That’s good progress, but we’re still a long way off from Clojure’s 939 ns.
Packed variadic args
Once we get into `dynamic_call`, after `apply_to`, we need to check whether the function is variadic and then pack some args accordingly. Let’s take a look at our `str` function, so we can see which path will be taken.
(def str
(fn*
([]
"")
([o]
(native/raw "__value = make_box(#{ o }#->to_string());"))
([o & args]
(native/raw "std::string ret(#{ o }#->to_string().data);
auto const * const l(#{ args }#->as_list());
for(auto const &elem : l->data)
{ ret += elem->to_string().data; }
__value = make_box(ret);"))))
Ok, so when we apply `[1 2 3 4 5 6 7 8 9 10]` to this, we’ll use the variadic arity. The `o` param will be `1` and then `args` will be `(2 3 4 5 6 7 8 9 10)`. The current implementation passes in a list for `args`, which means that packing those 9 numbers requires 9 allocations. However, we can see that Clojure uses an `ArraySeq` for packed arguments instead, here. Let’s do that.
The ch