jank development update – Optimizing a ray tracer by Jeaye

Share This Article

Sed ut perspiciatis unde.

After the last post, which focused on optimizing jank’s sequences, I wanted to get jank running a ray tracer I had previously written in Clojure. In this post, I document what was required to start ray tracing in jank and, more importantly, how I chased down the run time in a fierce battle with Clojure’s performance.

Missing Clojure functions

Coming out of the last blog post, there were quite a few functions which the ray tracer required that jank did not yet have. A lot of this was tedium, but there are some interesting points.

Polymorphic arithmetic

In Clojure JVM, since everything can be an object, and Clojure’s dynamically typed, we can’t know what something like (+ a b) actually does. For example, it’s possible that either a or b is not a number, but it’s also possible that they’re an unboxed long or double, or a boxed Long or a Double, or maybe even a BigInteger or Ratio. Each of these will handle + slightly differently. Clojure (and now jank) handles this using a neat polymorphic design. In jank, it starts with this number_ops interface:

struct number_ops
{
  virtual number_ops const& combine(number_ops const&) const = 0;
  virtual number_ops const& with(integer_ops const&) const = 0;
  virtual number_ops const& with(real_ops const&) const = 0;

  virtual object_ptr add() const = 0;
  virtual object_ptr subtract() const = 0;
  virtual object_ptr multiply() const = 0;
  virtual object_ptr divide() const = 0;
  virtual object_ptr remainder() const = 0;
  virtual object_ptr inc() const = 0;
  virtual object_ptr dec() const = 0;
  /* ... and so on ... */
};

jank then has different implementations of this interface, like integer_ops and real_ops. The trick here is to use the correct “ops” for the combination of left and right. By left and right, I mean, when looking at the expression (+ a b), we see the left side, a, and the right side, b. So if they’re both integers, we can use the integer_ops, which returns more integers. But if one is an integer and the other is a real, we need need to return a real. You can see this in Clojure, since (+ 1 2) is 3, but (+ 1 2.0) is 3.0.

The way this comes together is something like this:

object_ptr add(object_ptr const l, object_ptr const r)
{ return with(left_ops(l), right_ops(r)).add(); }

You can see the Clojure source for this here and the jank source for this here.

Running the ray tracer

This ray tracer is a partial port of the very fun Ray tracing in one weekend project. It’s not meant to be fast; it’s a learning tool. However, it was a pure Clojure project I had laying around and seemed like a goal next goal for jank. The source code for both the jank and the Clojure versions are in this gist. They vary only in the math functions used; each one uses its host interop for them. Note that this is not the most idiomatic Clojure; it was written to work with the limitations of a previous iteration of jank, then somewhat upgraded. jank still doesn’t support some idiomatic things like using keywords as functions, so there are many calls to get.

After giving that code a tolerant scan, let’s take a look at the initial numbers.

First timing results

I’m ray tracing a tiny image. It’s only 10×6 pixels. However, in order to create this tiny image, we need to cast 265 rays. Here’s the image (there’s a larger one at the end of the post):

Now, the initial timing for this (using nanobench) for jank is here:

ms/op	op/s	err%	ins/op	bra/op	miss%	total	benchmark
797.49	1.25	2.1%	4,864,568,466.00	873,372,774.00	1.3%	8.61	`ray`

Just shy of 800 milliseconds. It’s less than a second, yes, but it’s also a trivially tiny image. Let’s see how Clojure does with the same code (using criterium.

; (out) Evaluation count : 12 in 6 samples of 2 calls.
; (out)              Execution time mean : 69.441424 ms
; (out)     Execution time std-deviation : 8.195639 ms
; (out)    Execution time lower quantile : 63.812357 ms ( 2.5%)
; (out)    Execution time upper quantile : 83.135203 ms (97.5%)
; (out)                    Overhead used : 11.626368 ns

Yikes. Clojure only takes 69 ms. jank takes 11.6x as long as Clojure to render the same image. That’s ok, though. We didn’t want this to be easy.

Profile, change, benchmark, repeat

My general understanding of the ray tracer was that we would spend most of our time:

Crunching numbers (using that polymorphic arithmetic I outlined above)
Creating maps (for rays, mainly)
Looking up data in maps (for rays, mainly)

Keywords with identity for equal and hash

Going into this, I knew one thing I could tackle right out of the gate. Map lookups were just using equality checks, but keywords can be compared using identity (pointer value), since they’re interned.

The changes: 067e0ee400bffea010cf8bfad0e81a75134f7771

ms/op	op/s	err%	ins/op	bra/op	miss%	total	benchmark
559.31	1.79	2.2%	3,317,176,052.00	613,115,971.00	1.7%	6.10	`ray`

Very nice. We’re coming out swinging. From 797.49 ms to 559.31 ms is a big win.

Compare generated code

After this, I didn’t have other optimizations planned, so the next reasonable step was to just take a look at the generated code from both jank and Clojure and try to spot some differences. For decompiling Clojure code, I’ll use the excellent clj-java-decompiler.

Given the three expected time sinks I listed above, I chose a function which had map creation, map lookups, and arithmetic. Should be the exact sort of thing we need to optimize. Here’s the function:

(defn vec3-scale [l n]
  {:r (* (get l :r) n)
   :g (* (get l :g) n)
   :b (* (get l :b) n)})

Let’s see the jank code for this first. I’ve annotated it for readability.

/* Our function was turned into a struct which implements
   some jank interfaces. */
struct vec3_scale579 : jank::runtime::object,
                       jank::runtime::behavior::callable,
                       jank::runtime::behavior::metadatable {
  jank::runtime::context &__rt_ctx;

  /* Vars referenced within the fn are lifted to members.
     In this case, * and get. */
  jank::runtime::var_ptr const _STAR_595;
  jank::runtime::var_ptr const get596;

  /* Constants are lifted to members. */
  jank::runtime::object_ptr const const600;
  jank::runtime::object_ptr const const594;
  jank::runtime::object_ptr const const598;

  /* Constructor which initializes all lifted vars and constants. */
  vec3_scale579(jank::runtime::context &__rt_ctx)
      : __rt_ctx{__rt_ctx},
        _STAR_595{__rt_ctx.intern_var("clojure.core", "*").expect_ok()},
        get596{__rt_ctx.intern_var("clojure.core", "get").expect_ok()},
        const600{__rt_ctx.intern_keyword("", "b", true)},
        const594{__rt_ctx.intern_keyword("", "r", true)},
        const598{__rt_ctx.intern_keyword("", "g", true)}
  { }

  /* This is where the actual jank code we wrote runs. */
  jank::runtime::object_ptr call
  (
    jank::runtime::object_ptr l,
    jank::runtime::object_ptr n
  ) const override {
    /* First we call (* (get l :r) n). We can see the calls to get and * here. */
    object_ptr call607;
    {
      object_ptr call608;
      { call608 = jank::runtime::dynamic_call(get596->get_root(), l, const594); }
      call607 = jank::runtime::dynamic_call(_STAR_595->get_root(), call608, n);
    }

    /* Same thing for calling (* (get l :g) n). */
    object_ptr call609;
    {
      object_ptr call610;
      { call610 = jank::runtime::dynamic_call(get596->get_root(), l, const598); }
      call609 = jank::runtime::dynamic_call(_STAR_595->get_root(), call610, n);
    }

    /* Same thing for calling (* (get l :b) n). */
    object_ptr call611;
    {
      object_ptr call612;
      { call612 = jank::runtime::dynamic_call(get596->get_root(), l, const600); }
      call611 = jank::runtime::dynamic_call(_STAR_595->get_root(), call612, n);
    }

    /* Finally, we create a map from all of this and return it. */
    auto const map613
    (
      jank::make_box<jank::runtime::obj::map>
      (std::in_place, const594, call607, const598, call609, const600, call611)
    );
    return map613;
  }
};

No big surprises there. Each var is dereferenced (with ->get_root()) when it’s used, since vars can change at any time, from any thread. Let’s see what Clojure generates for the same function.

/* Just like in jank, a class was generated for this function. */
public final class core$vec3_scale extends AFunction
{
    /* Just like in jank, the constants were lifted up to be members. */
    public static final Keyword const__0;
    public static final Keyword const__3;
    public static final Keyword const__4;
    
    public static Object invokeStatic(final Object l, final Object n)
    {
      /* Wait... no var derefs. Clojure didn't generate calls to clojure.core/*
         at all. It replaced them with calls to `Numbers.multiply`. Same with
         `RT.get`.*/
      return RT.mapUniqueKeys
      (
        const__0, Numbers.multiply(RT.get(l, const__0), n),
        const__3, Numbers.multiply(RT.get(l, const__3), n),
        const__4, Numbers.multiply(RT.get(l, const__4), n)
      );
    }
    
    @Override
    public Object invoke(final Object l, final Object n)
    { return invokeStatic(l, n); }
    
    /* We initialize the constants here, instead of in the function's
       constructor. This is a small performance win, since it only needs to
       happen once. */
    static {
      const__0 = RT.keyword(null, "r");
      const__3 = RT.keyword(null, "g");
      const__4 = RT.keyword(null, "b");
    }
}

Ok, that’s cheating! Clojure skipped past the actual implementation of clojure.core/* and generated something else in its place. This not only skips the var dereference, but it allows for calling right into a Java function which is going to be faster than something written in Clojure. It also skipped past the var for clojure.core/get and instead generated in a call to RT.get. While what I said about vars changing all the time is true, Clojure makes an exception for some functions within clojure.core.

Fine. I can cheat, too. The changes:

Optimize calls to get: 7258ddd3c58debf09070a71a4a1149bd1170d440
Optimize math calls: cde89658cec801ae54a4858d75e99ad9fde4bd2a

But before we look at the new time, there was another thing Clojure is doing. Clojure unboxes numbers whenever possible. Unboxing means to use automatic memory, rather than dynamic memory for storing something; this is frequently referred to as using “the stack”. Take a look at this function here:

(defn reflectance [cosine ref-idx]
  (let [r (/ (- 1.0 ref-idx)
             (+ 1.0 ref-idx))
        r2 (* r r)]
    (* (+ r2 (- 1.0 r2))
       (Math/pow (- 1.0 cosine) 5.0))))

Because (- 1.0 ref-idx) contains a double, Clojure can know that the whole expression will return a double. By then tracking how r is used, we can see if it requires boxing at all. In this case, r is only used with *, which doesn’t require boxing. So r can actually just be a double instead of a Double. The same applies for everything else. Take a look at the generated code.

public final class core$reflectance extends AFunction
{
  public static Object invokeStatic(final Object cosine, final Object ref_idx)
  {
    final double r = Numbers.minus(1.0, ref_idx) / Numbers.add(1.0, ref_idx);
    final double r2 = r * r;
    return Numbers.multiply(r2 + (1.0 - r2), Math.pow(Numbers.minus(1.0, cosine), 5.0));
  }
}

One last point about this: you may note that we’re not boxing the double we’re returning. The call to Numbers.multiply with two double inputs returns a double, but our function returns an Object. This works because Java supports auto-boxing and unboxing. In short, it will allow you to implicitly treat boxed and unboxed objects the same, injecting in the necessary code when it compiles. So, don’t be fooled, the final return value here is boxed.

To do a similar thing in jank, I broke it into some steps:

Generate more type info

Changes: 3e18c1025f3a6db2028d55f819594208197e1a78

The goal here is to use boxed types (integer_ptr, keyword_ptr, etc) during codegen, whenever we have them, rather than just object_ptr everywhere. Also, whenever possible, use auto to allow the propagation of richer types.

Add boxed typed overloads for math fns

Changes: 083f08374dd371dacf6b854decbada83d336fbd6

Even without unboxing, if we can know we’re adding a real_ptr and a real_ptr, for example, we can skip the polymorphic dance and get right at their internal data. We don’t do that yet, here, but we add the right overloads.

Extend the codegen to convey, for any expression, whether a box is needed

Changes: 626680ecdb09d9007378e5f17bd06f8bcfd285ff

This sets the stage for unboxed math, if conditions, let bindings, etc.

Remove polymorphism from boxed math ops, when possible

Changes: 87715f9a711e0e0bd667afce221ccbe4a85b5501

This utilizes the typed overloads to optimize the calls where we have a typed box like integer_ptr or real_ptr.

Add unboxed math overloads

Changes: d6d97e9a58f6a40dcc425b8da80c9bd69faa0886

Wrapping everything together, this allows expressions like (/ 1.0 (+ n 0.5)) to avoid boxing, at least for the (+ n 0.5). Conditions for if don’t require boxing, so something like (if (< y (/ 1.0 (+ n 0.5))) ...) wouldn't box at all.

I didn't implement unboxed let bindings, since that requires tracking binding usages to know

jank development update – Optimizing a ray tracer by Jeaye

jank development update – Optimizing a ray tracer by Jeaye

Share This Article

Newsletter

Missing Clojure functions

Polymorphic arithmetic

Running the ray tracer

First timing results

Profile, change, benchmark, repeat

Keywords with identity for equal and hash

Compare generated code

Generate more type info

Add boxed typed overloads for math fns

Extend the codegen to convey, for any expression, whether a box is needed

Remove polymorphism from boxed math ops, when possible

Add unboxed math overloads

HackTech

Leave a comment Cancel reply

Editor's Choice

jank development update – Optimizing a ray tracer by Jeaye

jank development update – Optimizing a ray tracer by Jeaye

Share This Article

Newsletter

Missing Clojure functions

Polymorphic arithmetic

Running the ray tracer

First timing results

Profile, change, benchmark, repeat

Keywords with identity for equal and hash

Compare generated code

Generate more type info

Add boxed typed overloads for math fns

Extend the codegen to convey, for any expression, whether a box is needed

Remove polymorphism from boxed math ops, when possible

Add unboxed math overloads

HackTech

Leave a comment Cancel reply

Editor's Choice

Sign Up to Our Newsletter