While all JVM build tools take about the same amount of time for the initial build,
what is interesting is what happens for incremental builds. For example, below we
append a line defining class dummy to Foo.scala to force each tool to re-compile
the code and re-build the assembly:
> echo "class dummy" >> src/main/scala/foo/Foo.scala
> ./mill show assembly
".../out/assembly.dest/out.jar"
Total time: 1s
> sbt assembly
Built: .../target/scala-2.12/spark-app-assembly-0.1.jar
Total time: 20s
> ./mvnw package
Building jar: .../target/spark-app-0.1-jar-with-dependencies.jar
Total time: 22s
Here, we can see that Mill only took 1s to re-build the assembly jar, while SBT
and Maven took the same ~20s that they took the first time the jar was built.
If you play around with it, you will see that the assembly jars do contain the
classfiles associated with our newly-added code:
> jar tf out/assembly.dest/out.jar | grep dummy
foo/dummy.class
> jar tf target/scala-2.12/spark-app-assembly-0.1.jar | grep dummy
foo/dummy.class
> jar tf target/spark-app-0.1-jar-with-dependencies.jar | grep dummy
foo/dummy.class
You can try making other code changes, e.g. to the body of the Spark program itself,
and running the output jar with java -jar to see that your changes are indeed
taking effect. So the question you may ask is: how is it that Mill is able to
rebuild its output assembly jar in ~1s, while other build tools spend a whole
~20s rebuilding it?
Multi-Step Assemblies
The trick to Mill’s fast incremental rebuilding of assembly jars is to split the
assembly jar creation into three phases.
Typically, construction of an assembly jar is a slow single-step process. The
build tool has to take all third-party dependencies, local dependencies, and
the module being assembled, compress all their files, and assemble them into a .jar:
Mill instead does the assembly as a three-step process. In Mill, the
third_party_libraries, local_dependencies, and current_module are added one by one
to construct the final jar (a small sketch of this chaining follows the list below):
- Third-party libraries are combined into an upstream_thirdparty_assembly in the first step, which is slow but rarely needs to be re-run
- Local upstream modules are combined with upstream_thirdparty_assembly into an upstream_assembly in the second step, which needs to happen more often but is faster
- The current module is combined into upstream_assembly in the third step, which is the fastest step but needs to happen the most frequently
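To make the chaining concrete, here is a small sketch of the three steps as plain
Scala functions. This is an illustration rather than Mill's actual implementation:
the makeJar and appendToJar helpers, the ThreeStepAssembly object, and the output
file names are all hypothetical, and the jar-writing bodies are left as placeholders.
The point is only the shape of the pipeline: each step starts from the previous
step's cached jar and adds more files on top.

import java.nio.file.{Files, Path, Paths, StandardCopyOption}

object ThreeStepAssembly {
  // Hypothetical helpers, not Mill's real API: `makeJar` would compress every
  // input file into a fresh jar, while `appendToJar` would copy an existing jar
  // and add extra entries without re-compressing what is already inside it.
  def makeJar(out: Path, inputs: Seq[Path]): Path = {
    Files.deleteIfExists(out); Files.createFile(out) // placeholder for the slow full build
    out
  }
  def appendToJar(base: Path, out: Path, extra: Seq[Path]): Path = {
    Files.copy(base, out, StandardCopyOption.REPLACE_EXISTING) // start from the cached jar
    out                                                        // placeholder for appending `extra`
  }

  // Step 1: third-party libraries only. Slow, but its inputs rarely change, so
  // the cached upstream_thirdparty_assembly.jar is almost always re-used.
  def upstreamThirdpartyAssembly(thirdPartyJars: Seq[Path]): Path =
    makeJar(Paths.get("upstream_thirdparty_assembly.jar"), thirdPartyJars)

  // Step 2: local upstream modules, layered on top of step 1's jar.
  def upstreamAssembly(step1Jar: Path, localDeps: Seq[Path]): Path =
    appendToJar(step1Jar, Paths.get("upstream_assembly.jar"), localDeps)

  // Step 3: only the current module, layered on top of step 2's jar. This is the
  // cheapest step and the only one that must re-run after a typical code edit.
  def assembly(step2Jar: Path, currentModule: Seq[Path]): Path =
    appendToJar(step2Jar, Paths.get("assembly.jar"), currentModule)
}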
The key here is that the intermediate upstream_thirdparty_assembly and
upstream_assembly jar files can be re-used. This means that any changes to
third_party_libraries will still have to go through the slow process
of creating the assemblies from scratch:
In exchange, any changes to local_dependencies can skip the slowest
upstream_thirdparty_assembly step, and only run upstream_assembly and assembly:
And changes to current_module can skip both upstream steps, only running the fast
assembly step:
Building an assembly “clean” requires running all three steps and is just
as slow as the naive one-step assembly creation, as is the case where you change
third-party dependencies. But in practice these scenarios tend to happen relatively
infrequently: perhaps once a day, or even less. In contrast, the scenario where you
are changing code in local modules happens much more frequently, often several times
a minute while you are working on your code, adding printlns or tweaking its behavior.
Thus, although the worst case of building an assembly with Mill is no better than other
tools, the average case can be substantially better with these optimizations.
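The step-skipping above relies on each step's output being cached against its
inputs. Below is a minimal sketch of that idea, not Mill's actual task-caching
machinery: a hypothetical StepCache.cached helper re-runs a step only when a hash
of its input files differs from the hash recorded the last time the step's jar
was built.

import java.nio.file.{Files, Path, Paths}
import java.security.MessageDigest

object StepCache {
  // Hash the names and contents of all input files for a step.
  private def hashOf(inputs: Seq[Path]): String = {
    val md = MessageDigest.getInstance("SHA-256")
    inputs.sortBy(_.toString).foreach { p =>
      md.update(p.toString.getBytes)
      md.update(Files.readAllBytes(p))
    }
    md.digest().map("%02x".format(_)).mkString
  }

  // Run `run` to (re)build `out` only if the inputs changed since last time;
  // the previous input hash is stored next to the jar in a ".inputhash" file.
  def cached(out: Path, inputs: Seq[Path])(run: => Unit): Path = {
    val marker  = Paths.get(out.toString + ".inputhash")
    val current = hashOf(inputs)
    val upToDate =
      Files.exists(out) && Files.exists(marker) &&
        new String(Files.readAllBytes(marker)) == current
    if (!upToDate) {                       // inputs changed (or first build): do the work
      run
      Files.write(marker, current.getBytes)
    }                                      // otherwise: skip the step, re-use the cached jar
    out
  }
}

In this model, appending class dummy to Foo.scala changes only the inputs of the
final assembly step, so the two upstream jars pass their hash check and are
re-used as-is.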
Efficiently Updating Assembly Jars In Theory
One core assumption of the section above is that creating a new assembly jar
based on an existing one with additional files included is fast. This is not
true for every file format – e.g. .tar.gz files are just as expensive to append to
as they are to build from scratch, as you need to de-compress and re-compress the whole
archive – but it is true for .jar archives.
The key here is that .jar archives are just .zip files by another name, which
means two things:
- Every file within the .jar is compressed individually, so adding additional files does not require re-compressing the existing files
- The zip index storing the offsets and metadata of each file within the jar is stored at the end of the .jar file, meaning it is straightforward to overwrite the old index with the additional files and then write a new index after those new files, without needing to move the existing files around the archive.
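Both properties are easy to check against the assembly jars built earlier. The
sketch below (the InspectJar name and the default jar path are just examples)
prints each entry's individual compression method and compressed size, then scans
backwards from the end of the file for the ZIP end-of-central-directory record to
locate where the index begins.

import java.io.RandomAccessFile
import java.util.zip.ZipFile

object InspectJar {
  def main(args: Array[String]): Unit = {
    val jarPath = args.headOption.getOrElse("out/assembly.dest/out.jar")

    // 1. Every entry is compressed individually, with its own method and size.
    val zf = new ZipFile(jarPath)
    try {
      val entries = zf.entries()
      var shown = 0
      while (entries.hasMoreElements && shown < 5) {
        val e = entries.nextElement()
        println(f"${e.getName}%-40s method=${e.getMethod} compressed=${e.getCompressedSize}b")
        shown += 1
      }
    } finally zf.close()

    // 2. The index (central directory) sits at the end of the file. Its location
    // is recorded in the end-of-central-directory record, found by scanning
    // backwards for the signature PK\x05\x06 (0x504b0506 as a big-endian int).
    val raf = new RandomAccessFile(jarPath, "r")
    try {
      val len = raf.length()
      var pos  = len - 22                   // the EOCD record is at least 22 bytes
      var eocd = -1L
      while (pos >= 0 && eocd < 0) {
        raf.seek(pos)
        if (raf.readInt() == 0x504b0506) eocd = pos
        pos -= 1
      }
      require(eocd >= 0, s"$jarPath does not look like a zip/jar")
      // Little-endian field readers for the EOCD record.
      def u16() = { val b0 = raf.read(); val b1 = raf.read(); b0 | (b1 << 8) }
      def u32() = u16().toLong | (u16().toLong << 16)
      raf.seek(eocd + 10)                   // skip signature and disk-number fields
      val totalEntries = u16()              // total central directory records
      val cdSize       = u32()              // central directory size in bytes
      val cdOffset     = u32()              // where the central directory starts
      println(s"$totalEntries entries; index occupies bytes $cdOffset..${cdOffset + cdSize} of $len")
    } finally raf.close()
  }
}

Running it against a valid jar should show the central directory occupying the
tail of the file, immediately before the end-of-central-directory record, which is
exactly the region a tool can overwrite with new entries before appending a fresh
index.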
Visually, a zip file laid out on disk looks something like this, with each file,
e.g. Foo.class or MANIFEST.MF, compressed separately: