While all JVM build tools take about the same amount of time for the initial build,
what is interesting is what happens for incremental builds. For example, below we
append a line defining class dummy to Foo.scala to force each tool to re-compile
the code and re-build the assembly:
> echo "class dummy" >> src/main/scala/foo/Foo.scala
> ./mill show assembly
".../out/assembly.dest/out.jar"
Total time: 1s
> sbt assembly
Built: .../target/scala-2.12/spark-app-assembly-0.1.jar
Total time: 20s
> ./mvnw package
Building jar: .../target/spark-app-0.1-jar-with-dependencies.jar
Total time: 22s
Here, we can see that Mill only took 1s to re-build the assembly jar, while SBT
and Maven took the same ~20s that they took the first time the jar was built.
If you play around with it, you will see that the assembly jars do contain the
classfiles associated with our newly-added code:
> jar tf out/assembly.dest/out.jar | grep dummy
foo/dummy.class
> jar tf target/scala-2.12/spark-app-assembly-0.1.jar | grep dummy
foo/dummy.class
> jar tf target/spark-app-0.1-jar-with-dependencies.jar | grep dummy
foo/dummy.class
You can try making other code changes, e.g. to the body of the Spark program itself,
and running the output jar with java -jar to see that your changes are indeed
taking effect. So the question you may ask is: how is it that Mill is able to
rebuild its output assembly jar in ~1s, while other build tools spend a whole
~20s rebuilding it?
Multi-Step Assemblies
The trick to Mill’s fast incremental rebuilding of assembly jars is to split the
assembly jar creation into three phases.
Typically, construction of an assembly jar is a slow single-step process. The
build tool has to take all third-party dependencies, local dependencies, and
the module being assembled, compress all their files, and assemble them into a .jar:
Mill instead does the assembly as a three-step process. In Mill, the
third_party_libraries, local_dependencies, and current_module are added one by one
to construct the final jar (a small sketch of this chaining follows the list below):
- Third-party libraries are combined into an upstream_thirdparty_assembly in the first step, which is slow but rarely needs to be re-run
- Local upstream modules are combined with upstream_thirdparty_assembly into an upstream_assembly in the second step, which needs to happen more often but is faster
- The current module is combined into upstream_assembly in the third step, which is the fastest step but needs to happen the most frequently
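To make the chaining concrete, here is a small sketch of the three steps as plain
Scala functions. This is an illustration rather than Mill's actual implementation:
the makeJar and appendToJar helpers, the ThreeStepAssembly object, and the output
file names are all hypothetical, and the jar-writing bodies are left as placeholders.
The point is only the shape of the pipeline: each step starts from the previous
step's cached jar and adds more files on top.

import java.nio.file.{Files, Path, Paths, StandardCopyOption}

object ThreeStepAssembly {
  // Hypothetical helpers, not Mill's real API: `makeJar` would compress every
  // input file into a fresh jar, while `appendToJar` would copy an existing jar
  // and add extra entries without re-compressing what is already inside it.
  def makeJar(out: Path, inputs: Seq[Path]): Path = {
    Files.deleteIfExists(out); Files.createFile(out) // placeholder for the slow full build
    out
  }
  def appendToJar(base: Path, out: Path, extra: Seq[Path]): Path = {
    Files.copy(base, out, StandardCopyOption.REPLACE_EXISTING) // start from the cached jar
    out                                                        // placeholder for appending `extra`
  }

  // Step 1: third-party libraries only. Slow, but its inputs rarely change, so
  // the cached upstream_thirdparty_assembly.jar is almost always re-used.
  def upstreamThirdpartyAssembly(thirdPartyJars: Seq[Path]): Path =
    makeJar(Paths.get("upstream_thirdparty_assembly.jar"), thirdPartyJars)

  // Step 2: local upstream modules, layered on top of step 1's jar.
  def upstreamAssembly(step1Jar: Path, localDeps: Seq[Path]): Path =
    appendToJar(step1Jar, Paths.get("upstream_assembly.jar"), localDeps)

  // Step 3: only the current module, layered on top of step 2's jar. This is the
  // cheapest step and the only one that must re-run after a typical code edit.
  def assembly(step2Jar: Path, currentModule: Seq[Path]): Path =
    appendToJar(step2Jar, Paths.get("assembly.jar"), currentModule)
}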
The key here is that the intermediate upstream_thirdparty_assembly and
upstream_assembly jar files can be re-used. This means that any changes to
third_party_libraries will still have to go through the slow process
of creating the assemblies from scratch:
In exchange, any changes to local_dependencies can skip the slowest
upstream_thirdparty_assembly step, and only run upstream_assembly and assembly:
And changes to current_module can skip both upstream steps, only running the fast
assembly step:
Building an assembly “clean” requires running all three steps and is just
as slow as the naive one-step assembly creation, as is the case where you change
third-party dependencies. But in practice these scenarios tend to happen relatively
infrequently: perhaps once a day, or even less. In contrast, the scenario where you
are changing code in local modules happens much more frequently, often several times
a minute while you are working on your code, adding printlns or tweaking its behavior.
Thus, although the worst case of building an assembly with Mill is no better than other
tools, the average case can be substantially better with these optimizations.
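The step-skipping above relies on each step's output being cached against its
inputs. Below is a minimal sketch of that idea, not Mill's actual task-caching
machinery: a hypothetical StepCache.cached helper re-runs a step only when a hash
of its input files differs from the hash recorded the last time the step's jar
was built.

import java.nio.file.{Files, Path, Paths}
import java.security.MessageDigest

object StepCache {
  // Hash the names and contents of all input files for a step.
  private def hashOf(inputs: Seq[Path]): String = {
    val md = MessageDigest.getInstance("SHA-256")
    inputs.sortBy(_.toString).foreach { p =>
      md.update(p.toString.getBytes)
      md.update(Files.readAllBytes(p))
    }
    md.digest().map("%02x".format(_)).mkString
  }

  // Run `run` to (re)build `out` only if the inputs changed since last time;
  // the previous input hash is stored next to the jar in a ".inputhash" file.
  def cached(out: Path, inputs: Seq[Path])(run: => Unit): Path = {
    val marker  = Paths.get(out.toString + ".inputhash")
    val current = hashOf(inputs)
    val upToDate =
      Files.exists(out) && Files.exists(marker) &&
        new String(Files.readAllBytes(marker)) == current
    if (!upToDate) {                       // inputs changed (or first build): do the work
      run
      Files.write(marker, current.getBytes)
    }                                      // otherwise: skip the step, re-use the cached jar
    out
  }
}

In this model, appending class dummy to Foo.scala changes only the inputs of the
final assembly step, so the two upstream jars pass their hash check and are
re-used as-is.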
Efficiently Updating Assembly Jars In Theory
One core assumption of the section above is that creating a new assembly jar
based on an existing one with additional files included is fast. This is not
true for every file format – e.g. .tar.gz files are just as expensive to append to
as they are to build from scratch, as you need to de-compress and re-compress the whole
archive – but it is true for .jar archives.
The key here is that .jar archives are just .zip files by another name, which
means two things:
- Every file within the .jar is compressed individually, so adding additional files does not require re-compressing the existing files
- The zip index storing the offsets and metadata of each file within the jar is stored at the end of the .jar file, meaning it is straightforward to overwrite the old index with the additional files and then write a new index after those new files, without needing to move the existing files around the archive.
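Both properties are easy to check against the assembly jars built earlier. The
sketch below (the InspectJar name and the default jar path are just examples)
prints each entry's individual compression method and compressed size, then scans
backwards from the end of the file for the ZIP end-of-central-directory record to
locate where the index begins.

import java.io.RandomAccessFile
import java.util.zip.ZipFile

object InspectJar {
  def main(args: Array[String]): Unit = {
    val jarPath = args.headOption.getOrElse("out/assembly.dest/out.jar")

    // 1. Every entry is compressed individually, with its own method and size.
    val zf = new ZipFile(jarPath)
    try {
      val entries = zf.entries()
      var shown = 0
      while (entries.hasMoreElements && shown < 5) {
        val e = entries.nextElement()
        println(f"${e.getName}%-40s method=${e.getMethod} compressed=${e.getCompressedSize}b")
        shown += 1
      }
    } finally zf.close()

    // 2. The index (central directory) sits at the end of the file. Its location
    // is recorded in the end-of-central-directory record, found by scanning
    // backwards for the signature PK\x05\x06 (0x504b0506 as a big-endian int).
    val raf = new RandomAccessFile(jarPath, "r")
    try {
      val len = raf.length()
      var pos  = len - 22                   // the EOCD record is at least 22 bytes
      var eocd = -1L
      while (pos >= 0 && eocd < 0) {
        raf.seek(pos)
        if (raf.readInt() == 0x504b0506) eocd = pos
        pos -= 1
      }
      require(eocd >= 0, s"$jarPath does not look like a zip/jar")
      // Little-endian field readers for the EOCD record.
      def u16() = { val b0 = raf.read(); val b1 = raf.read(); b0 | (b1 << 8) }
      def u32() = u16().toLong | (u16().toLong << 16)
      raf.seek(eocd + 10)                   // skip signature and disk-number fields
      val totalEntries = u16()              // total central directory records
      val cdSize       = u32()              // central directory size in bytes
      val cdOffset     = u32()              // where the central directory starts
      println(s"$totalEntries entries; index occupies bytes $cdOffset..${cdOffset + cdSize} of $len")
    } finally raf.close()
  }
}

Running it against a valid jar should show the central directory occupying the
tail of the file, immediately before the end-of-central-directory record, which is
exactly the region a tool can overwrite with new entries before appending a fresh
index.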
Visually, a zip file laid out on disk looks something like this, with each file,
e.g. Foo.class or MANIFEST.MF, compressed separately: