As Elixir’s Machine Learning (ML) ecosystem grows, many Elixir enthusiasts who wish to adopt the new machine learning libraries in their projects are stuck at a crossroads of wanting to move away from their existing ML stack (typically Python) while not having a clear path of how to do so. I would like to take some time to talk to WHY I believe now is a good time to start porting over Machine Learning code into Elixir, and HOW I went about doing just this for two libraries I wrote: EXGBoost (from Python XGBoost) and Mockingjay (from Python Hummingbird).
Why is Python not Sufficient?
There’s a common saying in programming languages that no language is perfect, but that different languages are suited for different jobs. Languages such as C, Rust, and now even Zig are known for their targeting systems development, while languages such as C++, C#, and Java are more commonly used for application development, and obviously there are the web languages such as JavaScript/TypeScript, PHP, Ruby (on Rails), and more. There are gradations to these rules of course, but more often than not there are good reasons that languages tend to exist within the confines of particular use cases.
Languages such as Elixir and Go tend to be used in large distributed systems because they place an emphasis on having great support for common concurrency patterns, which can come at the cost of supporting other domains. Go, for example, has barely (if any?) support for machine learning libraries, but it’s also not trying to cater to that as a target domain. For a long time,e the same could have been said about Elixir, but over the past two or so years, there has been a massive concerted push from the Elixir community to not only have support for machine learning, but to push the envelope with the maintaining state of the art libraries that are beginning to compete with the other dominant machine learning languages – namely Python.
Python has long been the gold standard in the realm of machine learning. The breadth of libraries and the low entry barrier makes Python a great language to work with, but it does create a bit of a bottleneck. Any application that wishes to integrate machine learning has historically had only a couple of options: have a Python component or reach into the underlying libraries that power much of the Python libraries directly. Despite all the good parts of Python I mentioned before, speed and support for concurrency are not on that list. Elixir-Nx is striving to give another option – an option that can take advantage of the native distributed support that Elixir and the BEAM VM have to offer. Nx’s Nx.Serving
construct is a drop-in solution for serving distributed machine-learning models.
How to Proceed
Sean Moriarity, the co-creator of Nx, creator of Axon, and author of Machine Learning in Elixir, has talked many times about how the initial creation of Nx and Axon involved hours upon hours of reading source code from reference implementations of libraries in Python and C++, namely the Tensorflow source code. While I was writing EXGBoost and Mockingjay, much of my time, especially towards the beginning, was spent referencing the Python and C++ implementations of the original libraries. This builds a great fundamental understanding of the libraries as well as taught me how to identify patterns in Python and C++ and identify the Elixir pattern that could express the same ideas. This skill is invaluable, and the better I got at it the faster I could write. Below is a summary and key takeaways from my process of porting Python / PyTorch to Elixir / Nx.
Workflow Overview
Before I get to the examples from the code bases, I would like to briefly explain the high-level cyclical workflow I established while working on this effort, and what I would recommend to anyone pursuing a similar endeavor.
Understand the Macro System
Much like how there’s a common strategy to reading comprehension which involves reading through the entire document once to get a high-level understanding and then doing subsequent shorter reads to gain more in-depth understanding with the added context of the entire piece, you can consider doing the same when reading code. My first step was to follow the logical flow from the call of hummingbird.ml.convert
to the final result. You can use tools such as function tracers and callgraph generators to accelerate this part of the process, or manually trace depending on the extent of the codebase. I felt in my case that it was manageable to trace myself.
Read the Documentation
Once you have a general understanding of the flow and process of the original system, you can start referring to the documentation for some additional context. In my case, this lead me to the academic paper Taming Model Serving Complexity, Performance and Cost: A Compilation to Tensor Computations Approach, which was the underlying ground work and basis for their implementation. I could write a whole other blog post about the process of transcribing algorithms and code from academic papers and pseudocode, but for now just know that these are some of the most important pieces you can refer to while re-implementing or porting over a piece of source code.
Read the Source Code in Detail
This is the point in which you want to disambiguate the higher-level ideas from the first step and really gain a fine, high-resolution understanding of what is happening. There might even be some points in which you need to deconflict the source code with its documentation and/or paper reference. In those cases, the source code almost always wins, and if not, then you likely have a bug report you can file. If you see things you don’t fully understand, you don’t necessarily need to address it here, but you should make note of it and keep it in mind while working in case new details help resolve it.
Implement the New Code
At this point, you should feel comfortable enough to start implementing the code. I found this to be a very iterative process, meaning I would think I had a grasp on something, then would start working on implementing it, then would realize I did not understand it as well as I had thought and would work my way back through the previous steps.
Example
Class vs. Behaviour
As a result of the reading and comprehension I did of the Hummingbird code base, I realized fairly early on that my library was going to have some key differences. One of the main reasons for these differences was the fact that the Hummingbird c