Rob Patro
November 25, 2022
rust
programming
bioinformatics
tools
computer science
As has been well noted on the interwebs, I am a staunch advocate of the Rust
programming language. This is particularly true in my home domain of bioinformatics and computational biology.
In fact, so persistent am I in my advocacy for the use of Rust in bioinformatics applications that some of my colleagues
have claimed I am an overfit bot.
So, an obvious question a reader may ask is “why?” Why am I so zealous in my advocacy for Rust and its use in
bioinformatics? What benefits do I think it provides over the alternatives? What even are the alternatives?
Are there places in bioinformatics where I don’t think Rust is the right choice?
I intended this post to be an initial foray into addressing these questions, but quickly realized that, at least with
the current queue of things that need my attention, a sprawling, long-form blog post was not the way to go.
Therefore, I am hopeful that this will become a series of posts, each one small and bite-sized, that together
lay out some of my thoughts on the use of Rust in bioinformatics, each touching on a smaller piece of the whole
picture. Yet, my prior blogging discipline is lacking, so I will make no promises at this point about the length or
frequency of this series. So, without further ado, let’s get started.
In today’s post, I simply wish to define the problem space that is usually implicit in my comments when I advocate
for the use of Rust in bioinformatics.
Defining the problem space
When I advocate for the use of Rust as a language for developing bioinformatics methods and tools, I do so from my own
particular perspective. Bioinformatics is a giant field, with many different sub-disciplines, problem areas, and methodological
approaches. Perhaps Rust is applicable everywhere here, but that is not the argument I mean to make. Rather, I would like
to advocate for the use of Rust when developing tools and methods for data-intensive, high-throughput analysis.
The types of applications I have in mind are sequence indexing, read mapping and alignment, genome and transcriptome assembly, bulk and
single-cell RNA-seq and metagenomic quantification, etc. These applications are characterized by a need to process a large volume
of input data, often what many would consider raw input data. These applications have several characteristics that tie them together.
They often require reading and sometimes writing large quantities of data. They often require low-level and/or binary parsing and
interpretation of records. Given the problem sizes we encounter in practice, these are usually applications where memory usage is a
concern, and so they often rely on efficient (in space and time) implementation-specific data structures. Further, in many such applicat