Bioinformaticians build specialized scientific workflows that make raw experimental data interpretable for wet lab biologists. This software is difficult to develop and plagued by a unique set of challenges, from orchestrating large compute resources to managing enormous files and deploying sparsely documented academic tools and libraries.
While a myriad of frameworks and tools have emerged to aid the development of these workflows, most bioinformaticians use Snakemake or Nextflow. Both frameworks are DSLs with dedicated language constructs that address common problems in bioinformatics: file I/O, dependency management and resource scheduling. Both are surrounded by burgeoning open-source communities that curate public examples of workflows for common assays and actively develop new language features.
Bioinformaticians who use these languages are often part of multidisciplinary scientific organizations that create new challenges. The workflows they develop need to be run at scale to process new volumes of data flowing from modern high-throughput techniques. Workflow executions and data must be tracked for provenance and reproducibility in a central source of truth. A history of every analysis from the conception of a drug program will be required when filing an IND; results cannot be scattered on local machines.
Most importantly, data and analyses need to be made accessible to biologists, not just those with computational abilities, for rapid experimentation and integration into large scale studies. This empowers scientists to explore data and hypothesize independently, increasing the productivity of research teams.
This set of problems has led to bioinformatics platform solutions with managed cloud infrastructure and wet-lab-friendly workflow interfaces. While industry Nextflow users have found a solution in Nextflow Tower, there remains no similar option for bioinformaticians who prefer Snakemake.
And there is a need.
Today, LatchBio is releasing native support for Snakemake, offering graphical interfaces, managed infrastructure and downstream analysis solutions to this Python framework.
Before analyzing the technical tradeoffs between these competing languages and diving into the mechanics of the integration, our team would first like to extend our gratitude to Johannes Koester, the creator of the Snakemake project, as well as the broader community, for building and maintaining this framework. We love developing in Snakemake and hope to see the community continue to thrive.
Disclaimer: LatchBio is not affiliated with Johannes or the Snakemake project.
Snakemake uses Python, and Python is the language of bioinformatics. It is a modern, expressive and versatile language: easy to pick up for scripting and bootstrapping small projects, yet capable of supporting industry-grade systems and library code (scanpy, biopython, scikit-learn).
Groovy, the language used in Nextflow, is an archaic scripting language developed for the JVM in the early 2000s that few bioinformaticians have heard of, let alone understand.
This alone makes a strong case for Snakemake. It is easier to find talent to build and maintain bioinformatics projects over a long period of time if the language they use is modern and widespread. Along with its ubiquity, Python’s growing popularity amongst software engineers has produced excellent dependency management tools and testing frameworks.
Nextflow’s DSL is overbearing and conflicts with Groovy in unintuitive ways. Nextflow sports channel operators that can be confused with native Groovy collection methods and whose names carry different meanings elsewhere in programming. Nextflow’s use of Groovy is also inconsistent: utility methods are not written in Groovy, and such code is hard to test and maintain. Because Nextflow is neither a complete DSL (like WDL) nor a consistent use of Groovy, it makes for a confusing development experience.
The Snakemake DSL is minimal and reads as Python. Wherever the DSL isn’t used, Python is used. Utility methods are written in Python and this code is easily testable and maintainable as packages. It is easier to reason about and develop.
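To make the point about testable utility code concrete, here is a minimal sketch of a helper that a Snakefile might import from an ordinary Python module. The function name `sample_name`, the file naming convention (`_R1`/`_R2` read tags) and the test cases are all hypothetical, chosen only for illustration; nothing here is part of Snakemake itself.

```python
from pathlib import Path

# Hypothetical helper: the name and the FASTQ naming convention are
# assumptions for illustration, not part of Snakemake.
def sample_name(fastq_path: str) -> str:
    """Derive a sample name from a FASTQ path such as 'data/S1_R1.fastq.gz'."""
    name = Path(fastq_path).name
    # Strip a known FASTQ extension, longest variants first.
    for suffix in (".fastq.gz", ".fq.gz", ".fastq", ".fq"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    # Strip a trailing read-pair tag if present.
    for tag in ("_R1", "_R2"):
        if name.endswith(tag):
            name = name[: -len(tag)]
            break
    return name


def test_sample_name():
    # A plain unit test; it needs no workflow engine to run.
    assert sample_name("data/S1_R1.fastq.gz") == "S1"
    assert sample_name("S2_R2.fq") == "S2"
    assert sample_name("run/ctrl.fastq") == "ctrl"


test_sample_name()
```

Because the helper is plain Python, it can live in a package, be imported by the workflow, and be covered by a standard test runner such as pytest.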
In Nextflow, structured data is passed around in tuples without static type checking. Nextflow also lacks typing on channels, making it difficult to share and reuse code without referring to the source.
Snakemake instead relies on wildcards in file paths, plus the ability to retrieve additional metadata with lambdas (functions of the wildcards) defined on a rule's parameters.
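As a rough sketch of the wildcard-plus-lambda pattern, the snippet below defines a function of the wildcards that looks up per-sample metadata. The sample sheet `SAMPLES`, the function name `read_group` and the rule shown in the comment are hypothetical. Outside of Snakemake, the wildcards object is imitated with `SimpleNamespace`, so the lookup logic can be run on its own.

```python
from types import SimpleNamespace

# Hypothetical sample sheet; in a real workflow this might come from
# a config file or a table.
SAMPLES = {
    "tumor": {"platform": "ILLUMINA", "library": "lib1"},
    "normal": {"platform": "ILLUMINA", "library": "lib2"},
}

# In a Snakefile, a function like this could be attached to a rule, e.g.:
#
#   rule align:
#       input: "reads/{sample}.fastq"
#       output: "aligned/{sample}.bam"
#       params: rg=lambda wildcards: read_group(wildcards)
#
# The rule above is illustrative only.
def read_group(wildcards) -> str:
    """Build a read-group label from the metadata of the matched sample."""
    meta = SAMPLES[wildcards.sample]
    return f"ID:{wildcards.sample} LB:{meta['library']} PL:{meta['platform']}"


# Stand-in for the wildcards object Snakemake would pass in.
fake_wildcards = SimpleNamespace(sample="tumor")
label = read_group(fake_wildcards)
print(label)
```

The metadata lookup is keyed by the wildcard value, so the same rule can serve every sample without passing tuples of loosely related values between steps.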
Workflow authoring is more efficient and less error-prone in Snakemake, and debugging typos and errors in your workflow is much easier. Nextflow cannot identify which line in your file has the extra comma, while Snakemake leverages Python to do that for you.
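The extra-comma case can be illustrated with plain Python, which reports the line of a syntax error when it compiles source code. This sketch only shows the behavior of the Python interpreter; how Snakemake surfaces such errors for a full Snakefile is not shown here, and the helper name `locate_syntax_error` is hypothetical.

```python
def locate_syntax_error(source: str, filename: str = "<workflow>"):
    """Return (line number, message) of the first syntax error, or None."""
    try:
        # compile() parses the source without executing it.
        compile(source, filename, "exec")
    except SyntaxError as err:
        return err.lineno, err.msg
    return None


# The second line contains an extra comma.
broken = "samples = ['a', 'b']\nthreads = [1,, 2]\n"
result = locate_syntax_error(broken)
if result is not None:
    line, message = result
    print(f"syntax error on line {line}: {message}")
```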
Nextflow .command files are difficult to find and understand. Errors and logs are much easier to find and interpret in Snakemake.
Nextflow claims to require zero configuration, but in practice it relies on multiple layers of config files. There are config files for processes, modules, workflows, labels, publishDir, parameters and so on. You can even put code inside your config files, which can introduce bugs. With such nested configuration, it is hard to tell which config file is the source of an issue.
Snakemake configuration is simple and not nested, so it does not lead to these problems.
You