Testing Distributed Systems by ngaut

Share This Article

Sed ut perspiciatis unde.

List of resources on testing distributed systems curated by Andrey Satarin (@asatarin).
If you are interested in my other stuff, checkout talks page.
For any questions or suggestions you can reach out to me on Twitter (@asatarin) or LinkedIn.

Contents

Overview of testing approaches
Specific approaches in different distributed systems
- Amazon Web Services
- Netflix
- Twitter
- Cassandra
- ScyllaDB
- VoltDB
- MemSQL
- CockroachLabs (CockroachDB)
- PingCap (TiDB)
- MongoDB
- Cloudera
- FoundationDB
- Wallaroo Labs
- Google
- Microsoft
- Dropbox
- Atomix Copycat
- Onyx
- LinkedIn
- Druid.io
- Salesforce
- InfluxDB
- Shopify
- Confluent (Kafka)
- Elastic (Elasticsearch)
- YugabyteDB
- FaunaDB
- Hazelcast
- Basho (Riak)
- CoreOS (etcd)
- Red Planet Labs
- Coil (TigerBeetle)
Single node systems
- SQLite
- Sled
- Clickhouse
Tools

Overview of testing approaches

Research Papers

Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems — Great overview of how even simple testing can help a lot, you just need right focus
What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems — study of actual bugs in different popular distributed systems (Hadoop MapReduce, HDFS, HBase, Cassandra, ZooKeeper
and Flume)
TaxDC: A Taxonomy of Non-Deterministic Concurrency Bugs in Datacenter Distributed Systems — comprehensive taxonomy of bugs in distributed systems (Cassandra, Hadoop MapReduce, HBase, ZooKeeper)
An Empirical Study on Crash Recovery Bugs in Large-Scale Distributed Systems — based on bug database from “What Bugs Live in the Cloud?” paper reseachers focus specifically on crash recovery bugs in Hadoop MapReduce, HBase, Cassandra, ZooKeeper. There is review of this paper by Murat Demirbas in his blog.
Redundancy does not imply fault tolerance: analysis of distributed storage reactions to single errors and corruptions — study of several distributed systems (Redis, ZooKeeper, MongoDB, Cassandra, Kafka, RethinkDB) on how fault tolerant they are to data corruption and read/write errors
An empirical study on the correctness of formally verified distributed systems — study of bugs in formally verified distributed systems. Analysis includes Miscrosoft’s IronFleet distributed key-value store built from formal model.
The Case for Limping-Hardware Tolerant Clouds — research on effect of limping hardware on performance of a distributed systems (aka limplock), see also great blog post by Dan Luu on a similiar topic Distributed systems: when limping hardware is worse than dead hardware
Early detection of configuration errors to reduce failure damage — why and how to test configuration files of your system
Why Is Random Testing Effective for Partition Tolerance Bugs? — just what it says in a title, authors try to explain why random testing (Jepsen) is effective and introduce notions of test coverage relating to network partition, see also “The Morning Paper” review or a video from POPL 2018.
FlyMC: Highly Scalable Testing of Complex Interleavings in Distributed Systems — novel approach of systematically exploring interleavings in distributed systems augmented with static analysis and prioritization. This approach is faster than previous techniques and found old and new bugs in several systems (Cassandra, Ethereum Blockchain, Hadoop, Kudu, Raft LogCabin, Spark, ZooKeeper).
What bugs cause cloud production incidents? — research focused on bugs (and their resolution strategies) that actually cause production incidents in large-scale distributed services at Microsoft Azure.
Torturing Databases for Fun and Profit — checking ACID guarantees of open source and commercial databases under power loss, additional material
Toward a Generic Fault Tolerance Technique for Partial Network Partitioning — overview of netrwork partition failures in various distributed systems (MongoDB, HBase, HDFS, Kafka, RabbitMQ, Elasticsearch, Mesos, etc), common traits among them and strategies to mitigate those failures.

Technologies for Testing Distributed Systems by Colin Scott

Colin Scott shares his viewpoint from academia on testing distributed systems,
specifically regression testing for correctness and performance bugs.

Technologies for Testing Distributed Systems, Part I
See also post Distributed Systems Testing: The Lost World by Crista Lopes

Testing in a Distributed World by Ines Sombra (RICON 2014)

Great overview of techniques for testing distributed systems from practitioner, the video did age well and still extremely good overview of the landscape.
Additional materials could be found in this Github repo

Resilience In Complex Adaptive Systems

These materials are not directly related to testing distributed systems, but they greatly contribute to general understanding of such systems.

Jepsen

State of the art approach to testing stateful distributed systems.

Jepsen Analyses — most recent Jepsen analyses of different distributed systems
Jepsen Talks — talks by Kyle Kingsbury on various conferences
Aphyr’s Jepsen posts — older Jepsen analyses on Kyle Kingsbury’s (Aphyr) personal site
Jepsen Talks on Github — Jepsen talks slides before 2015 on Github
Kyle Kingsbury on InfoQ
Call me maybe: Jepsen and flaky networks — talk on Jepsen, not by Kyle
Jepsen is used by Microsoft CosmosDB — founder of Azure CosmosDB confirms, that they are using Jepsen

Elle transactional consistency checker for black-box databases:

Elle source code
Black-box Isolation Checking with Elle — talk Kyle gave at CMU DB database seminar descibing Elle and results obtained with it
Elle: Inferring Isolation Anomalies from Experimental Observations — paper on Elle design by Kyle Kingsbury and Peter Alvaro

Some notable Jepsen analyses:

Jepsen is used by CockroachDB, VoltDB, Cassandra, ScyllaDB and others.

Formal Methods

The verification of a distributed system By Caitie McCaffrey also podcast and talk on InfoQ.com and accompanying materials on GitHub and a slidedeck
Designing Distributed Systems in TLA+ by Hillel Wayne, and talk Everything about distributed systems is terrible
Comparisons of Alloy and Spin
Verdi: Formally Verifying Distributed Systems
Verdi — A framework for formally verifying distributed systems implementations in Coq
Network Semantics for Verifying Distributed Systems
Proving that Android’s, Java’s and Python’s sorting algorithm is broken (and showing how to fix it) — using formal verification to find a bug in TimSort sorting algorithm
Proving JDK’s Dual Pivot Quicksort Correct — analizying quicksort implementation in Java

Companies using TLA+ to verify correctness of algorithms:

Amazon Web Services
PingCap for TiDB
MongoDB
Microsoft for services in Azure cloud
Confluent for Apache Kafka

Lineage-driven Fault Injection

Netflix adopted lineage-driven fault injection techniques for testing microservices.

Chaos Engineering

Principles of Chaos Engineering
Free Chaos Engineering book by Netflix engineers
A curated list of awesome Chaos Engineering resources

Netflix pioneered chaos engineering discipline.

Fuzzing

There are two flavors of fuzzing. First, randomized concurrency testing, where the ordering of messages is fuzzed:

And input fuzzing, where message contents or user inputs are fuzzed:

DNS parser, meet Go fuzzer
Fuzz Testing with afl-fuzz (American Fuzzy Loop)
Randomized testing for Go and talk on this tool GopherCon 2015: Dmitry Vyukov — Go Dynamic Tools
Simple guided fuzzing for libraries using LLVM’s new libFuzzer
LibFuzzer – a library for coverage-guided fuzz testing
How Heartbleed could’ve been found — example of how fuzzing could be used for finding famous HeartBleed vulnerability

Microservices

Amazing and comprehensive overview of different strategies to test systems built with microservices by Cindy Sridharan.

Testing Microservices, the sane way

Series of blog posts specifically on testing in production — best practices, pitfaults, etc:

Game Days

Sometimes Kill -9 Isn’t Enough

Performance and Benchmarking

Your Load Generator Is Probably Lying To You
Everything You Know About Latency Is Wrong — great overview of Gil Tene`s “How NOT to Measure Latency” talk
“How NOT to Measure Latency” by Gil Tene
“Benchmarking: You’re Doing It Wrong” by Aysylu Greenberg
Performance Analysis Methodology — approaches developed by Brendan Gregg for analysing performance in systematic fashion

Test Case Reduction

Minimizing Faulty Executions of Distributed Systems — reducing the size of buggy executions to make them easier to understand. 60 minute talk here
Troubleshooting Blackbox SDN Control Software with Minimal Causal Sequences — similar to above, but requires less instrumentation.
Concurrency Debugging with Differential Schedule Projections — find and minimize concurrency bugs using program analysis. Shared memory systems are equivalent to message passing systems, so you can apply the same techniques to distributed systems.

Misc

“Simulation Testing” by Michael Nygard
Testing Distributed Systems for Linearizability
Metamorphic Testing — overview of what metamorphic testing is and where it can help.
For more details see paper “Metamorphic Testing: A Review of Challenges and Opportunities”.

Specific approaches in different distributed systems

Amazon Web Services

The Evolution of Testing Methodology at AWS: From Status Quo to Formal Methods with TLA+
Use of Formal Methods at Amazon Web Services
CACM Article “How Amazon Web Services Uses Formal Methods”
Debugging Designs by Chris Newcombie there is also a source bundle
Millions of tiny databases — has section on testing which describes several approaches: SimWorld simulation resembling approach used at Foundation DB, use of Jepsen and formal methods and game days.
Using lightweight formal methods to validate a key-value storage node in Amazon S3 — paper on verifying correctness of a new key-value storage node implementation in S3. They are using property-based testing and stateless model checking extensively to balance trade-offs and follow pragmatic approach.
I gave a talk “Formal Methods at Amazon S3” on this paper for a reading group.

Netflix

Automated failure injection (see also Lineage-driven Fault Injection):

Random/manual failure injection testing:

Netflix Simian Army
Failure Injection Testing
From Chaos to Control — Testing the resiliency of Netflix’s Content Discovery Platform
Breaking Bad at Netflix: Building Failure as a Service
GTAC 2014: I Don’t Test Often … But When I Do, I Test in Production — Netflix different testing strategies

Twitter

Cassandra

Testing Apache Cassandra with Jepsen
Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen
Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen
Jepsen Cassandra Testing on Git
Netflix A STATE OF XEN — CHAOS MONKEY & CASSANDRA from Cassandra Summit 2015
Testing Apache Cassandra with Jepsen: How to Understand and Produce Safe Distributed Systems by Joel Knighton presented at Devoxx UK 2016
Testing Apache Cassandra 4.0 — quick overview of approaches used to test next major version of Cassandra
Fallout — tool to run distributed tests as a service. It is meant to easily orchestrate cluster creation and testing tools like Jepsen, performance testing tools and others, though extention and combining them in various ways with enviromental conditions. It could run tests either locally or on large scale clusters.
Cassandra Harry — Fuzz testing tool for Apache Cassandra. Aims to provide reproducible workloads to test correctness of Apache Cassandra.
Fuzz Testing and Verification of Apache Cassandra with “Harry” — talk on Harry fuzz testing tool by Alex Petrov at ApacheCon 2021
Harry, an Open Source Fuzz Testing and Verification Tool for Apache Cassandra by Alex Petrov — blog post about Harry fuzz testing tool for Apache Cassandra and how it helps to find bugs

ScyllaDB

They published series of blog posts on testing ScyllaDB:

Scylla testing part 1: Cassandra compatibility testing
Scylla testing part 2: Extending Jepsen for testing Scylla
CharybdeFS: a new fault-injecting filesystem for software testing
Testing part 4: Distributed tests
Testing part 5: Longevity testing
Fault-injecting filesystem cookbook
Video from Scylla Summit 2017 on testing
How We Constantly Try to Bring Scylla to its Knees and slides — overview of different testing types at ScyllaDB
Project Gemini: An Open Source Automated Random Testing Suite for Scylla and Cassandra Clusters — random test generator comparing results from cluster with injected faults against single node running without faults. Works on tops of CQL API and suitable for testing any database implementing it. See also talk on Project Gemini and open source code

VoltDB

Series of post on testing at VoltDB:

How We Test at VoltDB
Testing at VoltDB: SQLCoverage — describes how they test SQL query functionality using 5 millions queries generated from templates and comparing results against HSQLDB
Testing VoltDB Against PostgreSQL
VoltDB 6.4 Passes Official Jepsen Testing — VoltDB hired Kyle Kingsbury (Jepsen) to tests their database, they share results in this post

Additional resources:

“All In With Determinism for Performa

Testing Distributed Systems by ngaut

Testing Distributed Systems by ngaut

Share This Article

Newsletter

Overview of testing approaches

Research Papers

Technologies for Testing Distributed Systems by Colin Scott

Testing in a Distributed World by Ines Sombra (RICON 2014)

Resilience In Complex Adaptive Systems

Jepsen

Formal Methods

Lineage-driven Fault Injection

Chaos Engineering

Fuzzing

Microservices

Game Days

Performance and Benchmarking

Test Case Reduction

Misc

Specific approaches in different distributed systems

Amazon Web Services

Netflix

Twitter

Cassandra

ScyllaDB

VoltDB

HackTech

Leave a comment Cancel reply

Editor's Choice

Testing Distributed Systems by ngaut

Testing Distributed Systems by ngaut

Share This Article

Newsletter

Overview of testing approaches

Research Papers

Technologies for Testing Distributed Systems by Colin Scott

Testing in a Distributed World by Ines Sombra (RICON 2014)

Resilience In Complex Adaptive Systems

Jepsen

Formal Methods

Lineage-driven Fault Injection

Chaos Engineering

Fuzzing

Microservices

Game Days

Performance and Benchmarking

Test Case Reduction

Misc

Specific approaches in different distributed systems

Amazon Web Services

Netflix

Twitter

Cassandra

ScyllaDB

VoltDB

HackTech

Leave a comment Cancel reply

Editor's Choice

Sign Up to Our Newsletter