Introduction
The sad reality of physics is that you don’t have a say. Computers will crash, hard drives will fail, and your cat will unplug your network cable — facts.
Redpanda is a new streaming storage engine, designed from the ground up to keep your data safe despite the reality of physics. We use formally verified protocols like Raft and two-phase commit to remain consistent and available during failures such as network partitions, slow disks, faulty filesystems, etc.
But the fact that our chosen protocols work in theory doesn’t guarantee that an implementation won’t contain optimization-induced bugs. We need independent and empirical evidence of correctness.
In this post, we’ll discuss in detail how Redpanda fared in our Jepsen Report. We’re sharing this because we believe the more transparent we are with our community, the better Redpanda will serve the developers and engineers who want an insanely fast event streaming platform that’s also simple to use. With that, let’s jump into what Jepsen testing is and what it found.
What is Jepsen?
Jepsen is a company that provides auditing services in the domain of distributed systems. Software development teams partner with Jepsen to check that their software does what they say it does.
Writing correct programs is a challenge, and writing concurrent programs is even more challenging. Writing “correct” distributed programs is a next-level challenge because it requires that engineers think not only about the edge cases and implicit memory state, but also about the implicit state of the hardware and network.
Since mistakes in this domain are common, passing a Jepsen audit acts as a public quality indicator.
Typically, a Jepsen partnership lasts several months and includes analyzing the documentation, writing custom test harnesses, and running multiple tests with specialized fault injections. Jepsen uses an open source framework (also called Jepsen) to identify consistency issues. It’s useful to think about the framework the same way we think about unit testing frameworks – JUnit, ScalaTest, and unittest. They simplify writing unit tests, but the real value comes from the tests and not the libraries.
The same holds true with Jepsen. The framework may be used by a company on its own, but the real value comes from the expertise of the people behind it and their ability to tune the framework and write new tests covering a system.
Redpanda partnered with Jepsen and worked with them for several months to make sure that, by the end of the partnership, Redpanda had no known consistency issues.
Redpanda’s Jepsen results
How did Redpanda fare in its Jepsen testing? Redpanda is a safe system without known consistency problems. The consensus layer is solid. The idempotency and transactional layers had issues that we have already fixed. The only consistency findings we haven’t addressed reflect unusual properties of the Apache Kafka® protocol itself, rather than of a particular implementation.
Let’s review the Jepsen report. The discovered issues can be classified into two categories – consistency and availability – based on which fundamental property of distributed systems they affect: safety or liveness.
Safety
The safety property requires that something bad never happens, specifically that a system doesn’t violate its specification. For example:
- Under RYW (Read-Your-Writes), a read following a write issued in the same session must see the effects of the write (unless it’s overwritten).
- Under linearizability, the effects of updates and reads are totally ordered, and that order is consistent with the real-time (wall-clock) order of the operations.
- Under SI (Snapshot Isolation), a multi-read must return data belonging to the same snapshot.
Safety is like trust in that even a single incident compromises it.
The practical implication of having a consistent system is that an application developer can rely on the guarantees instead of writing code to work around anomalies.
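To make the read-your-writes guarantee a little more concrete in Kafka-protocol terms, here is a minimal, illustrative sketch using the standard Apache Kafka Java client (the broker address, topic name, and single-partition assumption are ours, purely for illustration, not taken from the report): it produces one fully acknowledged record and then checks that a read at the returned offset actually observes it.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.TopicPartition;

public class ReadYourWritesCheck {
    public static void main(String[] args) throws Exception {
        String bootstrap = "localhost:9092"; // assumption: a local broker speaking the Kafka protocol
        String topic = "ryw-check";          // assumption: a pre-created single-partition topic

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", bootstrap);
        producerProps.put("acks", "all"); // wait until the write is fully acknowledged
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        long offset;
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            RecordMetadata meta = producer.send(new ProducerRecord<>(topic, "k", "v")).get();
            offset = meta.offset(); // the acknowledged position of our write
        }

        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", bootstrap);
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            TopicPartition tp = new TopicPartition(topic, 0);
            consumer.assign(Collections.singletonList(tp));
            consumer.seek(tp, offset); // start reading exactly at our write
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            boolean seen = false;
            for (ConsumerRecord<String, String> r : records) {
                if (r.offset() == offset && "v".equals(r.value())) {
                    seen = true;
                }
            }
            // Under read-your-writes, an acknowledged write must be visible to a subsequent read.
            System.out.println(seen ? "write is visible" : "possible RYW violation (or timeout)");
        }
    }
}
```

A real test harness generalizes this idea: many concurrent producers and consumers, fault injection while they run, and a checker that analyzes the recorded histories afterwards.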
Jepsen testing confirmed the consistency bugs we had already found with in-house chaos tests (3039, 3003, 3036), revealed new safety problems, and verified that we fixed all of them. Overall, the report mentions the following consistency issues:
The aforementioned “by design” issues (highlighted in pink) are fundamental parts of the Kafka protocol. Viewed through the lens of a database transactional model, they might look like safety violations, but they are not. Instead, they are the result of a design decision that affects all Kafka protocol implementations, including Kafka, Redpanda, and Pulsar, and that behavior is familiar to Kafka practitioners.
We discuss write cycles below; KAFKA-13574 is described in the linked issue, and the internal non-monotonic polls are caused by degenerate group rebalancing (see the report).
Liveness
Safety doesn’t cover non-functional requirements. A system may reject every request and still be deemed 100% safe — the system doesn’t make progress, so it can’t be caught lying.
But when a system is down, it is far from useful. The liveness property covers this gap and states that a system should make progress. Unfortunately, total availability is impossible to achieve with consistent systems. Even when the system is perfect and doesn’t have bugs, a faulty network may completely stall progress. See “Impossibility of Distributed Consensus with One Faulty Process” (the FLP result) for details.
This impossibility result shifts the discussion into the probability domain and is the reason we talk about the number of nines in the context of high availability while never reaching 100%. Jepsen revealed the following availability issues in Redpanda:
We’re still investigating the highlighted issues. Fortunately, in terms of impact and frequency, those availability issues give us more leeway in when we address them, especially compared to consistency bugs: they don’t cause data loss and don’t reorder events.
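If the “number of nines” shorthand is unfamiliar, here is a small back-of-the-envelope sketch (the nine counts and the one-year window are illustrative assumptions, not SLA figures) of how each additional nine shrinks the allowed downtime:

```java
// Back-of-the-envelope: allowed downtime per year for N nines of availability.
// Assumes a 365-day year; the numbers are illustrative, not an SLA statement.
public class Nines {
    public static void main(String[] args) {
        double minutesPerYear = 365.0 * 24 * 60;
        for (int nines = 2; nines <= 5; nines++) {
            double availability = 1.0 - Math.pow(10, -nines); // e.g. 3 nines -> 0.999
            double allowedDowntimeMinutes = (1.0 - availability) * minutesPerYear;
            System.out.printf("%d nines (%.3f%%): ~%.1f minutes of downtime per year%n",
                    nines, availability * 100, allowedDowntimeMinutes);
        }
    }
}
```

Three nines already means less than nine hours of downtime per year, and each extra nine divides that budget by ten.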
What did we learn from the partnership?
Kyle Kingsbury, creator of the Jepsen testing framework, helped us to look at the Kafka transactional protocol with a fresh set of eyes and to recognize the fundamental differences between Kafka and database transactional models.
Write-write conflicts in the Kafka protocol
The database world has isolation levels to describe the variations between different transaction models: read committed, snapshot isolation, serializability, strict serializability, etc. One level is stronger than another when it prevents more anomalies. For example, read committed is the weakest isolation level because