Even after 10 years of programming, I am still relentlessly curious about new software algorithms: I read papers and blog posts, and learn from other engineers. The best part, however, comes when you get the opportunity to implement an algorithm and even customize it for your specific use case. In this post, I will walk you through my journey from understanding cluster membership fundamentals to implementing Chitchat, our Rust implementation of the Scuttlebutt algorithm with a phi accrual failure detector.
I will first introduce the subject of cluster membership and give some details about the membership algorithm currently used in Quickwit (SWIM), and explain why we decided to move away from it despite it being conceptually slightly faster than its successor. Then, I will explore in depth how Scuttlebutt and the failure detector algorithms work, and how we have implemented them. Finally, I will describe a couple of real-world issues we encountered and how we solved them.
What is cluster membership?
Given a cluster of nodes, cluster membership is the sub-system that allows each node to know the list of its peers. It detects node failures and eventually makes all the other nodes aware that a failed node is no longer a member of the cluster.
One common way to address this problem is to have a monitoring node in charge of checking the health of all the other nodes with a heart-beating scheme. This approach works well for a few nodes but creates hot spots as the cluster grows. Another way is to put every node in charge of monitoring. While this avoids hot spots, the existing heart-beating schemes offer different levels of scalability and accuracy: some generate a lot of network traffic, while others may take a while to converge. All these issues combined make cluster membership a tricky engineering problem.
What is SWIM and why are we moving away from it?
Since our first release, Quickwit has had a cluster membership feature in order to provide distributed search. SWIM is the algorithm currently used for this feature. It is based on the gossip style referred to as dissemination, aka "rumor-mongering". Scuttlebutt is different: it is based on another gossip style called "anti-entropy". Van Renesse et al. explain the difference between the two gossip approaches in their paper as follows:
> Anti-entropy protocols gossip information until it is made obsolete by newer information, and are useful for reliably sharing information among a group of participants. Rumor-mongering has participants gossip information for some amount of time chosen sufficiently high so that with high likelihood all participants receive the information.
Here is a real-world example we like to use to explain the difference between rumor-mongering and anti-entropy:
- Rumor-mongering: consider a piece of breaking news you just read in your local newspaper and decide to share with all your contacts. Those you inform decide in turn to inform their own contacts. This style of gossiping spreads the news very quickly. However, at some point people lose interest in the news and stop spreading it, and not everyone gets the chance to be informed.
- Anti-entropy: suppose instead that everyone in town regularly talks to a few of their contacts (3 to 5) to keep up with any news. This type of information exchange is slower because of the limited number of selected contacts. However, since everyone does this perpetually, they are all guaranteed to eventually be informed about the latest news, no matter how long it takes.
The key difference here is that with SWIM, a node may miss some of the messages propagated within the cluster, which can lead to false-positive failure detection. HashiCorp, for instance, had to extend Serf, their production-grade SWIM implementation, with Lifeguard, a set of three extensions designed to reduce false-positive failure detections.
We also struggled to find a suitable library-oriented implementation of SWIM in Rust. Though we found Artillery very useful to start with, and we want to thank all its contributors, we were looking for a more battle-tested implementation, like Serf in Go.
Moreover, we found that Scuttlebutt as an algorithm:
- Is easier to understand and implement correctly.
- Allows nodes to share/advertise information about themselves (service ports, available RAM/disk) without any special logic.
- Is battle-tested in production-grade systems such as Apache Cassandra.
How does Scuttlebutt work?
Scuttlebutt is a gossip algorithm with a reconciliation technique, fully described in this paper. In Scuttlebutt, every node keeps a local copy of the cluster state, which is a map of node ID to node state. Think of it as a key-value store namespaced by node ID, in which each node is only allowed to modify (create/update/delete) its own namespace. While gossiping, a node applies the changes it learns about its peers to its local copy, but it can never directly update another node's state. Because Scuttlebutt employs the anti-entropy gossip technique, all the nodes in the cluster eventually converge to the latest cluster state. Also, notice how the whole concept is based on a key-value store, which makes it easy for nodes to share information.
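To make these structures concrete, here is a minimal Rust sketch of a node's local copy of the cluster state. The type and field names mirror the JSON snapshot shown below; they are illustrative, not necessarily Chitchat's actual API.

```rust
use std::collections::BTreeMap;

/// A value together with the (per-node) version at which it was last set.
struct VersionedValue {
    value: String,
    version: u64,
}

/// Everything a single node advertises about itself: a plain key-value
/// map plus the highest version among its entries.
struct NodeState {
    key_values: BTreeMap<String, VersionedValue>,
    max_version: u64,
}

/// A node's local copy of the whole cluster state: one namespace
/// (NodeState) per node ID. A node only ever writes to its own namespace.
struct ClusterState {
    node_states: BTreeMap<String, NodeState>,
}
```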
The following is a JSON representation of `node-1/1647537681`'s view of the cluster state. The number following `node-1` is a timestamp; you will soon understand why we added it. Notice how each node advertises its own `grpc_address` and `heartbeat` counter.
```json
{
  "seed_nodes": [
    "127.0.0.1:7281"
  ],
  "node_states": {
    "node-1/1647537681": {
      "key_values": {
        "grpc_address": {
          "value": "0.0.0.0:7282",
          "version": 2
        },
        "heartbeat": {
          "value": "1002",
          "version": 1004
        }
      },
      "max_version": 1004
    },
    "node-2/1647537802": {
      "key_values": {
        "grpc_address": {
          "value": "0.0.0.0:8282",
          "version": 2
        },
        "heartbeat": {
          "value": "991",
          "version": 993
        }
      },
      "max_version": 993
    },
    "node-3/1647538101": {
      "key_values": {
        "grpc_address": {
          "value": "0.0.0.0:9282",
          "version": 2
        },
        "heartbeat": {
          "value": "92",
          "version": 94
        }
      },
      "max_version": 94
    }
  }
}
```
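To make the versioning in this snapshot concrete, here is a hedged sketch, building on the types from the earlier sketch, of how a node might write into its own namespace (`set_self` is a hypothetical helper, not Chitchat's actual API). Each write bumps the node's version counter and stamps the entry with it, which is presumably why `heartbeat` on `node-1` carries version 1004 for value 1002: the first versions were consumed by earlier writes such as `grpc_address`.

```rust
impl ClusterState {
    /// Write a key into this node's own namespace (hypothetical helper).
    /// Every write increments the node's version counter and stamps the
    /// entry with it, so a peer can later ask for "everything newer than
    /// version X" during reconciliation.
    fn set_self(&mut self, self_node_id: &str, key: &str, value: String) {
        let node_state = self
            .node_states
            .entry(self_node_id.to_string())
            .or_insert_with(|| NodeState {
                key_values: BTreeMap::new(),
                max_version: 0,
            });
        node_state.max_version += 1;
        node_state.key_values.insert(
            key.to_string(),
            VersionedValue {
                value,
                version: node_state.max_version,
            },
        );
    }
}
```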
The gossip protocol works as follows:
- Every second, a node randomly selects a few (3 in our case) nodes to gossip with.
- To make this node selection a bit smarter, we randomly include (see the sketch after this list):
- A seed node if not selected already
- A dead node (to determine whether it is back online)
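The following is a minimal sketch of this selection logic. The fan-out of 3 comes from the list above, but the function name and the inclusion probabilities are illustrative assumptions, not Chitchat's exact values.

```rust
use rand::seq::SliceRandom;
use rand::Rng;

/// Pick the peers for one gossip round: a few random live nodes, plus
/// (randomly) a seed node and a dead node.
fn select_gossip_peers(
    live_nodes: &[String],
    seed_nodes: &[String],
    dead_nodes: &[String],
    rng: &mut impl Rng,
) -> Vec<String> {
    // A few random live peers (gossip fan-out of 3).
    let mut peers: Vec<String> =
        live_nodes.choose_multiple(rng, 3).cloned().collect();
    // Randomly include a seed node if none was selected, so new or
    // partitioned nodes keep pulling the cluster back together.
    // The 0.5 probability is an illustrative assumption.
    if rng.gen_bool(0.5) && !peers.iter().any(|p| seed_nodes.contains(p)) {
        if let Some(seed) = seed_nodes.choose(rng) {
            peers.push(seed.clone());
        }
    }
    // Randomly include a dead node to find out whether it is back online.
    if rng.gen_bool(0.5) {
        if let Some(dead) = dead_nodes.choose(rng) {
            peers.push(dead.clone());
        }
    }
    peers
}
```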
> Info: The gossip frequency and the number of selected nodes determine how quickly the cluster state converges.