A Paper a Week-ish #10: In Search of an Understandable Consensus Algorithm

I have been meaning to read this paper for quite some time, but for some reason, even though its entire point is proposing an understandable consensus algorithm, Paxos’ (the canonical consensus algorithm previous to this paper) reputation for being impossible to understand kept me from reading this paper whenever I was short on time – that is to say always. Probably I was assuming that an “understandable” replacement for an impossible-to-understand is still very hard to grasp and I would not have time to properly digest it. Thankfully, I could not have been wronger.

Raft, the algorithm proposed in the paper, is a solution to the replicated state machine problem. The basic premise thereof is a cluster of computers running independently but working together in what appears to be a single consistent state machine. Without auxiliary conditions, this is a quite simple goal to achieve: one could simply choose one of these computers to actually do the work and leave the rest idle. Of course this is not the point of the exercise, so we want the state machine to be available as long as the majority of these computers is still up, while still giving only correct result. The paper refers to these properties as availability and safety, respectively.

The problem is simplified vastly by requiring that one of these computers is master at any given time, and clients only talking to this master. This is achieved by a fairly simple protocol where, upon detecting the current master has failed, a server initiates an election by talking to all other servers, asking them to vote for it. Other servers will vote only for other servers that are more up-to-date. If more than half of the servers vote for it, it becomes the new master. The protocol also shows that simplicity was the main design goal of Raft: to prevent servers from competing for the vote, which may lead to elections that need to be repeated for lack of winner, the server simply waits for a random time before calling the election, which makes it unlikely that two or more servers do so at the same time.

This server is then responsible for dealing with users requests. If a user wants to update the state, this change is written into the so-called log of the master and replicated onto the other servers called followers. Under normal operation, it will be considered to be committed and persistently in the system when it is written to more than half of the cluster (some caveats apply). Any data that was committed will be available as long as more than half the servers are up at any given point. The master always makes sure that when writing an entry on another server, the entry immediately before the one written matches the one in its logs. By induction, this means the follower’s log is complete up to the newly written entry.

If any of the servers except the master fails, nothing interesting happens. The master will just no longer be able to talk to it, it will no longer write data, but updates will be resent in case it comes back. Simple failover on master failure is equally straightforward. The master dies, the remaining servers elect a new master that has the same state as the previous master (because, remember that the updates are replicated to more than half of the servers). This new master takes the place of the old one, and nothing particularly interesting happens either.

Of course, no one would care if the paper was simply about trivial failover. The interesting case arises, of course, due to more complicated series of master failures. But because of the simplicity of the algorithm this is, in fact, not much harder to understand than the trivial cases above. In this blog-post I have hand-waved “more up to date“ and “committed” a bit earlier, these two concepts are important to understand the general case. All log entries are identified by two integers: the term (the term increases every time the mastership changes), and a sequence number within that term. A server can only win the mastership election if either its term is higher than more than half of the others, or it is in the same term but has a longer (or equally long) log in it. This makes sure that the elected master always has a complete log of committed entries. An entry is committed only after the current master has written a log-entry in its term to ensure that more than half of the followers catch up with the master.

Find the paper here: Raft Algorithm Paper.