Understanding Paxos Consensus
In distributed systems, it is often essential to make sure that all nodes in a cluster share the same information about the system's state. We can use commitment protocols (e.g. 2PC, 3PC) to synchronize nodes, although reliance on a centralized unit of control makes this approach susceptible to failures. Moreover, commits can stall indefinitely when nodes become unresponsive. Paxos allows any node to act as the transaction coordinator and can make progress even if nodes fail and messages get lost. The algorithm uses the Synod consensus at its core, with acceptors, proposers, and learners communicating with each other. Below, we will take an incremental approach to build up the complete machinery of the algorithm.
Synod Algorithm
The Synod algorithm provides the core consensus mechanism in Paxos. It uses a voting-based procedure to ensure consistent application of changes to the state of the system. The proposer node submits change requests, which are committed only if a majority of acceptors agree to apply them. Once accepted, all nodes should be able to see the change. The voting procedure must enforce a set of restrictions because nodes and networks are susceptible to arbitrary failures. To illustrate the constraints imposed by the Synod algorithm, let us consider a scenario with a cluster of 6 nodes, where node A acts as a proposer and nodes B-F function as acceptors.
R1 | Unique Identifiers
The proposer can send a value to a set of acceptors, which can accept the submitted change. Once a large enough set of acceptors agrees on the proposed value, the system will update its state. Previous commits might be overwritten in this procedure, leading us to the first requirement imposed by the algorithm:

> R1: Each proposal must carry a unique, totally ordered identifier.

Using `(sequence_id, node_id)` tuples provides a distinct identifier for each proposed value. If two nodes submit a proposal with the same `sequence_id`, we can order them based on the `node_id`:

`(13, Adam) < (14, Benny) < (14, Susan)`
An alternative approach is to use slotted timestamps, with each node using its own pre-allocated range.
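To make the ordering concrete, a proposal identifier can be modeled as a plain tuple, which Python compares lexicographically out of the box. The following is a minimal sketch; the names `ProposalId` and `next_proposal_id` are illustrative, not part of the original algorithm description:

```python
from typing import NamedTuple

class ProposalId(NamedTuple):
    """Unique, totally ordered proposal identifier (rule R1)."""
    sequence_id: int
    node_id: str  # breaks ties between proposers that share a sequence_id

# Tuples compare lexicographically: sequence_id first, then node_id.
assert ProposalId(13, "Adam") < ProposalId(14, "Benny") < ProposalId(14, "Susan")

def next_proposal_id(last_seen: ProposalId, my_node_id: str) -> ProposalId:
    """Pick an identifier strictly greater than any identifier seen so far."""
    return ProposalId(last_seen.sequence_id + 1, my_node_id)
```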
R2 | Quorums Must Overlap
To explain the second rule enforced by the Synod algorithm, let us consider a proposal that needs a quorum of 2 votes to pass. The proposed value is committed if, for example, nodes B and C accept the proposal issued by A, even if nodes D, E, F are temporarily unavailable. Now, if nodes A, B, C suddenly fail, and D, E, F become available, it is possible that D, E, F never learn about the changes committed previously.
In order to prevent such scenarios, we must make sure that there is an overlap between nodes participating in any two voting procedures.
In the example above, information can propagate if we require that at least one node from the initial vote is available during the second ballot. We always satisfy the requirement imposed by rule R2 if we require quorums to be formed by majorities only: any two majorities of the same cluster necessarily overlap in at least one node.
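The overlap property of majorities can be checked exhaustively for our six-node cluster; a small illustrative script (not from the original text):

```python
from itertools import combinations

nodes = {"A", "B", "C", "D", "E", "F"}
majority = len(nodes) // 2 + 1  # 4 out of 6 nodes

# Any two majority quorums share at least one node, so at least one
# participant of a previous ballot is present in every later ballot.
for q1, q2 in combinations(combinations(sorted(nodes), majority), 2):
    assert set(q1) & set(q2), "two majorities must overlap"
```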
R3 | Choice is Permanent
With constraints R1 and R2 in place, we can order changes to the system's state and make sure that information can spread between nodes. However, because information might spread at arbitrary rates, there is one more issue to consider. Let us suppose that node A has proposed a change `x=53`, which was accepted by a quorum of C, D, E, F. Now, assume that the responses from the acceptors get delayed, and proposer B issues another proposal `x=1` in the meantime. If nodes C, D, E, F all accept B's change, the value committed in the system might have already changed by the time A receives the acceptances corresponding to its vote.
Therefore, to make the system converge to a unified state, we must impose one more requirement:

> R3: If a proposal with value `v` is chosen, then every higher-numbered proposal issued by any proposer has value `v`.

Because of rule R2, if a value was committed before, there will always be at least one node with that value in the current quorum. Considering the order requirement from rule R1, we can rephrase R3:
> For any `v` and `n`, if a proposal with value `v` and number `n` is issued, then there is a set `S` consisting of a majority of acceptors such that:
> - no acceptor in `S` accepted any proposal numbered less than `n`, OR
> - `v` is the value of the highest-numbered proposal among all proposals numbered less than `n` accepted by an acceptor in `S`.

Applying constraint R3 to our example, we can see the system converging into a consistent state:
Once a value is committed, the proposer must stick with it in future ballots. This is how consensus on a value propagates to other nodes, and no inconsistencies can arise.
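In code, rule R3 reduces to how a proposer selects the value it is allowed to propose after hearing back from a quorum. A minimal sketch, with assumed names:

```python
from typing import List, Optional, Tuple

# A response carries the highest-numbered proposal an acceptor already
# accepted (None if it accepted nothing) together with its value.
Response = Tuple[Optional[int], Optional[str]]

def choose_value(responses: List[Response], my_value: str) -> str:
    """Rule R3: if any acceptor in the quorum reported an accepted proposal,
    adopt the value of the highest-numbered one; otherwise the proposer is
    free to pick its own value."""
    accepted = [(n, v) for n, v in responses if n is not None]
    if accepted:
        _, value = max(accepted)  # highest-numbered accepted proposal wins
        return value
    return my_value
```

For instance, `choose_value([(7, "x=53"), (None, None)], "x=1")` returns `"x=53"`: proposer B from the scenario above would be forced to re-propose A's value instead of its own.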
Proposer Algorithm
Considering constraints R1-R3 of the Synod algorithm, we can derive a description of the proposer node, with its operation divided into two stages:
- Prepare Phase - a proposer chooses a new proposal number `n > lastBallot`, updates `lastBallot = n`, and sends a request `Prepare(n)` to some set of acceptors, asking them to respond with `LastVoted(n', v, promise)` containing:
    - `n'` - the proposal with the highest number less than `n` that each acceptor accepted, if any.
    - `v` - the value corresponding to proposal `n'`.
    - `promise` - a promise not to accept a proposal numbered less than `n`. This makes sure that `n'` does not become invalid between the time when the proposer receives the `LastVoted()` response and the time it issues a proposal to the acceptor.

  If the proposer cannot get the required responses, it will time out and can retry with another identifier.
- Accept Phase - if the proposer receives the requested responses from a majority of the acceptors during the prepare phase, then it can issue a proposal with number `n` and value `v`, where:
    - `v` is the value of the highest-numbered proposal among the responses, OR
    - `v` is any value selected by the proposer if the responders reported no proposals.

  The proposer writes the proposal into memory [important for failure recovery] and sends out an `Accept(n, v)` to all acceptors. When the proposer receives a majority of `Voted(n)` messages concerning a particular proposal, it records `v` in its memory. If the proposer times out while waiting for the `Voted(n)` messages, it can retry by issuing a proposal with a new identifier. Both phases are sketched below.
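Here is a compact, single-process sketch of the proposer. Message passing is reduced to direct method calls on acceptor stubs, and ballot numbers are plain integers rather than the `(sequence_id, node_id)` tuples of R1; all class and method names are assumptions for illustration:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class LastVoted:
    n_prime: Optional[int]  # highest-numbered proposal accepted below n, if any
    v: Optional[str]        # value corresponding to n_prime

class Proposer:
    def __init__(self, acceptors: List["Acceptor"]):
        self.acceptors = acceptors  # acceptor stubs exposing .prepare/.accept
        self.last_ballot = 0        # persisted to stable storage in practice

    def propose(self, my_value: str) -> Optional[str]:
        # Prepare phase: choose n > lastBallot and collect LastVoted promises.
        self.last_ballot += 1
        n = self.last_ballot
        promises = [p for p in (a.prepare(n) for a in self.acceptors)
                    if p is not None]
        if len(promises) <= len(self.acceptors) // 2:
            return None  # no majority: time out, retry with a higher n

        # Accept phase: rule R3 constrains the value we may propose.
        accepted = [(p.n_prime, p.v) for p in promises if p.n_prime is not None]
        value = max(accepted)[1] if accepted else my_value

        votes = sum(1 for a in self.acceptors if a.accept(n, value))
        return value if votes > len(self.acceptors) // 2 else None
```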
Acceptor Algorithm
The acceptor can receive the `Prepare(n)` and `Accept(n, v)` requests from the proposer, and is free to respond to or ignore any of them without affecting consistency:
- Prepare Phase - the acceptor can receive `Prepare(n)` with a sequence number `n` which can fall into one of two ranges:
    - If `n` is greater than that of any `Accept()` it has already voted for [`n > lastAccepted`], the acceptor updates `lastPrepare = n` and responds with `LastVoted(n', v, promise)`, which includes:
        - `n'` - the highest-numbered proposal it accepted, `n' = lastAccepted`, or `null`
        - `v` - the value corresponding to `n'`
        - `promise` - a promise not to accept any requests `< n`
    - If `n` is smaller than `lastAccepted` on the acceptor, then it should ignore the `Prepare()` request, since it won't vote for a corresponding `Accept()` request anyway. To improve performance, it can inform the proposer about its state, so that the proposer can abandon the proposal early on.
- Accept Phase - the acceptor can accept a proposal numbered `n` if it has not responded to a `Prepare()` with a higher identifier in the meantime. This can be ensured by checking that `n == lastPrepare` holds. If it decides to accept, it records the vote in its memory, updates `lastAccepted = n`, and sends a `Voted(n)` message to the proposer and all learners. A matching sketch follows below.
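A matching acceptor sketch under the same assumptions, reusing the `LastVoted` record from the proposer sketch:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Acceptor:
    last_prepare: int = 0                # highest Prepare(n) it promised
    last_accepted: Optional[int] = None  # highest proposal it voted for
    value: Optional[str] = None          # value of that proposal

    def prepare(self, n: int) -> Optional[LastVoted]:
        if n < self.last_prepare:
            return None  # stale prepare; could also notify the proposer
        self.last_prepare = n
        return LastVoted(self.last_accepted, self.value)

    def accept(self, n: int, v: str) -> bool:
        # Vote only if no higher-numbered Prepare() arrived in the meantime.
        if n != self.last_prepare:
            return False
        self.last_accepted, self.value = n, v
        return True  # the full protocol also sends Voted(n) to all learners
```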
Learner Algorithm
After acceptors decide to vote for a proposal, they additionally send a `Voted(n)` message to all learner nodes. Learners know the number of acceptors in the system, so they can recognize when a majority has voted for a proposal and update their state accordingly (see the sketch after the note below). Due to message loss, there might be cases where learners do not discover a committed value. In such cases, they can prompt the proposer to issue a proposal, which guarantees that the value gets committed.
Note
Issuing a proposal for an uncommitted value commits that value if successful.
Issuing new proposals for an already committed value does not affect the state - consensus is always on the value, not on the proposal id.
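A learner can be sketched as a vote counter keyed by proposal number. Note that `Voted(n)` in the text carries only the number; here the vote carries the value as well for simplicity, and all names are assumptions:

```python
from collections import defaultdict
from typing import Optional

class Learner:
    def __init__(self, num_acceptors: int):
        self.num_acceptors = num_acceptors
        self.votes = defaultdict(set)      # proposal number -> voting acceptors
        self.chosen: Optional[str] = None  # set once a value is committed

    def on_voted(self, n: int, value: str, acceptor_id: str) -> None:
        """Count Voted(n) messages; duplicates are harmless because the
        set deduplicates acceptors."""
        self.votes[n].add(acceptor_id)
        if len(self.votes[n]) > self.num_acceptors // 2:
            self.chosen = value
```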
Complete System
Until now, we have discussed constraints R1-R3 and the function of proposers, acceptors, and learners. As a next step, let us look at an example illustrating the operation of the Synod consensus, involving proposers A and B, acceptors C, D, E, and a learner F.
- Node A proposes a value of `x=7`, which times out because it is accepted only by C, with acceptor D down and E unreachable. B proposes a value of `x=55`, but it also times out since only acceptor E is available.
- Proposer A learns from acceptors C and E that the highest-numbered proposal either of them voted for carried `x=55`, which becomes the only value that A can propose now [see R3]. A proposal for `x=55` is issued by A and is accepted by both C and E. This is picked up by the learner, which can update its internal state.
- Proposer B learns that acceptor D, which just became available, has not voted for anything yet. This does not have much influence on the behavior of the system, since C and E already voted for `x=55`. Therefore, B can only issue a proposal for `x=55`, which is accepted by all nodes.
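For a failure-free variant of this run, the illustrative classes sketched in the previous sections can be wired together in a single process:

```python
# Three in-process acceptors (C, D, E) and a learner F observing votes.
acceptors = {name: Acceptor() for name in "CDE"}
learner = Learner(num_acceptors=len(acceptors))

proposer_a = Proposer(list(acceptors.values()))
chosen = proposer_a.propose("x=7")

# Forward the Voted() messages the acceptors would send to the learner.
for name, acc in acceptors.items():
    if acc.last_accepted is not None:
        learner.on_voted(acc.last_accepted, acc.value, name)

assert chosen == learner.chosen == "x=7"
```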
A Paxos node can take on multiple roles, acting as proposer, learner, and/or acceptor all at once.
Consistency
Message loss cannot corrupt the state of the system, since it can only prevent an action from happening. Receiving multiple copies of a message can cause an action to be repeated. This also has no effect on consistency, as sending several `Voted(n)` messages has the same effect as sending just one. The only way for consistency to be affected is if a proposer has to resend the same change under new ballots because of some failure on its side. However, the proposer prevents this issue by storing the information about `lastBallot` in memory, so it can recover to its original state.
Progress
The algorithm discussed up to this point ensures consistency of the system's state. However, we can construct a scenario where two proposers repeatedly send proposals with increasing identifiers. None of those proposals is ever chosen, because acceptors will drop any proposal with an identifier lower than the last one they received [see R3].
To guarantee progress, we need to keep such contention in check. One potential solution is to designate a distinguished node as the only proposer in the system. If the distinguished proposer can communicate with a majority of the acceptors, it will eventually be able to issue a proposal with an identifier high enough to be accepted. A leader election algorithm can be used to pinpoint which node should act as the proposer. In systems where multiple proposers are needed, we can limit contention using back-off techniques, as sketched below.
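As an illustration of the back-off idea (the concrete policy below is an assumption, not prescribed by the text):

```python
import random
import time

def propose_with_backoff(proposer, value, max_attempts=8, base_delay=0.05):
    """Retry a failed proposal after a random, exponentially growing pause,
    so competing proposers eventually stop invalidating each other's ballots."""
    for attempt in range(max_attempts):
        result = proposer.propose(value)
        if result is not None:
            return result  # some value (not necessarily ours) was chosen
        time.sleep(random.uniform(0, base_delay * 2 ** attempt))
    return None  # give up; a distinguished proposer avoids this entirely
```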
From Synod to Paxos
A single run of the Synod algorithm aims to reach consensus on a single value `v`. Once nodes agree on a certain proposal, the system cannot progress to another value. Paxos generalizes this approach to an arbitrary sequence of values by running a separate instance of the Synod algorithm for each value `vk`. To illustrate the operation of the full Paxos algorithm, we start with a system in the following state:
Note
The entry `Ek` is not the same as the identifier `IDk`:
- `Ek` corresponds to the k'th change applied to the state of the system.
- `Vk`/`IDk` are the latest value and identifier corresponding to the execution of `Ek`'s Synod consensus.
Let us suppose node A issues proposals for entries `E32`, `E33`, `E34`, and `E35` without waiting for them to complete. Because of several failures, it might happen that `E32` and `E35` reach consensus, while `E33` and `E34` time out.
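Conceptually, Paxos turns the single agreed value into a log of independent Synod instances keyed by entry index. A minimal sketch of that bookkeeping (structure and names assumed):

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class SynodInstance:
    chosen_value: Optional[str] = None  # set once this entry reaches consensus
    last_ballot: int = 0                # per-entry ballot counter

@dataclass
class PaxosLog:
    entries: Dict[int, SynodInstance] = field(default_factory=dict)

    def instance(self, k: int) -> SynodInstance:
        """Lazily create the Synod instance responsible for entry Ek."""
        return self.entries.setdefault(k, SynodInstance())

# E32 and E35 reach consensus while E33 and E34 time out, leaving gaps.
log = PaxosLog()
log.instance(32).chosen_value = "x=7"
log.instance(35).chosen_value = "x=55"
gaps = [k for k in range(32, 36) if log.instance(k).chosen_value is None]
assert gaps == [33, 34]
```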
Now, let us imagine that node A fails and node B is elected as the new proposer. In a distributed state machine, the operation related to the state at entry `Ek+1` might depend on all previous entries. We must therefore make sure that all entries up to `Ek` have reached consensus before the proposer can move on to process `Ek+1`. Therefore, B must synchronize its state up to the last entry seen by the system. To do so, it finds the first entry for which it does not know the status, i.e. `E32`. To determine the state of the whole system, B executes the prepare phase of the Synod algorithm for all uncertain entries from `E32` onwards. Based on the information it receives from the acceptors, node B concludes that:
- Entry `E32` was successfully committed, so it must be introduced into B's ledger.
- For entries `E33` and `E34` there is no value committed in the system. Because there are committed entries after `E34`, node B identifies `E33` and `E34` as gaps and fills those positions with a dummy value.
- Entry `E35` was successfully committed and is already available in B's memory.
- No values were committed for entries from `E36` onwards, which is where B can start processing new requests.
After B has determined the state of the system, it executes the accept phase for entries `E32`-`E35`. This brings the system into the following state:
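The recovery procedure can be sketched on top of the `PaxosLog` above; `run_synod` and `NOOP` are assumptions introduced for illustration:

```python
NOOP = "no-op"  # dummy value used to plug gaps such as E33 and E34

def recover(log: PaxosLog, first_unknown: int, last_seen: int, run_synod) -> int:
    """Bring a newly elected proposer up to date: run a Synod round for every
    uncertain entry, adopting committed values and filling gaps with a no-op.
    run_synod(k, default) is an assumed callback executing the prepare and
    accept phases for entry k and returning the value that ends up committed."""
    for k in range(first_unknown, last_seen + 1):
        inst = log.instance(k)
        if inst.chosen_value is None:
            inst.chosen_value = run_synod(k, NOOP)
    return last_seen + 1  # first free entry, where new requests can start
```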
From this point onward, proposer B can issue proposals by executing only the accept phase of the Synod algorithm, beginning at `E36`. Since it is the only proposer in the system, it can be sure that executing the prepare phase is redundant at this stage.
Optimization
A new proposer must query acceptors about the status of all unknown entries. To improve the performance of the system, we could implement a dedicated command for this task. If acceptors can report the state of a range of entries at a time, we reduce the number of needed queries.
Adding/Removing Nodes
When a node joins a cluster, it can initialize its state from a state dump fetched from centralized storage. Afterwards, it acts as a learner, synchronizing its state similarly to a newly elected proposer. Removing nodes is trivial and can be treated as a node failure.