Understanding Paxos Consensus
In distributed systems, it is often essential to make sure that all nodes in a cluster share the same information about the system's state. We can use commitment protocols (e.g. 2PC, 3PC) to synchronize nodes, although reliance on a centralized unit of control makes this approach susceptible to failures. Moreover, commits can stall indefinitely when nodes become unresponsive. Paxos allows any node to act as the transaction coordinator and can make progress even if nodes fail and messages get lost. The algorithm uses the Synod consensus at its core, with acceptors, proposers, and learners communicating with each other. Below, we will take an incremental approach to build up the complete machinery of the algorithm.
Synod Algorithm
The Synod algorithm provides the core consensus mechanism in Paxos. It uses a voting-based procedure to ensure consistent application of changes to the state of the system. The proposer node submits change requests, which are committed only if a majority of acceptors agree to apply them. Once accepted, all nodes should be able to see the change. The voting procedure must enforce a set of restrictions because nodes and networks are susceptible to arbitrary failures. To illustrate the constraints imposed by the Synod algorithm, let us consider a scenario with a cluster of 6 nodes, where node A acts as a proposer and nodes B-F function as acceptors.
R1 | Unique Identifiers
The proposer can send a value to a set of acceptors, which can accept the submitted change. Once a large enough set of acceptors agrees on the proposed value, the system will update its state. Previous commits might be overwritten in this procedure, leading us to the first requirement imposed by the algorithm:

> R1: Each proposal must carry a unique, totally ordered identifier.

Using `(sequence_id, node_id)` tuples provides a distinct identifier for each proposed value. If two nodes submit a proposal with the same `sequence_id`, we can order them based on the `node_id`:

`(13, Adam) < (14, Benny) < (14, Susan)`
An alternative approach is to use slotted timestamps, with each node using its own pre-allocated range.
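To make the ordering concrete, a proposal identifier can be modeled as a plain tuple, which Python compares lexicographically out of the box. The following is a minimal sketch; the names `ProposalId` and `next_proposal_id` are illustrative, not part of the original algorithm description:

```python
from typing import NamedTuple

class ProposalId(NamedTuple):
    """Unique, totally ordered proposal identifier (rule R1)."""
    sequence_id: int
    node_id: str  # breaks ties between proposers that share a sequence_id

# Tuples compare lexicographically: sequence_id first, then node_id.
assert ProposalId(13, "Adam") < ProposalId(14, "Benny") < ProposalId(14, "Susan")

def next_proposal_id(last_seen: ProposalId, my_node_id: str) -> ProposalId:
    """Pick an identifier strictly greater than any identifier seen so far."""
    return ProposalId(last_seen.sequence_id + 1, my_node_id)
```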
R2 | Quorums Must Overlap
To explain the second rule enforced by the Synod algorithm, let us consider a proposal that needs a quorum of 2 votes to pass. The proposed value is committed if, for example, nodes B and C accept the proposal issued by A, even if nodes D, E, F are temporarily unavailable. Now, if nodes A, B, C suddenly fail, and D, E, F become available, it is possible that D, E, F never learn about the changes committed previously.
In order to prevent such scenarios, we must make sure that there is an overlap between nodes participating in any two voting procedures.
In the example above, information can propagate if we require that at least one node from the initial vote is available during the second ballot. We always satisfy the requirement imposed by rule R2 if we require quorums to be formed by majorities only: any two majorities of the same cluster necessarily overlap in at least one node.
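The overlap property of majorities can be checked exhaustively for our six-node cluster; a small illustrative script (not from the original text):

```python
from itertools import combinations

nodes = {"A", "B", "C", "D", "E", "F"}
majority = len(nodes) // 2 + 1  # 4 out of 6 nodes

# Any two majority quorums share at least one node, so at least one
# participant of a previous ballot is present in every later ballot.
for q1, q2 in combinations(combinations(sorted(nodes), majority), 2):
    assert set(q1) & set(q2), "two majorities must overlap"
```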
R3 | Choice is Permanent
With constraints R1 and R2 in place, we can order changes to the system's state and make sure that information can spread between nodes. However, because information might spread at arbitrary rates, there is one more issue to consider. Let us suppose that node A has proposed a change `x=53`, which was accepted by a quorum of C, D, E, F. Now, assume that the responses from the acceptors get delayed, and proposer B issues another proposal `x=1` in the meantime. If nodes C, D, E, F all accept B's change, the value committed in the system might have already changed by the time A receives the acceptances corresponding to its vote.
Therefore, to make the system converge to a unified state, we must impose one more requirement:

> R3: If a proposal with value `v` is chosen, then every higher-numbered proposal issued by any proposer has value `v`.

Because of rule R2, if a value was committed before, there will always be at least one node with that value in the current quorum. Considering the order requirement from rule R1, we can rephrase R3:
> For any `v` and `n`, if a proposal with value `v` and number `n` is issued, then there is a set `S` consisting of a majority of acceptors such that:
> - no acceptor in `S` accepted any proposal numbered less than `n`, OR
> - `v` is the value of the highest-numbered proposal among all proposals numbered less than `n` accepted by an acceptor in `S`.

Applying constraint R3 to our example, we can see the system converging into a consistent state:
Once a value is committed, the proposer must stick with it in future ballots. This is how consensus on a value propagates to other nodes, and no inconsistencies can arise.
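In code, rule R3 reduces to how a proposer selects the value it is allowed to propose after hearing back from a quorum. A minimal sketch, with assumed names:

```python
from typing import List, Optional, Tuple

# A response carries the highest-numbered proposal an acceptor already
# accepted (None if it accepted nothing) together with its value.
Response = Tuple[Optional[int], Optional[str]]

def choose_value(responses: List[Response], my_value: str) -> str:
    """Rule R3: if any acceptor in the quorum reported an accepted proposal,
    adopt the value of the highest-numbered one; otherwise the proposer is
    free to pick its own value."""
    accepted = [(n, v) for n, v in responses if n is not None]
    if accepted:
        _, value = max(accepted)  # highest-numbered accepted proposal wins
        return value
    return my_value
```

For instance, `choose_value([(7, "x=53"), (None, None)], "x=1")` returns `"x=53"`: proposer B from the scenario above would be forced to re-propose A's value instead of its own.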
Proposer Algorithm
Considering constraints R1-R3 of the Synod algorithm, we can derive a description of the proposer node, with its operation divided into two stages:
- Prepare Phase - a proposer chooses a new proposal number `n > lastBallot`, updates `lastBallot = n`, and sends a request `Prepare(n)` to some set of acceptors, asking them to respond with `LastVoted(n', v, promise)` containing:
    - `n'` - the proposal with the highest number less than `n` that each acceptor accepted, if any.
    - `v` - the value corresponding to proposal `n'`.
    - `promise` - a promise not to accept a proposal numbered less than `n`. This makes sure that `n'` does not become invalid between the time when the proposer receives the `LastVoted()` response and the time it issues a proposal to the acceptor.

  If the proposer cannot get the required responses, it will time out and can retry with another identifier.
- Accept Phase - if the proposer receives the requested responses from a majority of the acceptors during the prepare phase, then it can issue a proposal with number `n` and value `v`, where:
    - `v` is the value of the highest-numbered proposal among the responses, OR
    - `v` is any value selected by the proposer if the responders reported no proposals.

  The proposer writes the proposal into memory [important for failure recovery] and sends out an `Accept(n, v)` to all acceptors. When the proposer receives a majority of `Voted(n)` messages concerning a particular proposal, it records `v` in its memory. If the proposer times out while waiting for the `Voted(n)` messages, it can retry by issuing a proposal with a new identifier. Both phases are sketched below.
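Here is a compact, single-process sketch of the proposer. Message passing is reduced to direct method calls on acceptor stubs, and ballot numbers are plain integers rather than the `(sequence_id, node_id)` tuples of R1; all class and method names are assumptions for illustration:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class LastVoted:
    n_prime: Optional[int]  # highest-numbered proposal accepted below n, if any
    v: Optional[str]        # value corresponding to n_prime

class Proposer:
    def __init__(self, acceptors: List["Acceptor"]):
        self.acceptors = acceptors  # acceptor stubs exposing .prepare/.accept
        self.last_ballot = 0        # persisted to stable storage in practice

    def propose(self, my_value: str) -> Optional[str]:
        # Prepare phase: choose n > lastBallot and collect LastVoted promises.
        self.last_ballot += 1
        n = self.last_ballot
        promises = [p for p in (a.prepare(n) for a in self.acceptors)
                    if p is not None]
        if len(promises) <= len(self.acceptors) // 2:
            return None  # no majority: time out, retry with a higher n

        # Accept phase: rule R3 constrains the value we may propose.
        accepted = [(p.n_prime, p.v) for p in promises if p.n_prime is not None]
        value = max(accepted)[1] if accepted else my_value

        votes = sum(1 for a in self.acceptors if a.accept(n, value))
        return value if votes > len(self.acceptors) // 2 else None
```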
Acceptor Algorithm
The acceptor can receive the `Prepare(n)` and `Accept(n, v)` requests from the proposer, and is free to respond to or ignore any of them without affecting consistency:
- Prepare Phase - the acceptor can receive `Prepare(n)` with a sequence number `n` which can fall into one of two ranges:
    - If `n` is greater than that of any `Accept()` it has already voted for [`n > lastAccepted`], the acceptor updates `lastPrepare = n` and responds with `LastVoted(n', v, promise)`, which includes:
        - `n'` - the highest-numbered proposal it accepted, `n' = lastAccepted`, or `null`
        - `v` - the value corresponding to `n'`
        - `promise` - a promise not to accept any requests `< n`
    - If `n` is smaller than `lastAccepted` on the acceptor, then it should ignore the `Prepare()` request, since it won't vote for a corresponding `Accept()` request anyway. To improve performance, it can inform the proposer about its state, so that the proposer can abandon the proposal early on.
- Accept Phase - the acceptor can accept a proposal numbered `n` if it has not responded to a `Prepare()` with a higher identifier in the meantime. This can be ensured by checking that `n == lastPrepare` holds. If it decides to accept, it records the vote in its memory, updates `lastAccepted = n`, and sends a `Voted(n)` message to the proposer and all learners. A matching sketch follows below.
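A matching acceptor sketch under the same assumptions, reusing the `LastVoted` record from the proposer sketch:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Acceptor:
    last_prepare: int = 0                # highest Prepare(n) it promised
    last_accepted: Optional[int] = None  # highest proposal it voted for
    value: Optional[str] = None          # value of that proposal

    def prepare(self, n: int) -> Optional[LastVoted]:
        if n < self.last_prepare:
            return None  # stale prepare; could also notify the proposer
        self.last_prepare = n
        return LastVoted(self.last_accepted, self.value)

    def accept(self, n: int, v: str) -> bool:
        # Vote only if no higher-numbered Prepare() arrived in the meantime.
        if n != self.last_prepare:
            return False
        self.last_accepted, self.value = n, v
        return True  # the full protocol also sends Voted(n) to all learners
```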
Learner Algorithm
After acceptors decide to vote for a proposal, they additionally send a `Voted(n)` message to all learner nodes. Learners know the number of acceptors in the system, so they can recognize when a majority has voted for a proposal and update their state accordingly (see the sketch after the note below). Due to message loss, there might be cases where learners do not discover a committed value. In such cases, they can prompt the proposer to issue a proposal, which guarantees that the value gets committed.
Note
Issuing a proposal for an uncommitted value commits that value if successful.
Issuing new proposals for an already committed value does not affect the state - consensus is always on the value, not on the proposal id.
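A learner can be sketched as a vote counter keyed by proposal number. Note that `Voted(n)` in the text carries only the number; here the vote carries the value as well for simplicity, and all names are assumptions:

```python
from collections import defaultdict
from typing import Optional

class Learner:
    def __init__(self, num_acceptors: int):
        self.num_acceptors = num_acceptors
        self.votes = defaultdict(set)      # proposal number -> voting acceptors
        self.chosen: Optional[str] = None  # set once a value is committed

    def on_voted(self, n: int, value: str, acceptor_id: str) -> None:
        """Count Voted(n) messages; duplicates are harmless because the
        set deduplicates acceptors."""
        self.votes[n].add(acceptor_id)
        if len(self.votes[n]) > self.num_acceptors // 2:
            self.chosen = value
```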
Complete System
Until now, we have discussed constraints R1-R3 and the function of proposers, acceptors, and learners. As a next step, let us look at an example illustrating the operation of the Synod consensus, involving proposers A and B, acceptors C, D, E, and a learner F.
- Node A proposes a value of `x=7`, which times out because it is accepted only by C, with acceptor D down and E unreachable. B proposes a value of `x=55`, but it also times out since only acceptor E is available.
- Proposer A learns from acceptors C and E that the highest-numbered proposal either of them voted for carried `x=55`, which becomes the only value that A can propose now [see R3]. A proposal for `x=55` is issued by A and is accepted by both C and E. This is picked up by the learner, which can update its internal state.
- Proposer B learns that acceptor D, which just became available, has not voted for anything yet. This does not have much influence on the behavior of the system, since C and E already voted for `x=55`. Therefore, B can only issue a proposal for `x=55`, which is accepted by all nodes.
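For a failure-free variant of this run, the illustrative classes sketched in the previous sections can be wired together in a single process:

```python
# Three in-process acceptors (C, D, E) and a learner F observing votes.
acceptors = {name: Acceptor() for name in "CDE"}
learner = Learner(num_acceptors=len(acceptors))

proposer_a = Proposer(list(acceptors.values()))
chosen = proposer_a.propose("x=7")

# Forward the Voted() messages the acceptors would send to the learner.
for name, acc in acceptors.items():
    if acc.last_accepted is not None:
        learner.on_voted(acc.last_accepted, acc.value, name)

assert chosen == learner.chosen == "x=7"
```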
A Paxos node can take on multiple roles, acting as proposer, learner, and/or acceptor all at once.
Consistency
Message loss cannot corrupt the state of the system, since it can only prevent an action from happening. Receiving multiple copies of a message can cause an action to be repeated. This also has no effect on consistency, as sending several `Voted(n)` messages has the same effect as sending just one. The only way for consistency to be affected is if a proposer has to resend the same change under new ballots because of some failure on its side. However, the proposer prevents this issue by storing the information about `lastBallot` in memory, so it can recover to its original state.
Progress
The algorithm discussed up to this point ensures consistency of the system's state. However, we can construct a scenario where two proposers repeatedly send proposals with increasing identifiers. None of those proposals is ever chosen, because acceptors will drop any proposal with an identifier lower than the last one they received [see R3].
To guarantee progress, we need to keep such contention in check. One potential solution is to designate a distinguished node as the only proposer in the system. If the distinguished proposer can communicate with a majority of the acceptors, it will eventually be able to issue a proposal with an identifier high enough to be accepted. A leader election algorithm can be used to pinpoint which node should act as the proposer. In systems where multiple proposers are needed, we can limit contention using back-off techniques, as sketched below.
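As an illustration of the back-off idea (the concrete policy below is an assumption, not prescribed by the text):

```python
import random
import time

def propose_with_backoff(proposer, value, max_attempts=8, base_delay=0.05):
    """Retry a failed proposal after a random, exponentially growing pause,
    so competing proposers eventually stop invalidating each other's ballots."""
    for attempt in range(max_attempts):
        result = proposer.propose(value)
        if result is not None:
            return result  # some value (not necessarily ours) was chosen
        time.sleep(random.uniform(0, base_delay * 2 ** attempt))
    return None  # give up; a distinguished proposer avoids this entirely
```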
From Synod to Paxos
A single run of the Synod algorithm aims to reach consensus on a single value `v`. Once nodes agree on a certain proposal, the system cannot progress to another value. Paxos generalizes this approach to an arbitrary sequence of values by running a separate instance of the Synod algorithm for each value `vk`. To illustrate the operation of the full Paxos algorithm, we start with a system in the following state:
Note
The entry `Ek` is not the same as the identifier `IDk`:
- `Ek` corresponds to the k'th change applied to the state of the system.
- `Vk`/`IDk` are the latest value and identifier corresponding to the execution of `Ek`'s Synod consensus.
Let us suppose node A issues proposals for entries `E32`, `E33`, `E34`, and `E35` without waiting for them to complete. Because of several failures, it might happen that `E32` and `E35` reach consensus, while `E33` and `E34` time out.
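Conceptually, Paxos turns the single agreed value into a log of independent Synod instances keyed by entry index. A minimal sketch of that bookkeeping (structure and names assumed):

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class SynodInstance:
    chosen_value: Optional[str] = None  # set once this entry reaches consensus
    last_ballot: int = 0                # per-entry ballot counter

@dataclass
class PaxosLog:
    entries: Dict[int, SynodInstance] = field(default_factory=dict)

    def instance(self, k: int) -> SynodInstance:
        """Lazily create the Synod instance responsible for entry Ek."""
        return self.entries.setdefault(k, SynodInstance())

# E32 and E35 reach consensus while E33 and E34 time out, leaving gaps.
log = PaxosLog()
log.instance(32).chosen_value = "x=7"
log.instance(35).chosen_value = "x=55"
gaps = [k for k in range(32, 36) if log.instance(k).chosen_value is None]
assert gaps == [33, 34]
```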
Now, let us imagine that node A fails and node B is elected as the new proposer. In a distributed state machine, the operation related to the state at entry `Ek+1` might depend on all previous entries. We must therefore make sure that all entries up to `Ek` have reached consensus before the proposer can move on to process `Ek+1`. Therefore, B must synchronize its state up to the last entry seen by the system. To do so, it finds the first entry for which it does not know the status, i.e. `E32`. To determine the state of the whole system, B executes the prepare phase of the Synod algorithm for all uncertain entries from `E32` onwards. Based on the information it receives from the acceptors, node B concludes that:
- Entry `E32` was successfully committed, so it must be introduced into B's ledger.
- For entries `E33` and `E34` there is no value committed in the system. Because there are committed entries after `E34`, node B identifies `E33` and `E34` as gaps and fills those positions with a dummy value.
- Entry `E35` was successfully committed and is already available in B's memory.
- No values were committed for entries from `E36` onwards, which is where B can start processing new requests.
After B has determined the state of the system, it executes the accept phase for entries `E32`-`E35`. This brings the system into the following state:
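The recovery procedure can be sketched on top of the `PaxosLog` above; `run_synod` and `NOOP` are assumptions introduced for illustration:

```python
NOOP = "no-op"  # dummy value used to plug gaps such as E33 and E34

def recover(log: PaxosLog, first_unknown: int, last_seen: int, run_synod) -> int:
    """Bring a newly elected proposer up to date: run a Synod round for every
    uncertain entry, adopting committed values and filling gaps with a no-op.
    run_synod(k, default) is an assumed callback executing the prepare and
    accept phases for entry k and returning the value that ends up committed."""
    for k in range(first_unknown, last_seen + 1):
        inst = log.instance(k)
        if inst.chosen_value is None:
            inst.chosen_value = run_synod(k, NOOP)
    return last_seen + 1  # first free entry, where new requests can start
```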
From this point onward, proposer B can issue proposals by executing only the accept phase of the Synod algorithm, beginning at `E36`. Since it is the only proposer in the system, it can be sure that executing the prepare phase is redundant at this stage.
Optimization
A new proposer must query acceptors about the status of all unknown entries. To improve the performance of the system, we could implement a dedicated command for this task. If acceptors can report the state of a range of entries at a time, we reduce the number of needed queries.
Adding/Removing Nodes
When a node joins a cluster, it can initialize its state from a state dump fetched from centralized storage. Afterwards, it acts as a learner, synchronizing its state similarly to a newly elected proposer. Removing nodes is trivial and can be treated as a node failure.