<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[:••: dotclusters]]></title><description><![CDATA[Thoughts, stories and ideas.]]></description><link>https://dotclusters.io/</link><image><url>https://dotclusters.io/favicon.png</url><title>:••: dotclusters</title><link>https://dotclusters.io/</link></image><generator>Ghost 5.33</generator><lastBuildDate>Sat, 25 Apr 2026 00:33:49 GMT</lastBuildDate><atom:link href="https://dotclusters.io/blog/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Understanding Paxos Consensus]]></title><description><![CDATA[<p>In distributed systems, it is often essential to make sure that all nodes in a cluster share the same information about the system&#x2019;s state. We can use commitment protocols (e.g. 2PC, 3PC) to synchronize nodes, although reliance on a centralized unit of control makes this approach susceptible</p>]]></description><link>https://dotclusters.io/paxos/</link><guid isPermaLink="false">643d7aeeaec01d04abdcc614</guid><dc:creator><![CDATA[Robert Sander]]></dc:creator><pubDate>Fri, 18 Feb 2022 16:59:00 GMT</pubDate><content:encoded><![CDATA[<p>In distributed systems, it is often essential to make sure that all nodes in a cluster share the same information about the system&#x2019;s state. We can use commitment protocols (e.g. 2PC, 3PC) to synchronize nodes, although reliance on a centralized unit of control makes this approach susceptible to failures. Moreover, commits can stall indefinitely when nodes become unresponsive. Paxos allows any node to act as the transaction coordinator. It can achieve progress even if nodes fail and messages get lost. 
The algorithm uses the Synod consensus at its core, with <code>acceptors</code>, <code>proposers</code> and <code>learners</code> communicating with each other. Below, we will take an incremental approach to build up the complete machinery of the algorithm.</p><h4 id="synod-algorithm">Synod Algorithm</h4><p>The Synod algorithm provides the core consensus mechanism in Paxos. It uses a voting-based procedure to ensure consistent application of changes to the state of the system. The proposer node submits change requests, which are committed only if a majority of acceptors agree to apply them. Once accepted, all nodes should be able to see the change. The voting procedure must enforce a set of restrictions because nodes and networks are susceptible to arbitrary failures. To illustrate the constraints imposed by the Synod algorithm, let us consider a scenario with a cluster of 6 nodes, where node <code>A</code> acts as a proposer and nodes <code>B-F</code> function as acceptors.</p><figure class="kg-card kg-image-card"><img src="https://dotclusters.io/content/images/2023/04/paxos_system-4.svg" class="kg-image" alt loading="lazy" width="551" height="361"></figure><h5 id="r1-unique-identifiers">R1 | Unique Identifiers</h5><p>The proposer can send a value to a set of acceptors, which can accept the submitted change. Once a large enough set of acceptors agrees on the proposed value, the system will update its state. Previous commits might be overwritten in this procedure, leading us to the first requirement imposed by the algorithm:</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-text"><strong>R1</strong> | Each proposal must have a unique identifier.</div></div><p>Using <code>(sequence_id, node_id)</code> tuples provides a distinct identifier for each proposed value. 
If two nodes submit proposals with the same <code>sequence_id</code>, then we can order them based on the <code>node_id</code>:</p><p>(13, Adam) &lt; (14, Benny) &lt; (14, Susan)</p><p>An alternative approach is to use slotted timestamps, with each node using its own pre-allocated range.</p><h5 id="r2-quorums-must-overlap">R2 | Quorums Must Overlap</h5><p>To explain the second rule enforced by the Synod algorithm, let us consider a proposal that needs a quorum of 2 votes to pass. The proposed value is committed if, for example, nodes <code>B</code> and <code>C</code> accept the proposal issued by <code>A</code>, even if nodes <code>D</code>, <code>E</code>, <code>F</code> are temporarily unavailable. Now, if nodes <code>A</code>, <code>B</code>, <code>C</code> suddenly fail, and <code>D</code>, <code>E</code>, <code>F</code> become available, it is possible that <code>D</code>, <code>E</code>, <code>F</code> never learn about the changes committed previously.</p><figure class="kg-card kg-image-card"><img src="https://dotclusters.io/content/images/2023/04/paxos_rule2_problem-15.svg" class="kg-image" alt loading="lazy" width="752" height="251"></figure><p>In order to prevent such scenarios, we must make sure that there is an overlap between nodes participating in any two voting procedures.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-text"><strong>R2</strong> | The quorums of any two votes must have at least one node in common.</div></div><p>In the example above, information can propagate if we require that at least one node from the initial vote is available during the second ballot:</p><figure class="kg-card kg-image-card"><img src="https://dotclusters.io/content/images/2023/04/paxos_rule2_solution.svg" class="kg-image" alt loading="lazy" width="751" height="251"></figure><p>We always satisfy the requirement imposed by rule <code>R2</code> if we require quorums to be formed by majorities only. 
Any two majorities must overlap in at least one node, so information committed in one ballot carries over to the next.</p><h5 id="r3-choice-is-permanent">R3 | Choice is Permanent</h5><p>With constraints <code>R1</code> and <code>R2</code> in place, we can order changes to the system&#x2019;s state, and make sure that information can spread between nodes. However, because information might spread at arbitrary rates, there is one more issue to consider. Let us suppose that node <code>A</code> has proposed a change <code>x=53</code>, which was accepted by a quorum of <code>C</code>, <code>D</code>, <code>E</code>, <code>F</code>. Now, assume that responses from acceptors get delayed, and proposer <code>B</code> issues another proposal <code>x=1</code> in the meantime. If nodes <code>C</code>, <code>D</code>, <code>E</code>, <code>F</code> all accept <code>B</code>&apos;s change, the value committed in the system might have already changed when <code>A</code> receives the acceptance corresponding to its vote.</p><figure class="kg-card kg-image-card"><img src="https://dotclusters.io/content/images/2023/04/paxos_rule3_problem-1.svg" class="kg-image" alt loading="lazy" width="751" height="202"></figure><p>Therefore, to make the system converge into a unified state, we must impose one more requirement:</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-text"><strong>R3</strong> | If a proposal with value <code>v</code> is chosen, then every higher-numbered proposal issued by any proposer has value <code>v</code>.</div></div><p>Because of rule <code>R2</code>, if a value was committed before, there will always be at least one node with that value in the current quorum. 
Considering the order requirement from rule <code>R1</code>, we can rephrase <code>R3</code>:</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-text"><strong>R3</strong> | For any <code>v</code> and <code>n</code>, if a proposal with value <code>v</code> and number <code>n</code> is issued, then there is a set <code>S</code> consisting of a majority of acceptors such that:<br>&#x2003;- no acceptor in <code>S</code> accepted any proposal numbered less than <code>n</code>, OR<br>&#x2003;- <code>v</code> is the value of the highest-numbered proposal among all proposals numbered less than <code>n</code>, accepted by an acceptor in <code>S</code>.</div></div><p>Applying constraint <code>R3</code> to our example, we can see the system converging into a consistent state:</p><figure class="kg-card kg-image-card"><img src="https://dotclusters.io/content/images/2023/04/paxos_rule3_solution-2.svg" class="kg-image" alt loading="lazy" width="751" height="291"></figure><p>Once a value is committed, the proposer must stick with it in future ballots. This is how consensus on a value <u>propagates</u> to other nodes, and no inconsistencies can arise.</p><!--kg-card-begin: markdown--><h5 id="proposer-algorithm">Proposer Algorithm</h5>
<p>Considering constraints <code>R1</code>-<code>R3</code> of the Synod algorithm, we can derive a description of the proposer node, with its operation divided into two stages:</p>
<ul>
<li>
<p><strong>Prepare Phase</strong> - a proposer chooses a new proposal number <code>n &gt; lastBallot</code>, updates <code>lastBallot=n</code>, and sends a request <code>Prepare(n)</code> to some set of acceptors, asking them to respond with <code>LastVoted(n&apos;, v, promise)</code> containing:</p>
<ul>
<li>
<p><code>n&apos;</code> - the proposal with the highest number less than <code>n</code> that the acceptor has accepted, if any.</p>
</li>
<li>
<p><code>v</code> - the value corresponding to proposal <code>n&apos;</code>.</p>
</li>
<li>
<p><code>promise</code> - a promise not to accept a proposal numbered less than <code>n</code>. This makes sure that <code>n&apos;</code> does not become invalid between the time the proposer receives the <code>LastVoted()</code> response and the time it issues a proposal to the acceptor.</p>
</li>
</ul>
<p>If the proposer cannot get the required responses, it will time out and can retry with another identifier.</p>
</li>
<li>
<p><strong>Accept Phase</strong> - if the proposer receives the requested responses from a majority of the acceptors during the prepare phase, then it can issue a proposal with number <code>n</code> and value <code>v</code>, where:</p>
<ul>
<li><code>v</code> is the value of the highest-numbered proposal among the responses <u>OR</u></li>
<li><code>v</code> is any value selected by the proposer if the responders reported no proposals</li>
</ul>
<p>The proposer writes the proposal into memory [important for failure recovery], and sends out an <code>Accept(n, v)</code> to all acceptors. When the proposer receives a majority of <code>Voted(n)</code> messages concerning a particular proposal, it records <code>v</code> in its memory. If the proposer times out while waiting for the <code>Voted(n)</code> messages, it can retry by issuing a proposal with a new identifier.</p>
</li>
</ul>
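The two phases above can be sketched in Python. This is a minimal, single-process illustration with names of my choosing (`Proposer`, `StubAcceptor`, `propose`); the stub acceptor always promises and accepts, since the acceptor-side checks are covered in the next section.

```python
class StubAcceptor:
    """Stand-in that always promises and accepts; the real acceptor
    rules (lastPrepare / lastAccept checks) are described separately."""
    def __init__(self):
        self.last_accept = None          # (n', v) of the accepted proposal
        self.last_prepare = None

    def prepare(self, n):                # responds with LastVoted(n', v)
        self.last_prepare = n
        return self.last_accept if self.last_accept else (None, None)

    def accept(self, n, v):              # responds with Voted(n)
        self.last_accept = (n, v)
        return True

class Proposer:
    def __init__(self, node_id, acceptors):
        self.node_id = node_id
        self.acceptors = acceptors
        self.last_ballot = 0

    def propose(self, value):
        # R1: (sequence_id, node_id) tuples are unique and totally ordered.
        self.last_ballot += 1
        n = (self.last_ballot, self.node_id)

        # Prepare phase: collect LastVoted(n', v) responses.
        replies = [a.prepare(n) for a in self.acceptors]
        promised = [r for r in replies if r is not None]
        if len(promised) <= len(self.acceptors) // 2:
            return None                  # no majority: retry with a new id

        # R3: adopt the value of the highest-numbered accepted proposal,
        # if any acceptor reported one.
        accepted = [r for r in promised if r[0] is not None]
        if accepted:
            value = max(accepted, key=lambda r: r[0])[1]

        # Accept phase: issue Accept(n, value) and count Voted(n) replies.
        votes = sum(1 for a in self.acceptors if a.accept(n, value))
        return value if votes > len(self.acceptors) // 2 else None
```

Note how a second proposer calling `propose(99)` after a successful `propose(7)` still returns `7`: the prepare phase forces it to adopt the already-accepted value, which is exactly rule `R3` at work.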
<h5 id="acceptor-algorithm">Acceptor Algorithm</h5>
<p>The acceptor can receive the <code>Prepare(n)</code> and <code>Accept(n, v)</code> requests from the proposer, and is free to respond to or ignore any of them without affecting consistency:</p>
<ul>
<li><strong>Prepare Phase</strong> - the acceptor can receive <code>Prepare(n)</code> with a sequence number <code>n</code> that falls into one of two ranges:
<ul>
<li>If <code>n</code> is greater than that of any <code>Accept()</code> it has already voted for [<code>n &gt; lastAccept</code>], the acceptor updates <code>lastPrepare = n</code> and responds with <code>LastVoted(n&apos;, v, promise)</code> which includes:
<ul>
<li><code>n&apos;</code> - the highest-numbered proposal it accepted <code>n&apos; = lastAccept</code>, or <code>null</code></li>
<li><code>v</code> - the value corresponding to <code>n&apos;</code></li>
<li><code>promise</code> - a promise not to accept any requests <code>&lt; n</code></li>
</ul>
</li>
<li>If <code>n</code> is smaller than <code>lastAccept</code> on the acceptor, then it should ignore the <code>Prepare()</code> request since it won&apos;t vote for a corresponding <code>Accept()</code> request anyway. To improve performance, it can inform the proposer about its state, so that the proposer can abandon the proposal early on.</li>
</ul>
</li>
<li><strong>Accept Phase</strong> - the acceptor can accept a proposal numbered <code>n</code> if it has not responded to a <code>Prepare()</code> with a higher identifier in the meantime. This can be ensured by checking that <code>n == lastPrepare</code> holds. If it decides to accept, it records the vote in its memory, updates <code>lastAccept = n</code>, and sends a <code>Voted(n)</code> message to the proposer and all learners.</li>
</ul>
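The acceptor's two checks can be captured in a short Python sketch. Names (`Acceptor`, `on_prepare`, `on_accept`) are illustrative; messages are modeled as method calls, with `None`/`False` standing in for ignored requests.

```python
class Acceptor:
    """Follows the rules above: respond to Prepare(n) only when n is
    higher than the last accepted ballot, and vote for Accept(n, v)
    only when n matches the last promised ballot."""
    def __init__(self):
        self.last_prepare = None       # highest Prepare(n) promised so far
        self.last_accept = None        # ballot of the last accepted proposal
        self.accepted_value = None     # value of that proposal

    def on_prepare(self, n):
        # Ignore stale ballots; the proposer will time out and retry.
        if self.last_accept is not None and n <= self.last_accept:
            return None
        self.last_prepare = n          # implicit promise: reject < n
        # LastVoted(n', v): highest accepted proposal below n, if any.
        return (self.last_accept, self.accepted_value)

    def on_accept(self, n, v):
        # Accept only if no higher Prepare() arrived in the meantime.
        if n != self.last_prepare:
            return False
        self.last_accept, self.accepted_value = n, v
        return True                    # Voted(n) to proposer and learners
```

Using `(sequence_id, node_id)` tuples as ballots, a `Prepare((2, "B"))` arriving after ballot `(1, "A")` was accepted gets back `((1, "A"), 53)`, and any later `Accept()` for the old ballot is rejected.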
<!--kg-card-end: markdown--><h5 id="learner-algorithm">Learner Algorithm</h5><p>After acceptors decide to vote for a proposal, they additionally send a <code>Voted(n)</code> message to all learner nodes. Learners know the number of acceptors in a system, so they can update their state depending on the number of votes. Due to message loss, there might be cases where learners do not discover a committed value. In these cases, they can prompt the proposer to issue a new proposal, which confirms that a value was committed.</p><blockquote><strong>Note</strong><br>Issuing a proposal for an uncommitted value commits the value if successful.<br><br>Issuing new proposals for an already committed value does not affect the state - consensus is always on the value, not on the proposal id.</blockquote><h5 id="complete-system">Complete System</h5><p>Until now, we have discussed constraints <code>R1</code>-<code>R3</code> and the function of <code>proposers</code>, <code>acceptors</code> and <code>learners</code>. As a next step, let us look at an example illustrating operation of the Synod consensus, involving proposers <code>A</code>, <code>B</code>, acceptors <code>C</code>, <code>D</code>, <code>E</code>, and a learner <code>F</code>.</p><figure class="kg-card kg-image-card"><img src="https://dotclusters.io/content/images/2023/04/paxos_complete_system-1.svg" class="kg-image" alt loading="lazy" width="751" height="331"></figure><ol><li>Node <code>A</code> proposes a value of <code>x=7</code>, which times out because it is accepted only by <code>C</code>, with acceptor <code>D</code> down, and <code>E</code> unreachable.</li><li><code>B</code> proposes a value of <code>x=55</code>, but it also times out since only acceptor <code>E</code> is available.</li><li>Proposer <code>A</code> learns from acceptors <code>C</code> and <code>E</code> that the latest value any of them voted for is <code>x=55</code>, which becomes the only value that <code>A</code> can propose now [see <code>R3</code>]. 
A proposal for <code>x=55</code> is issued by <code>A</code>, and is accepted by both <code>C</code> and <code>E</code>. This is picked up by the learner, which can update its internal state.</li><li>Proposer <code>B</code> learns that acceptor <code>D</code>, which just became available, has not voted for anything yet. This does not have much influence on the behavior of the system, since <code>C</code> and <code>E</code> already voted for <code>x=55</code>. Therefore, <code>B</code> can only issue a proposal for <code>x=55</code>, which is accepted by all nodes.</li></ol><blockquote>A Paxos node can take on multiple roles, acting as <code>proposer</code>, <code>learner</code> and/or <code>acceptor</code> all at once.</blockquote><h5 id="consistency">Consistency</h5><p>Message loss cannot corrupt the state of the system, since it can only prevent an action from happening. Receiving multiple copies of a message can cause an action to be repeated. This also has no effect on the consistency, as sending several <code>Voted(n)</code> messages has the same effect as sending just one. The only way for consistency to be affected is if a proposer has to resend the same change under new ballots because of some failure on its side. However, the proposer prevents this issue by storing the information about <code>lastBallot</code> in memory, so it can recover to its original state.</p><h5 id="progress">Progress</h5><p>The algorithm discussed up to this point ensures consistency of the system&#x2019;s state. However, we can construct a scenario where two proposers repeatedly send proposals with increasing identifiers. 
None of those proposals is ever chosen, because acceptors will drop any proposal with an identifier lower than the last one received [see <code>R3</code>].</p><figure class="kg-card kg-image-card"><img src="https://dotclusters.io/content/images/2023/04/paxos_progress.svg" class="kg-image" alt loading="lazy" width="764" height="341"></figure><p>To guarantee progress, we need to ensure that contention cannot occur. One potential solution is to designate a distinguished node as the only proposer in the system. If the distinguished proposer can communicate with most of the acceptors, it will eventually be able to issue a proposal with an identifier high enough to be accepted. To pinpoint which node should act as the proposer, a leader election algorithm can be used. In systems where multiple proposers are needed, we can limit contention using back-off techniques.</p><h4 id="from-synod-to-paxos">From Synod to Paxos</h4><p>A single run of the Synod algorithm aims to reach consensus on a single value <code>v</code>. Once nodes agree on a certain proposal, the system cannot progress to another value. Paxos generalizes that approach to an arbitrary number of values by running a sequence of separate instances of the Synod algorithm, one for each value <code>v</code><sub>k</sub>. 
To illustrate the operation of the full <code>Paxos</code> algorithm, we start with a system in the following state:</p><figure class="kg-card kg-image-card"><img src="https://dotclusters.io/content/images/2023/04/paxos_system_full-1.svg" class="kg-image" alt loading="lazy" width="871" height="421"></figure><blockquote><strong>Note</strong><br>The entry <code>Ek</code> is not the same as the identifier <code>IDk</code>.<br>&#x2003;- <code>Ek</code> corresponds to the k&apos;th change applied to the state of the system<br>&#x2003;- <code>Vk</code> / <code>IDk</code> are the latest value and id corresponding to the execution of <code>Ek</code>&apos;s Synod consensus</blockquote><p>Let&apos;s suppose node <code>A</code> issues proposals for entries <code>E32</code>, <code>E33</code>, <code>E34</code>, and <code>E35</code> without waiting for them to complete. Because of several failures, it might happen that <code>E32</code> and <code>E35</code> reach consensus, while <code>E33</code> and <code>E34</code> time out.</p><figure class="kg-card kg-image-card"><img src="https://dotclusters.io/content/images/2023/04/paxos_system_full_timing-7.svg" class="kg-image" alt loading="lazy" width="551" height="361"></figure><p>Now, let us imagine that node <code>A</code> fails and node <code>B</code> is elected as the new proposer. In a distributed state machine, the operation related to the state at entry <code>Ek+1</code> might depend on all previous entries. We must therefore make sure that all entries up to <code>Ek</code> reached consensus before the proposer can move on to process <code>Ek+1</code>. Therefore, <code>B</code> must synchronize its state up to the last entry seen by the system. To do so, it finds the first entry for which it does not know the status, i.e. <code>E32</code>. To determine the state of the whole system, <code>B</code> executes the prepare phase of the Synod algorithm for all uncertain entries from <code>E32</code> onwards. 
Based on the information it receives from acceptors, node <code>B</code> concludes that:</p><ul><li>Entry <code>E32</code> was successfully committed, so it must be introduced into <code>B</code>&apos;s ledger.</li><li>For entries <code>E33</code> and <code>E34</code> there is no value committed in the system. Because there are committed entries after <code>E34</code>, node <code>B</code> identifies <code>E33</code> and <code>E34</code> as <u>gaps</u> and fills those positions with a dummy value.</li><li>Entry <code>E35</code> was successfully committed and is already available in <code>B</code>&apos;s memory.</li><li>No values were committed for entries from <code>E36</code> onwards, which is where <code>B</code> can start processing new requests.</li></ul><p>After <code>B</code> has determined the state of the system, it executes the accept phase for entries <code>E32</code>-<code>E35</code>. This brings the system into the following state:</p><figure class="kg-card kg-image-card"><img src="https://dotclusters.io/content/images/2023/04/paxos_system_full_timing_final.svg" class="kg-image" alt loading="lazy" width="552" height="362"></figure><p>From this point onward, proposer <code>B</code> can issue proposals by executing the accept phase of the Synod algorithm beginning at <code>E36</code>. Since it is the only proposer in the system, it can be sure that execution of the prepare phase is redundant at this stage.</p><blockquote><strong>Optimization</strong><br>A new proposer must query acceptors about the status of all unknown entries. To improve the performance of the system, we could implement a dedicated command for this task. If acceptors can report the state of a range of entries at a time, we reduce the number of needed queries.</blockquote><h5 id="addingremoving-nodes">Adding/Removing Nodes</h5><p>When a node joins a cluster, it can initialize its state based on a state dump fetched from centralized storage. 
It becomes a learner afterwards, synchronizing its state similarly to a newly elected proposer. Removing nodes is trivial and can be treated as a node failure.</p><hr><p><em>[1] Lamport, Leslie. &#x201C;The part-time parliament.&#x201D; ACM Trans. Comput. Syst. 16 (1998): 133-169.</em><br><em>[2] Lamport, Leslie. &#x201C;Paxos Made Simple.&#x201D; (2001).</em></p>]]></content:encoded></item><item><title><![CDATA[A Survey on Common MPSoC Simulators]]></title><description><![CDATA[<p>The increasing complexity of computer architectures can rarely be modeled in an analytical way. Therefore, computer simulation becomes the only viable tool to support design-space exploration, trade-off analysis and performance forecasting during the design process. Such simulations can be very time-consuming, running for weeks or even months to complete. With the</p>]]></description><link>https://dotclusters.io/a-survey-on-common-mpsoc-simulators/</link><guid isPermaLink="false">643ec597aec01d04abdcc63b</guid><dc:creator><![CDATA[Robert Sander]]></dc:creator><pubDate>Mon, 02 Mar 2020 16:33:00 GMT</pubDate><content:encoded><![CDATA[<p>The increasing complexity of computer architectures can rarely be modeled in an analytical way. Therefore, computer simulation becomes the only viable tool to support design-space exploration, trade-off analysis and performance forecasting during the design process. Such simulations can be very time-consuming, running for weeks or even months to complete. With the introduction of multi-core/many-core systems the situation only got worse: the rising complexity of the simulated target did not go hand in hand with the performance of the host where the process was executed. In order to reduce the time required to perform an accurate and detailed simulation, several approaches were developed over the years.</p><h4 id="simulator-classification">Simulator Classification</h4><p>On the highest level, we can classify simulators based on their scope, level of detail and input type. 
Looking at the classification based on <strong><u>scope</u></strong> first, two subgroups can be listed:</p><ul><li><strong>Full-System Simulators </strong>can boot a complete operating system. They support interrupts, IOs, privileged modes and peripheral components. This yields a very detailed representation of the whole software-stack used in modern computers. The provided level of detail does not come without cost however. Full-system simulators are often slow and require device-timing models and large disk-images. </li><li><strong>Application-Level Simulators </strong>are much simpler and more lightweight compared to full-system ones. They are faster, because workloads are run directly on the simulator without an operating-system in between. If a system-call is performed by an application running on such a tool, it is redirected to the underlying OS for handling. This makes application-level simulators not the best choice for workloads that perform many system-calls during their execution. </li></ul><!--kg-card-begin: markdown--><p>The second classification is based on the level of detail obtained by different platforms. We can enumerate 3 major groups here:</p>
<ul>
<li><strong>Functional Simulators</strong> model capabilities of target architectures, and have an emulator-like behavior. They don&#x2019;t focus on all micro-architectural details, and do not keep track of the timing for individual operations. Functional simulators trade off detail for high simulation speed.</li>
<li><strong>Timing Simulators</strong> analyze the whole microarchitecture of a processor in a very detailed manner. Timing, IPC, cache performance and transactions on the interconnect are reported to the user as a result. Depending on the strategy used by a given tool, three subtypes can be pointed out:
<ul>
<li><strong>Cycle-level</strong> simulators operate at the resolution of the clock cycle. Although very accurate, this type of simulation is extremely slow and requires lots of memory.</li>
<li><strong>Event-driven</strong> approaches are based on events instead of cycles. This makes it possible to avoid simulating periods where no actions are performed in the processor.</li>
<li><strong>Interval-based</strong> simulation employs a strategy where execution is broken down by miss-events in the instruction flow. A simulation of those events is performed first. After that, timing of the resulting intervals is obtained from specialized analytical models. The result is a balance between accuracy and speed.</li>
</ul>
</li>
<li><strong>Timing-Functional Simulators</strong> integrate the previous approaches to improve flexibility and accuracy simultaneously by fully decoupling the timing and functional models. Code instrumentation with Dynamic Binary Translation (DBT) makes it possible to delegate most of the work on the functional part to the simulation host. The produced output is redirected to the simulation target, where it can interact with the timing model.</li>
</ul>
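To make the event-driven strategy above concrete, here is a minimal discrete-event core in Python. This is an illustrative sketch, not code from any of the surveyed tools, and the names (`Simulator`, `schedule`, `run`) are mine. The simulator keeps a priority queue of timestamped callbacks and jumps directly from one event to the next, never stepping through idle cycles.

```python
import heapq

class Simulator:
    """Minimal discrete-event core: time advances event-to-event,
    not cycle-by-cycle."""
    def __init__(self):
        self.now = 0
        self._queue = []       # heap of (timestamp, seq, callback)
        self._seq = 0          # tie-breaker for events at the same time

    def schedule(self, delay, callback):
        # Register a callback to fire `delay` time units from now.
        heapq.heappush(self._queue, (self.now + delay, self._seq, callback))
        self._seq += 1

    def run(self):
        # Pop events in timestamp order; each callback may schedule more.
        while self._queue:
            self.now, _, callback = heapq.heappop(self._queue)
            callback(self)
```

Scheduling a 3-cycle ALU operation and a 100-cycle memory miss, `run()` processes exactly two events and advances `now` straight from 3 to 100; the 96 idle cycles in between are never simulated, which is the source of the speedup over cycle-level approaches.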
<!--kg-card-end: markdown--><p>The final categorization can be achieved based on the <strong><u>input method</u></strong>: </p><ul><li><strong>Trace-Driven Simulators</strong> make use of trace-files, which are streams of instructions pre-recorded during execution of a benchmark. Trace-files consume a lot of disk space and are unable to model speculative execution and runtime environment changes. Because of their static nature, results provided by trace-driven simulators are not accurate enough to be used during the analysis of parallel and timing-dependent applications. Simulators in this group are fast, which makes them a good fit for initial design-space exploration. </li><li><strong>Execution-Driven Simulators</strong> use executables, which are run on the simulation target directly. They don&#x2019;t require as much disk space as trace-driven simulators, but take a lot longer to terminate. They produce results with much higher accuracy, which makes them a good fit for most applications. </li></ul><h4 id="acceleration-techniques">Acceleration Techniques</h4><p>A rising gap between the performance of hosts executing the simulation and the complexity of the simulated targets could be observed over the years. Combined with the rising requirements imposed by benchmarks, a need for accelerated simulation emerged:</p><ul><li><strong>Sampled simulation</strong> is performed on a fraction of instructions being representative for the whole benchmark. Instructions are mainly selected by statistical or targeted sampling. During <u>statistical sampling</u>, parts of the benchmark are picked from the instruction stream at random. <u>Targeted sampling</u> employs an initial analysis, which has an influence on the selection process performed afterwards. The calculation of the architectural starting-state between samples is a major challenge of sampled simulations. 
If points of interest are far apart, then it can be computationally expensive to fast-forward between the associated states.</li><li><strong>Statistical simulation</strong> first collects characteristics of the executed benchmarks. The obtained features are used to create instruction traces, which are needed for a subsequent simulation. Although synthetic traces are small in size, they are not accurate enough to make this approach relevant beyond design-space exploration.</li><li><strong>Parallel simulation</strong> takes advantage of modern multi-core architectures. It splits workloads into multiple threads, which can improve the processing speed significantly. One of the main challenges associated with parallel simulation is the synchronization of data structures between jobs. Trading off accuracy for speed, synchronization is often performed only after multiple intervals.</li><li><strong>Fast-forwarding</strong> provides the ability to move the simulation state between calculated points of interest. Time can be saved by skipping the unimportant parts of the simulated code.</li></ul><h3 id="mpsoc-simulators">MPSoC Simulators</h3><p>In the following, we take a closer look at six commonly used computer architecture simulators.</p><figure class="kg-card kg-image-card"><img src="https://dotclusters.io/content/images/2023/04/Screenshot-from-2023-04-24-15-16-05-1.png" class="kg-image" alt loading="lazy" width="474" height="183"></figure><h5 id="a-gem5">A. Gem5</h5><p>Gem5 is a modular full-system, application-level simulator with the ability to model multiple instruction sets: ARM, x86, MIPS, SPARC, ALPHA, PPC, RISC-V. It supports in-order and out-of-order pipelines and can be run on Linux, macOS, Solaris and OpenBSD. The tool can keep track of events on a cycle-to-cycle basis, which is exploited within the multiple CPU models offered by the platform. Gem5 is highly extensible and allows users to specify custom coherence protocols and interconnection networks. 
It also supports the widest range of IO devices.</p><h5 id="b-marssx86">B. MARSSx86</h5><p>MARSSx86 is a cycle-level, full-system simulator. It supports in-order and out-of-order pipelines with detailed configurations. The x86 instruction set is the only ISA supported by the platform. It includes models of single-core and multicore CPUs with detailed configurations for caches and interconnect. In addition, multi-threaded workloads can be used with the tool as well. Similarly to the other platforms presented here, MARSSx86 can use fast-forwarding to accelerate simulation.</p><h5 id="c-graphite-sniper">C. Graphite / Sniper</h5><p>Graphite is an application-level, parallel simulator, which can distribute workloads to clusters of servers. It uses integrated timing-functional simulation with timing and functional models decoupled using DBT. The tool supports simulations with hundreds of cores and provides the infrastructure on top of which Sniper was built. The major improvements added in Sniper compared to its predecessor are interval-based simulation, out-of-order pipelines and support for multilevel, shared cache hierarchies with first-level write-back. It can also work with RISC-V targets, in addition to the x86 support already found in Graphite. Due to the communication overhead added by the shared cache, Sniper is usually slower than its predecessor.</p><h5 id="d-multi2sim">D. Multi2Sim</h5><p>Multi2Sim is an application-level simulator with support for the x86, ARM, AMD Evergreen, Nvidia Fermi and MIPS32 instruction sets. The simulator integrates a functional-block and a timing-based module. The functional part is used to ensure correctness of execution paths taken in the instruction flow. The timing module is based on analytical models which drive the execution. Multi2Sim can be used to model out-of-order cores, but has no support for IO pipelines. CPU-GPU simulations are the main target of this platform.</p><h5 id="e-zsim">E. 
ZSim</h5><p>ZSim is an application-level, timing-functional simulator with support for the x86 ISA. Its focus is fast simulation of heterogeneous, many-core systems with up to 1000 cores. Architectures are simulated using a two-phase parallelization algorithm that can scale simulations significantly, without much overhead or loss of precision. Dynamic Binary Translation ensures that most work on the timing model is done once, before execution is started. A lightweight virtualization layer provided by the simulator enables the execution of complex workloads that require functionalities offered by an operating system. ZSim is reported to be 3x faster than Sniper and about 4x faster than MARSSx86 or Gem5.</p><figure class="kg-card kg-image-card"><img src="https://dotclusters.io/content/images/2023/04/table-1.png" class="kg-image" alt loading="lazy" width="1542" height="621" srcset="https://dotclusters.io/content/images/size/w600/2023/04/table-1.png 600w, https://dotclusters.io/content/images/size/w1000/2023/04/table-1.png 1000w, https://dotclusters.io/content/images/2023/04/table-1.png 1542w" sizes="(min-width: 720px) 720px"></figure><hr><p><em>[1] Carlson, Trevor E. et al. &#x201C;Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation.&#x201D;<br>[2] Miller, Jason E. et al. &#x201C;Graphite: A distributed parallel simulator for multicores.&#x201D;<br>[3] S&#xE1;nchez, Daniel and Christoforos E. Kozyrakis. &#x201C;ZSim: fast and accurate microarchitectural simulation of thousand-core systems.&#x201D;<br>[4] Binkert, Nathan L. et al. &#x201C;The gem5 simulator.&#x201D;<br>[5] Patel, Avadh et al. &#x201C;MARSS: A full system simulator for multicore x86 CPUs.&#x201D;<br>[6] Ubal, Rafael et al. &#x201C;Multi2Sim: A simulation framework for CPU-GPU computing.&#x201D;</em></p>]]></content:encoded></item></channel></rss>