A Survey on Common MPSoC Simulators

2 Mar 2020

The increasing complexity of computer architectures can rarely be modeled analytically. Computer simulation therefore becomes the only viable tool to support design-space exploration, trade-off analysis and performance forecasting during the design process. Such simulations can be very time-consuming, running for weeks or even months. With the introduction of multi-core and many-core systems the situation only got worse: the rising complexity of the simulated target was not matched by the performance of the host executing the simulation. Several approaches were developed over the years to reduce the time required for an accurate and detailed simulation.

Simulator Classification

On the highest level, we can classify simulators based on their scope, level of detail and input type. Looking at the classification based on scope first, two subgroups can be listed:

Full-System Simulators can boot a complete operating system. They support interrupts, I/O, privileged modes and peripheral components. This yields a very detailed representation of the whole software stack used in modern computers. This level of detail comes at a cost, however: full-system simulators are often slow and require device-timing models and large disk images.
Application-Level Simulators are much simpler and more lightweight than full-system ones. They are faster because workloads run directly on the simulator without an operating system in between. If an application running on such a tool performs a system call, the call is redirected to the operating system of the simulation host for handling. This makes application-level simulators a poor choice for workloads that perform many system calls during their execution.
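
The system-call redirection described above can be sketched with a toy interpreter. This is only an illustration under stated assumptions: the accumulator machine, its opcodes, the function `run_app_level` and the `HOST_SYSCALLS` table are all hypothetical and do not correspond to any of the simulators discussed later.

```python
import os

def run_app_level(program, host_syscalls):
    """Interpret a toy program and forward every 'syscall' to the host.

    `program` is a list of (opcode, operand) pairs for a hypothetical
    accumulator machine. Only user-level instructions are simulated.
    """
    acc = 0
    forwarded = 0
    for opcode, operand in program:
        if opcode == "load":
            acc = operand
        elif opcode == "add":
            acc += operand
        elif opcode == "syscall":
            # There is no simulated kernel: the host services the call,
            # so the work done inside the OS is invisible to the model.
            acc = host_syscalls[operand]()
            forwarded += 1
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return acc, forwarded

# Illustrative subset of host services exposed to the simulated program.
HOST_SYSCALLS = {"getpid": os.getpid}

program = [("load", 1), ("add", 2), ("syscall", "getpid"), ("add", 0)]
result, forwarded = run_app_level(program, HOST_SYSCALLS)
print(result == os.getpid(), forwarded)
```

Because the forwarded call bypasses the simulated machine entirely, a workload dominated by such calls would spend most of its time in code the simulator never models.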

The second classification is based on the level of detail obtained by different platforms. We can enumerate 3 major groups here:

Functional Simulators model the capabilities of target architectures and behave much like an emulator. They don’t model every micro-architectural detail and do not keep track of the timing of individual operations. Functional simulators trade a low level of detail for high simulation speed.
Timing Simulators analyze the whole microarchitecture of a processor in a very detailed manner. Timing, IPC, cache performance and transactions on the interconnect are reported to the user as a result. Depending on the strategy used by a given tool, three subtypes can be pointed out:
- Cycle-level simulators operate at the resolution of the clock cycle. Although very accurate, this type of simulation is extremely slow and requires lots of memory.
- Event-driven approaches are based on events instead of cycles. This makes it possible to avoid simulating periods where no actions are performed in the processor.
- Interval-based simulation breaks execution down into intervals delimited by miss events in the instruction flow, such as cache misses and branch mispredictions. Those events are simulated first. The timing of the resulting intervals is then obtained from specialized analytical models. The result is a balance between accuracy and speed.
Timing-Functional Simulators integrate the previous approaches to improve on flexibility and accuracy simultaneously, by fully decoupling the timing and functional models. Code instrumentation with Dynamic Binary Translation (DBT) makes it possible to delegate most work on the functional part to the simulation host. The produced output is redirected to the simulation target, where it can interact with the timing model.
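
The difference between the cycle-level and event-driven strategies can be illustrated with a small sketch. The event list, the labels and both function names are made up for this example; the point is only that both loops handle the same events, while the event-driven one never visits the idle cycles in between.

```python
import heapq

def cycle_level(events, end_cycle):
    """Advance one cycle at a time, checking for work on every cycle."""
    pending = dict(events)  # cycle -> label (assumes one event per cycle)
    steps, handled = 0, []
    for cycle in range(end_cycle + 1):
        steps += 1
        if cycle in pending:
            handled.append((cycle, pending[cycle]))
    return handled, steps

def event_driven(events):
    """Jump directly from one scheduled event to the next."""
    queue = list(events)
    heapq.heapify(queue)  # earliest cycle is always popped first
    steps, handled = 0, []
    while queue:
        cycle, label = heapq.heappop(queue)
        steps += 1
        handled.append((cycle, label))
    return handled, steps

# Hypothetical timeline: a cache miss leaves the core idle for 200 cycles.
events = [(0, "fetch"), (3, "l1_miss"), (203, "dram_reply"), (204, "commit")]

handled_cycle, steps_cycle = cycle_level(events, end_cycle=204)
handled_event, steps_event = event_driven(events)
print(steps_cycle, steps_event)  # 205 4
```

A real event-driven simulator would also let handlers schedule new events while the queue is being drained; that is omitted here to keep the comparison short.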

The final categorization can be achieved based on the input method:

Trace-Driven Simulators make use of trace-files, which are streams of instructions pre-recorded during execution of a benchmark. Trace-files consume a lot of disk space and are unable to model speculative execution and runtime environment changes. Because of their static nature, results provided by trace-driven simulators are not accurate enough to be used during the analysis of parallel and timing-dependent applications. Simulators in this group are fast, which makes them a good fit for initial design-space exploration.
Execution-Driven Simulators use executables, which are run on the simulation target directly. They don’t require as much disk space as trace-driven simulators, but take much longer to complete. They produce results with much higher accuracy, which makes them a good fit for most applications.
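
As a minimal sketch of the trace-driven approach, the following replays a pre-recorded stream of memory addresses through a direct-mapped cache model. The trace, the cache geometry and the function name `simulate_direct_mapped` are assumptions chosen for illustration. Note that the trace is fixed: nothing the model does can change which addresses come next, which is the static nature mentioned above.

```python
def simulate_direct_mapped(trace, num_lines=4, line_size=16):
    """Replay an address trace through a direct-mapped cache model."""
    tags = [None] * num_lines  # one stored tag per cache line
    hits = misses = 0
    for address in trace:
        block = address // line_size
        index = block % num_lines
        tag = block // num_lines
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag  # fill the line, evicting the old block
    return hits, misses

# Hypothetical trace; 0x00 and 0x40 map to the same line and conflict.
trace = [0x00, 0x04, 0x10, 0x00, 0x40, 0x00]
print(simulate_direct_mapped(trace))  # (2, 4)
```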

Acceleration Techniques

A rising gap between the performance of hosts executing the simulation and the complexity of the simulated targets could be observed over the years. Combined with the rising requirements imposed by benchmarks, a need for accelerated simulation emerged:

Sampled simulation is performed on a fraction of the instructions that is representative of the whole benchmark. Instructions are mainly selected by statistical or targeted sampling. During statistical sampling, parts of the benchmark are picked from the instruction stream at random. Targeted sampling employs an initial analysis, which influences the selection process performed afterwards. Calculating the architectural starting state between samples is a major challenge of sampled simulation. If points of interest are far apart, it can be computationally expensive to fast-forward between the associated states.
Statistical simulation first collects characteristics of the executed benchmarks. The obtained features are used to create synthetic instruction traces, which drive a subsequent simulation. Although synthetic traces are small in size, they are not accurate enough to make this approach relevant beyond design-space exploration.
Parallel simulation takes advantage of modern multi-core architectures. It splits workloads into multiple threads, which can improve the processing speed significantly. One of the main challenges associated with parallel simulation is the synchronization of data-structures between jobs. Trading-off accuracy for speed, synchronization is often performed after multiple intervals.
Fast-forwarding provides the ability to move the simulation state between calculated points of interest. Time can be saved by skipping the unimportant parts of the simulated code.
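
Statistical sampling can be sketched as follows. The per-interval CPI values are invented, and looking them up in a list stands in for running a detailed simulation of that interval; the function name `sampled_cpi` is likewise hypothetical. The sketch ignores the warm-up of architectural state between samples, which is the hard part in practice.

```python
import random

def sampled_cpi(interval_cpi, sample_size, seed=0):
    """Estimate the mean CPI from a random subset of intervals."""
    rng = random.Random(seed)  # fixed seed keeps the run repeatable
    chosen = rng.sample(range(len(interval_cpi)), sample_size)
    # Only the chosen intervals would be simulated in detail; the rest
    # would be skipped by fast-forwarding.
    detailed = [interval_cpi[i] for i in chosen]
    return sum(detailed) / sample_size, sorted(chosen)

# Hypothetical benchmark with two phases of different behavior.
interval_cpi = [1.0] * 80 + [3.0] * 20
true_cpi = sum(interval_cpi) / len(interval_cpi)

estimate, chosen = sampled_cpi(interval_cpi, sample_size=10)
print(f"true CPI {true_cpi:.2f}, estimate {estimate:.2f} "
      f"from {len(chosen)} of {len(interval_cpi)} intervals")
```

With a benchmark that has distinct phases like this one, a small random sample can easily over- or under-represent a phase, which is one motivation for targeted sampling.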

MPSoC Simulators

In the following, we take a closer look at six commonly used computer architecture simulators.

A. Gem5

Gem5 is a modular simulator that offers both full-system and application-level modes and can model multiple instruction sets: ARM, x86, MIPS, SPARC, Alpha, PowerPC and RISC-V. It supports in-order and out-of-order pipelines and can be run on Linux, macOS, Solaris and OpenBSD. The tool can keep track of events on a cycle-by-cycle basis, which is exploited within the multiple CPU models offered by the platform. Gem5 is highly extensible and allows users to specify custom coherence protocols and interconnection networks. It also supports the widest range of I/O devices.

B. MARSSx86

MARSSx86 is a cycle-level, full-system simulator. It supports in-order and out-of-order pipelines with detailed configurations. The x86 instruction set is the only ISA supported by the platform. It includes models of single-core and multicore CPUs with detailed configurations for caches and interconnect. Multi-threaded workloads can be used with the tool as well. Like other platforms presented in this survey, MARSSx86 can use fast-forwarding to accelerate simulation.

C. Graphite / Sniper

Graphite is an application-level, parallel simulator that can distribute workloads across clusters of servers. It uses integrated timing-functional simulation, with the timing and functional models decoupled using DBT. The tool supports simulations with hundreds of cores and provides the infrastructure on top of which Sniper was built. The major improvements Sniper adds over its predecessor are interval-based simulation, out-of-order pipelines and support for multilevel, shared cache hierarchies with first-level write-back. Sniper can also work with RISC-V targets, in addition to the x86 support already found in Graphite. Due to the communication overhead added by the shared cache, Sniper is usually slower than its predecessor.
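
The idea behind interval-based simulation can be sketched with a simplified first-order model: instructions flow at the dispatch width until a miss event interrupts them, and each miss event adds a fixed penalty. This is not Sniper's actual implementation; the function name `interval_cycles`, the event counts and the penalty values are assumptions made up for the example.

```python
def interval_cycles(num_instructions, dispatch_width, miss_events, penalties):
    """First-order interval model: steady-state flow plus miss penalties."""
    # Cycles needed if no miss event ever interrupted the instruction flow.
    base = num_instructions / dispatch_width
    # Each miss event ends an interval and stalls the core for its penalty.
    stall = sum(penalties[kind] * count for kind, count in miss_events.items())
    return base + stall

# Hypothetical counts from functional simulation and assumed penalties.
miss_events = {"branch_mispredict": 5, "l2_miss": 2}
penalties = {"branch_mispredict": 10, "l2_miss": 200}

cycles = interval_cycles(1000, 4, miss_events, penalties)
print(cycles)  # 700.0
```

Only the miss events need to be found by simulation; everything between them is covered by the analytical term, which is where the speed advantage over a cycle-level model comes from.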

D. Multi2Sim

Multi2Sim is an application-level simulator with support for the x86, ARM, AMD Evergreen, Nvidia Fermi and MIPS32 instruction sets. The simulator integrates a functional block and a timing-based module. The functional part is used to ensure correctness of the execution paths taken in the instruction flow. The timing module is based on analytical models which drive the execution. Multi2Sim can be used to model out-of-order cores, but has no support for in-order pipelines. CPU-GPU simulations are the main target of this platform.

E. ZSim

ZSim is an application-level, timing-functional simulator with support for the x86 ISA. Its focus is the fast simulation of heterogeneous many-core systems with up to 1000 cores. Architectures are simulated using a two-phase parallelization algorithm that scales the simulation significantly, without much overhead or loss of precision. Dynamic Binary Translation ensures that most work on the timing model is done once, before execution is started. A lightweight virtualization layer provided by the simulator enables the execution of complex workloads that require functionality offered by an operating system. ZSim is reported to be 3x faster than Sniper and about 4x faster than MARSSx86 or Gem5.

[1] Carlson, Trevor E. et al. “Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation.”
[2] Miller, Jason E. et al. “Graphite: A distributed parallel simulator for multicores.”
[3] Sánchez, Daniel, Christoforos E. Kozyrakis. “ZSim: fast and accurate microarchitectural simulation of thousand-core systems.”
[4] Binkert, Nathan L. et al. “The gem5 simulator.”
[5] Patel, Avadh et al. “MARSS: A full system simulator for multicore x86 CPUs.”
[6] Ubal, Rafael et al. “Multi2Sim: A simulation framework for CPU-GPU computing.”