Manticore: Hardware-Accelerated RTL Simulation with Static Bulk-Synchronous Parallelism

01/23/2023
by Mahyar Emami, et al.

The demise of Moore's Law and Dennard Scaling has revived interest in specialized computer architectures and accelerators. Verification and testing of this hardware depend heavily upon cycle-accurate simulation of register-transfer-level (RTL) designs. The fastest software RTL simulators can simulate designs at 1–1000 kHz, i.e., more than three orders of magnitude slower than hardware. Improved simulators can increase designers' productivity by speeding design iterations and permitting more exhaustive exploration. One possibility is to exploit low-level parallelism, as RTL expresses considerable fine-grain concurrency. Unfortunately, state-of-the-art RTL simulators often perform best on a single core since modern processors cannot effectively exploit fine-grain parallelism. This work presents Manticore: a parallel computer designed to accelerate RTL simulation. Manticore uses a static bulk-synchronous parallel (BSP) execution model to eliminate fine-grain synchronization overhead. It relies entirely on a compiler to schedule resources and communication, which is feasible since RTL code contains few divergent execution paths. With static scheduling, communication and synchronization no longer incur runtime overhead, making fine-grain parallelism practical. Moreover, static scheduling dramatically simplifies processor implementation, significantly increasing the number of cores that fit on a chip. Our 225-core FPGA implementation running at 475 MHz outperforms a state-of-the-art RTL simulator running on desktop and server computers in 8 out of 9 benchmarks.
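
To make the static BSP idea concrete, here is a minimal Python sketch of one way such an execution model can be pictured. It is an illustration under stated assumptions, not Manticore's actual compiler output or instruction set: the two-register toy design, the `PROGRAMS` table, and the functions `reference` and `bsp_simulate` are all hypothetical names invented for this example. Each "core" owns one register, computes its next value from a private copy of the state, and then forwards the result along a send list that is fixed before simulation starts, so nothing about communication is decided at run time.

```python
# Toy static BSP simulation of a two-register design.
# Hypothetical example: the design, names, and schedule format are assumptions.
# Design (one RTL cycle): a' = (a + b) & 0xFF ; b' = a ^ 0x5A

MASK = 0xFF


def reference(a, b, cycles):
    """Sequential golden model: both registers update simultaneously."""
    for _ in range(cycles):
        a, b = (a + b) & MASK, a ^ 0x5A
    return a, b


# "Compile-time" schedule: each core owns one register, has a fixed compute
# function, and a fixed list of destination cores for its result.
PROGRAMS = {
    0: {"owns": "a", "compute": lambda s: (s["a"] + s["b"]) & MASK, "sends": [1]},
    1: {"owns": "b", "compute": lambda s: s["a"] ^ 0x5A, "sends": [0]},
}


def bsp_simulate(a, b, cycles):
    """Simulate the design as compute and communicate supersteps."""
    # Every core keeps a private copy of the registers it reads.
    local = {core: {"a": a, "b": b} for core in PROGRAMS}
    for _ in range(cycles):
        # Compute phase: each core reads only its own private copy.
        results = {
            core: prog["compute"](local[core]) for core, prog in PROGRAMS.items()
        }
        # Communication phase: deliver results along the fixed send lists.
        for core, prog in PROGRAMS.items():
            reg = prog["owns"]
            local[core][reg] = results[core]
            for dst in prog["sends"]:
                local[dst][reg] = results[core]
        # Reaching the end of the loop body plays the role of the barrier:
        # the next superstep starts only after all messages are delivered.
    return local[0]["a"], local[1]["b"]


if __name__ == "__main__":
    print(bsp_simulate(1, 2, 10) == reference(1, 2, 10))
```

Because the send lists never change, the only synchronization point in this sketch is the boundary between supersteps, which mirrors the abstract's claim that static scheduling removes run-time communication and synchronization overhead.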
