The ExaNeSt Prototype: Evaluation of Efficient HPC Communication Hardware in an ARM-based Multi-FPGA Rack

by Manolis Ploumidis, et al.

We present and evaluate the ExaNeSt Prototype, a liquid-cooled rack prototype consisting of 256 Xilinx ZU9EG MPSoCs, 4 TBytes of DRAM, 16 TBytes of SSD, and configurable 10-Gbps interconnection hardware. We developed this testbed in 2016-2019 to validate the flexibility of FPGAs for experimenting with efficient hardware support for HPC communication among tens of thousands of processors and accelerators, in the quest towards Exascale systems and beyond. We present our key design choices regarding overall system architecture, PCBs, and runtime software, and summarize insights resulting from measurement and analysis. Of particular note, our custom interconnect includes a low-cost, low-latency network interface offering user-level zero-copy RDMA, which we have tightly coupled with the ARMv8 processors in the MPSoCs. We have developed a system software runtime on top of these features and have been able to run MPI. We have evaluated our testbed through MPI microbenchmarks, mini-applications, and full MPI applications. Single-hop, one-way latency is 1.3 μs; approximately 0.47 μs of this is attributed to the network interface and the user-space library that exposes its functionality to the runtime. Latency over longer paths increases as expected, reaching 2.55 μs for a five-hop path. Bandwidth tests show that, for a single hop, link utilization reaches 82% of the theoretical capacity. Microbenchmarks based on MPI collectives reveal that broadcast latency scales as expected as the number of participating ranks increases. We also implemented a custom Allreduce accelerator in the network interface, which reduces the latency of such collectives by up to 88%. We assess performance scaling through weak and strong scaling tests for HPCG, LAMMPS, and the miniFE mini-application; for all these tests, parallelization efficiency is at least 69%.
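The latency figures quoted above can be put in rough perspective with a little arithmetic. The sketch below is only illustrative: it assumes that latency grows linearly with hop count, which the abstract does not state (it says only that latency "increases as expected"), and the function names are hypothetical. Under that assumption, each hop beyond the first adds roughly 0.31 μs, the network interface and user-space library account for about 36% of the single-hop latency, and an 88% latency reduction corresponds to a speedup of about 8.3x.

```python
# Figures quoted in the abstract, in microseconds.
ONE_HOP_LATENCY_US = 1.3     # single-hop, one-way latency
FIVE_HOP_LATENCY_US = 2.55   # latency over a five-hop path
NI_AND_LIBRARY_US = 0.47     # network interface + user-space library share

# Maximum Allreduce latency reduction quoted in the abstract.
ALLREDUCE_REDUCTION = 0.88


def per_extra_hop_us(lat_one_hop, lat_n_hops, n_hops):
    """Average latency added by each hop beyond the first.

    ASSUMPTION: latency grows linearly with hop count; the source
    only reports the one-hop and five-hop end points.
    """
    return (lat_n_hops - lat_one_hop) / (n_hops - 1)


def estimate_latency_us(hops):
    """Linear interpolation between the two measured end points."""
    step = per_extra_hop_us(ONE_HOP_LATENCY_US, FIVE_HOP_LATENCY_US, 5)
    return ONE_HOP_LATENCY_US + (hops - 1) * step


def speedup_from_reduction(reduction):
    """Convert a fractional latency reduction into a speedup factor."""
    return 1.0 / (1.0 - reduction)


if __name__ == "__main__":
    step = per_extra_hop_us(ONE_HOP_LATENCY_US, FIVE_HOP_LATENCY_US, 5)
    share = NI_AND_LIBRARY_US / ONE_HOP_LATENCY_US
    print(f"per extra hop: {step:.4f} us")
    print(f"NI + library share of one-hop latency: {share:.1%}")
    print(f"estimated three-hop latency: {estimate_latency_us(3):.3f} us")
    print(f"max Allreduce speedup: {speedup_from_reduction(ALLREDUCE_REDUCTION):.2f}x")
```

The three-hop estimate is an interpolation under the linearity assumption, not a measured value.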




