Fast Arbitrary Precision Floating Point on FPGA

04/13/2022
by Johannes de Fine Licht, et al.

Numerical codes that require arbitrary precision floating point (APFP) numbers for their core computation are dominated by elementary arithmetic operations due to the super-linear complexity of multiplication in the number of mantissa bits. APFP computations on conventional software-based architectures are made exceedingly expensive by the lack of native hardware support, requiring elementary operations to be emulated using instructions operating on machine-word-sized blocks. In this work, we show how APFP multiplication on compile-time fixed-precision operands can be implemented as deep FPGA pipelines with a recursively defined Karatsuba decomposition on top of native DSP multiplication. When comparing our design implemented on an Alveo U250 accelerator to a dual-socket 36-core Xeon node running the GNU Multiple Precision Floating-Point Reliable (MPFR) library, we achieve a 9.8x speedup at 4.8 GOp/s for 512-bit multiplication, and a 5.3x speedup at 1.2 GOp/s for 1024-bit multiplication, corresponding to the throughput of more than 351 and 191 CPU cores, respectively. We apply this architecture to general matrix-matrix multiplication, yielding a 10x speedup at 2.0 GOp/s over the Xeon node, equivalent to more than 375 CPU cores, effectively allowing a single FPGA to replace a small CPU cluster. Because some numerical codes, such as semidefinite program solvers, depend heavily on APFP arithmetic, we expect these gains to translate into real-world speedups. Our configurable and flexible HLS-based code provides a high-level software interface for plug-and-play acceleration, and is published as an open source project.
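
The paper's HLS implementation is open source, but the abstract alone does not pin down its interface, so the following is only a minimal C++ sketch of the two ideas it names: one level of the Karatsuba decomposition (three half-width multiplications instead of four) and an APFP-style multiply built on top of it (multiply mantissas, add exponents, renormalize). All names (`karatsuba64`, `Apfp`, `apfp_mul`) and the toy 64-bit-mantissa format are illustrative assumptions, not the paper's API; the actual design recurses this decomposition over much wider compile-time fixed-precision operands until they fit the FPGA's native DSP multipliers.

```cpp
#include <cstdint>
#include <cstdio>

using u128 = unsigned __int128;  // GCC/Clang extension, used here only to hold 128-bit results

// One Karatsuba level: a 64x64 -> 128-bit product from three 32x32 -> 64-bit
// multiplications (schoolbook needs four). The subtractive variant keeps every
// multiplication at exactly 32x32 bits:
//   x*y = z2*2^64 + z1*2^32 + z0, with
//   z0 = x0*y0,  z2 = x1*y1,  z1 = z0 + z2 - (x1 - x0)*(y1 - y0).
u128 karatsuba64(uint64_t x, uint64_t y) {
  const uint64_t x0 = x & 0xFFFFFFFFu, x1 = x >> 32;
  const uint64_t y0 = y & 0xFFFFFFFFu, y1 = y >> 32;
  const uint64_t z0 = x0 * y0;  // multiplication 1: low halves
  const uint64_t z2 = x1 * y1;  // multiplication 2: high halves
  // Use |x1 - x0| and |y1 - y0| and track the sign separately, so the third
  // multiplication also stays within 32x32 bits.
  const uint64_t dx = x1 >= x0 ? x1 - x0 : x0 - x1;
  const uint64_t dy = y1 >= y0 ? y1 - y0 : y0 - y1;
  const bool negative = (x1 >= x0) != (y1 >= y0);  // sign of (x1-x0)*(y1-y0)
  const uint64_t d = dx * dy;   // multiplication 3: |x1-x0| * |y1-y0|
  const u128 z1 = negative ? u128(z0) + z2 + d     // z1 can be up to 65 bits
                           : u128(z0) + z2 - d;
  return (u128(z2) << 64) + (z1 << 32) + z0;
}

// Hypothetical APFP-style multiply on a toy format (value = mantissa * 2^exponent,
// mantissa normalized to [2^63, 2^64)): multiply the mantissas with the Karatsuba
// kernel, add the exponents, and renormalize the truncated result.
struct Apfp {
  uint64_t mantissa;  // most significant bit always set
  int64_t exponent;
  bool negative;
};

Apfp apfp_mul(Apfp a, Apfp b) {
  const u128 m = karatsuba64(a.mantissa, b.mantissa);  // in [2^126, 2^128)
  // Keep the top 64 bits; the product's leading bit is at position 127 or 126.
  const int shift = (m >> 127) ? 64 : 63;
  return {uint64_t(m >> shift), a.exponent + b.exponent + shift,
          a.negative != b.negative};
}

int main() {
  const uint64_t x = 0x123456789ABCDEF0ull, y = 0xFEDCBA9876543210ull;
  const bool ok = karatsuba64(x, y) == u128(x) * y;  // check against built-in multiply
  std::printf("karatsuba64 matches built-in multiply: %s\n", ok ? "yes" : "no");
  return ok ? 0 : 1;
}
```

In the paper's setting this recursion is unrolled at compile time into a deep pipeline, with the base-case multiplications mapped to DSP blocks rather than CPU instructions; the sketch above only shows why each level trades one multiplication for a few additions, which is what makes wide mantissa products cheaper than schoolbook emulation on machine words.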
