Engineering Boolean Matrix Multiplication for Multiple-Accelerator Shared-Memory Architectures

by Matti Karppa, et al.

We study the problem of multiplying two bit matrices with entries either over the Boolean algebra (0,1,∨,∧) or over the binary field (0,1,+,·). We engineer high-performance open-source algorithm implementations for contemporary multiple-accelerator shared-memory architectures, with the objective of time-and-energy-efficient scaling up to input sizes close to the available shared-memory capacity. For example, given two terabinary-bit square matrices as input, our implementations compute the Boolean product in approximately 2100 seconds (1.0 Pbop/s at 3.3 pJ/bop for a total of 2.1 kWh/product) and the binary product in less than 950 seconds (2.4 effective Pbop/s at 1.5 effective pJ/bop for a total of 0.92 kWh/product) on an NVIDIA DGX-1 with power consumption at peak system power (3.5 kW). Our contributions are (a) for the binary product, the use of alternative-basis techniques of Karstadt and Schwartz [SPAA '17] to design novel alternative-basis variants of Strassen's recurrence for 2×2 block multiplication [Numer. Math. 13 (1969)] that have been optimized for both the number of additions and low working memory, (b) structuring the parallel block recurrences and the memory layout for coalescent and register-localized execution on accelerator hardware, (c) low-level engineering of the innermost block products for the specific target hardware, and (d) structuring the top-level shared-memory implementation to feed the accelerators with data and integrate the results for input and output sizes beyond the aggregate memory capacity of the available accelerators.
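To make the distinction between the two products concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a hypothetical row-packed layout in which each matrix row is a Python integer, bit k of `a_rows[i]` holds A[i][k], and bit j of `b_rows[k]` holds B[k][j]. The function names are illustrative only. The Boolean product accumulates rows with OR, while the binary-field product accumulates them with XOR, since addition over the binary field is addition modulo 2.

```python
def _packed_product(a_rows, b_rows, combine):
    """Row-packed product: for each set bit k in row i of A,
    combine row k of B into the accumulator for row i of C."""
    result = []
    for row in a_rows:
        acc = 0
        k = 0
        while row:
            if row & 1:              # A[i][k] == 1
                acc = combine(acc, b_rows[k])
            row >>= 1
            k += 1
        result.append(acc)
    return result


def boolean_matmul(a_rows, b_rows):
    """Product over the Boolean algebra (0,1,OR,AND)."""
    return _packed_product(a_rows, b_rows, lambda x, y: x | y)


def binary_matmul(a_rows, b_rows):
    """Product over the binary field (0,1,+,*), i.e. arithmetic mod 2."""
    return _packed_product(a_rows, b_rows, lambda x, y: x ^ y)


# A = [[1, 1], [0, 1]] and B = [[1, 0], [1, 1]], with bit j = column j.
A = [0b11, 0b10]
B = [0b01, 0b11]
bool_C = boolean_matmul(A, B)
bin_C = binary_matmul(A, B)
```

Here the integer product A·B is [[2, 1], [1, 1]], so the Boolean product is [[1, 1], [1, 1]] and the binary product is [[0, 1], [1, 1]]. The two results differ exactly where an entry of the integer product is even and nonzero. Packing rows into machine words is a common way to process many bit entries per instruction; the specific layouts and kernels used in the paper are not described in this abstract.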




