Efficient Quantized Sparse Matrix Operations on Tensor Cores

09/14/2022
by   Shigang Li, et al.
13

The exponentially growing model size drives the continued success of deep learning, but it brings prohibitive computation and memory cost. From the algorithm perspective, model sparsification and quantization have been studied to alleviate the problem. From the architecture perspective, hardware vendors provide Tensor cores for acceleration. However, it is very challenging to gain practical speedups from sparse, low-precision matrix operations on Tensor cores, because of the strict requirements for data layout and lack of support for efficiently manipulating the low-precision integers. We propose Magicube, a high-performance sparse-matrix library for low-precision integers on Tensor cores. Magicube supports SpMM and SDDMM, two major sparse operations in deep learning with mixed precision. Experimental results on an NVIDIA A100 GPU show that Magicube achieves on average 1.44x (up to 2.37x) speedup over the vendor-optimized library for sparse kernels, and 1.43x speedup over the state-of-the-art with a comparable accuracy for end-to-end sparse Transformer inference.

READ FULL TEXT

page 4

page 5

page 6

page 8

page 9

page 11

research
03/11/2018

NVIDIA Tensor Core Programmability, Performance & Precision

The NVIDIA Volta GPU microarchitecture introduces a specialized unit, ca...
research
09/19/2023

Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

With the fast growth of parameter size, it becomes increasingly challeng...
research
06/23/2021

APNN-TC: Accelerating Arbitrary Precision Neural Networks on Ampere GPU Tensor Cores

Over the years, accelerating neural networks with quantization has been ...
research
09/22/2022

Computing Double Precision Euclidean Distances using GPU Tensor Cores

Tensor cores (TCs) are a type of Application-Specific Integrated Circuit...
research
10/23/2020

Not Half Bad: Exploring Half-Precision in Graph Convolutional Neural Networks

With the growing significance of graphs as an effective representation o...
research
02/11/2022

Learning from distinctive candidates to optimize reduced-precision convolution program on tensor cores

Convolution is one of the fundamental operations of deep neural networks...
research
04/17/2023

Dynamically Reconfigurable Variable-precision Sparse-Dense Matrix Acceleration in Tensorflow Lite

In this paper, we present a dynamically reconfigurable hardware accelera...

Please sign up or login with your details

Forgot password? Click here to reset