An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs

06/28/2023
by Haihao Shen et al.

In recent years, Transformer-based language models have become the standard approach for natural language processing tasks. However, stringent throughput and latency requirements in industrial applications limit their adoption. To mitigate the gap, model compression techniques such as structured pruning are being used to improve inference efficiency. However, most existing neural network inference runtimes lack adequate support for structured sparsity. In this paper, we propose an efficient sparse deep learning inference software stack for Transformer-based language models where the weights are pruned with a constant block size. Our sparse software accelerator leverages Intel Deep Learning Boost to maximize the performance of sparse matrix-dense matrix multiplication (commonly abbreviated as SpMM) on CPUs. Our SpMM kernel outperforms the existing sparse libraries (oneMKL, TVM, and LIBXSMM) by an order of magnitude on a wide range of GEMM shapes under 5 representative sparsity ratios (70% to 90%), and shows up to a 5x speedup over the dense GEMM kernel of oneDNN, a well-optimized dense library widely used in industry. We apply our sparse accelerator to widely used Transformer-based language models including BERT-Mini, DistilBERT, BERT-Base, and BERT-Large. Our sparse inference software shows up to 1.5x speedup over Neural Magic's DeepSparse under the same configurations on Xeon instances on Amazon Web Services under proxy production latency constraints. We also compare our solution with two framework-based inference solutions, ONNX Runtime and PyTorch, and demonstrate up to 37x speedup over ONNX Runtime and 345x over PyTorch on Xeon under the same latency constraints. All the source code is publicly available on GitHub: https://github.com/intel/intel-extension-for-transformers.
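To make the core idea concrete, the sketch below emulates in NumPy what a constant-block-size SpMM does: whole weight blocks are pruned to zero, and the multiplication skips zero blocks entirely at inference time. This is a minimal illustration only, not the accelerator's actual kernel: the real kernels leverage Intel Deep Learning Boost instructions, whereas the block size (4x1), function names, and float32 math below are assumptions chosen for clarity.

# Illustrative sketch only: NumPy emulation of block-sparse weight x dense
# activation multiplication (SpMM) with a constant block size. The block
# size (4x1), pruning rule, and function names are assumptions for clarity.
import numpy as np

BLOCK_ROWS, BLOCK_COLS = 4, 1  # constant block size (illustrative)

def prune_to_blocks(weight, sparsity):
    """Zero out whole BLOCK_ROWS x BLOCK_COLS blocks with the smallest norms."""
    rows, cols = weight.shape
    br, bc = rows // BLOCK_ROWS, cols // BLOCK_COLS
    blocks = weight.reshape(br, BLOCK_ROWS, bc, BLOCK_COLS)
    norms = np.linalg.norm(blocks, axis=(1, 3))     # one score per block
    threshold = np.quantile(norms, sparsity)        # keep only the largest blocks
    mask = (norms > threshold)[:, None, :, None]
    return (blocks * mask).reshape(rows, cols)

def block_sparse_spmm(weight, activation):
    """Multiply a block-sparse weight by a dense activation, skipping zero blocks."""
    rows, cols = weight.shape
    out = np.zeros((rows, activation.shape[1]), dtype=activation.dtype)
    for i in range(0, rows, BLOCK_ROWS):
        for j in range(0, cols, BLOCK_COLS):
            block = weight[i:i + BLOCK_ROWS, j:j + BLOCK_COLS]
            if not block.any():        # zero block: no compute, no memory traffic
                continue
            out[i:i + BLOCK_ROWS] += block @ activation[j:j + BLOCK_COLS]
    return out

# Tiny usage example (shapes are arbitrary).
w = prune_to_blocks(np.random.randn(64, 64).astype(np.float32), sparsity=0.8)
x = np.random.randn(64, 8).astype(np.float32)
assert np.allclose(block_sparse_spmm(w, x), w @ x, atol=1e-4)

The point of a constant block size is that each surviving block has a fixed, known shape, so a production kernel can load and multiply it with fixed-width vector instructions instead of per-element gather/scatter, which is where the speedup over unstructured sparsity comes from.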

