FP8 Quantization: The Power of the Exponent

by Andrey Kuzmin, et al.

When quantizing neural networks for efficient inference, low-bit integers are the go-to format. However, low-bit floating point numbers have an extra degree of freedom: some bits can be assigned to work on an exponential scale instead. This paper investigates in depth this benefit of the floating point format for neural network inference. We detail the choices that can be made for the FP8 format, including the important choice of the number of bits for the mantissa and exponent, and show analytically in which settings these choices give better performance. We then show how these findings translate to real networks, provide an efficient implementation for FP8 simulation, and present a new algorithm that enables the learning of both the scale parameters and the number of exponent bits in the FP8 format. Our chief conclusion is that when doing post-training quantization for a wide range of networks, the FP8 format is better than INT8 in terms of accuracy, and the choice of the number of exponent bits is driven by the severity of outliers in the network. We also conduct experiments with quantization-aware training, where the difference between the formats disappears as the network is trained to reduce the effect of outliers.
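To make the mantissa/exponent trade-off concrete, the following is a minimal sketch of how an FP8-style grid can be simulated in floating point. It is not the authors' implementation. The function name `fp8_quantize`, the default bias of 2^(n_exponent - 1), and the choice to reserve no codes for infinity or NaN are all assumptions made for illustration.

```python
import numpy as np

def fp8_quantize(x, n_mantissa=3, n_exponent=4, bias=None):
    """Round x to the nearest point of a simulated sign/exponent/mantissa grid.

    Hypothetical helper: 1 sign bit + n_exponent + n_mantissa bits, with
    subnormals and no codes reserved for inf/NaN (an assumption).
    """
    x = np.asarray(x, dtype=np.float64)
    if bias is None:
        bias = 2 ** (n_exponent - 1)  # assumed default, not taken from the paper

    max_exp = 2 ** n_exponent - 1 - bias   # exponent of the largest binade
    min_exp = 1 - bias                     # smallest normal; subnormals share it
    max_val = (2.0 - 2.0 ** -n_mantissa) * 2.0 ** max_exp

    # frexp gives |x| = m * 2**ex with m in [0.5, 1), so floor(log2|x|) = ex - 1.
    _, ex = np.frexp(np.abs(x))
    e = np.clip(ex - 1, min_exp, max_exp)

    # Within one binade the grid is uniform with spacing 2**(e - n_mantissa).
    step = 2.0 ** (e - n_mantissa)
    q = np.round(x / step) * step          # round to nearest, ties to even

    # Values beyond the largest representable magnitude saturate.
    return np.clip(q, -max_val, max_val)


if __name__ == "__main__":
    values = np.array([0.01, 0.3, 1.0, 7.3, 1000.0])
    for m, e in [(5, 2), (4, 3), (3, 4), (2, 5)]:
        print(f"{m}M{e}E:", fp8_quantize(values, n_mantissa=m, n_exponent=e))
```

Running the loop at the bottom shows the trade-off the abstract describes: with more exponent bits the large value saturates later, at the cost of coarser spacing between neighbouring grid points.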




