Neural Network Quantization with AI Model Efficiency Toolkit (AIMET)

by Sangeetha Siddegowda, et al.

While neural networks have advanced the frontiers in many machine learning applications, they often come at a high computational cost. Reducing the power and latency of neural network inference is vital to integrating modern networks into edge devices with strict power and compute requirements. Neural network quantization is one of the most effective ways of achieving these savings, but the additional noise it induces can lead to accuracy degradation. In this white paper, we present an overview of neural network quantization using the AI Model Efficiency Toolkit (AIMET). AIMET is a library of state-of-the-art quantization and compression algorithms designed to ease the effort required for model optimization and thus drive the broader AI ecosystem towards low-latency, energy-efficient inference. AIMET lets users both simulate and optimize PyTorch and TensorFlow models. Specifically for quantization, AIMET includes various post-training quantization (PTQ, cf. chapter 4) and quantization-aware training (QAT, cf. chapter 5) techniques that guarantee near floating-point accuracy for 8-bit fixed-point inference. We provide a practical guide to quantization via AIMET, covering PTQ and QAT workflows, code examples, and practical tips that enable users to quantize models efficiently and effectively and to reap the benefits of low-bit integer inference.
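To make the idea of quantization noise concrete, the following is a minimal sketch of simulated 8-bit quantization: values are rounded onto a uniform integer grid and then mapped back to floating point. It is not AIMET code and does not use the AIMET API. The function name `quantize_dequantize`, the asymmetric min/max range setting, and the sample weights are all assumptions made for illustration.

```python
def quantize_dequantize(values, num_bits=8):
    """Simulate asymmetric uniform quantization on a list of floats.

    Hypothetical helper for illustration only; not part of AIMET.
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    # The grid range must include zero so that 0.0 is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    # Step size of the integer grid; fall back to 1.0 for an all-zero input.
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = round(-lo / scale)
    simulated = []
    for v in values:
        q = round(v / scale) + zero_point   # map to the integer grid
        q = max(qmin, min(qmax, q))         # clamp to the representable range
        simulated.append((q - zero_point) * scale)  # map back to float
    return simulated, scale


# Illustrative sample weights (made up for this sketch).
weights = [-0.62, -0.1, 0.0, 0.33, 0.91]
simulated, scale = quantize_dequantize(weights)

# The per-value difference is the quantization noise mentioned above.
max_error = max(abs(w - s) for w, s in zip(weights, simulated))
print(f"scale={scale:.6f}, max quantization error={max_error:.6f}")
```

With min/max range setting as assumed here, the rounding error of each value is bounded by half a grid step (`scale / 2`). The error grows as the bit-width shrinks, which is why lower bit-widths tend to cost more accuracy.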



