Mixed-Precision Embedding Using a Cache

10/21/2020
by Jie Amy Yang, et al.

In recommendation systems, practitioners have observed that increasing the number and size of embedding tables often leads to significant improvements in model performance. Given this, and the business importance of these models to major internet companies, embedding tables for personalization tasks have grown to terabyte scale and continue to grow at a significant rate. Meanwhile, these large-scale models are often trained on GPUs, where high-performance memory is a scarce resource, motivating numerous works on compressing embedding tables during training. We propose a novel change to embedding tables using a cache memory architecture: the majority of rows in an embedding table are trained in low precision, while the most frequently or recently accessed rows are cached and trained in full precision. The proposed architectural change works in conjunction with standard precision-reduction and computer-arithmetic techniques such as quantization and stochastic rounding. For an open-source deep learning recommendation model (DLRM) running on the Criteo-Kaggle dataset, we achieve a 3x memory reduction with INT8-precision embedding tables and a full-precision cache whose size is 5% of the embedding tables, while maintaining accuracy. For an industrial-scale model and dataset, we achieve an even higher >7x memory reduction with INT4-precision embedding tables and a cache sized at 1% of the embedding tables, while maintaining accuracy, plus a 16% end-to-end training speedup from reduced GPU-to-host data transfers.
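The abstract describes the mechanism at a high level; the following is a minimal NumPy sketch of the idea, not the paper's implementation. The class name, the per-row int8 scaling, the 5% default cache fraction, and the frequency-based eviction policy are illustrative assumptions: the paper caches the most frequently or recently accessed rows, and stochastic rounding is applied here when a full-precision cached row is flushed back into the low-precision table.

```python
import numpy as np

class CachedMixedPrecisionEmbedding:
    """Sketch: int8 embedding table with a small fp32 cache for hot rows.

    Hypothetical API -- illustrates the idea in the abstract, not the
    paper's actual data structures or eviction policy.
    """

    def __init__(self, num_rows, dim, cache_fraction=0.05, seed=0):
        self.rng = np.random.default_rng(seed)
        weights = self.rng.standard_normal((num_rows, dim)).astype(np.float32)
        # Low-precision backing store: int8 values with a per-row scale.
        self.scale = np.abs(weights).max(axis=1, keepdims=True) / 127.0 + 1e-12
        self.q_rows = np.clip(np.rint(weights / self.scale), -127, 127).astype(np.int8)
        # Small full-precision cache for the hottest rows (5% by default,
        # mirroring the Criteo-Kaggle configuration in the abstract).
        self.cache_capacity = max(1, int(cache_fraction * num_rows))
        self.cache = {}                                  # row index -> fp32 row
        self.hits = np.zeros(num_rows, dtype=np.int64)   # access frequency

    def _dequantize(self, idx):
        return self.q_rows[idx].astype(np.float32) * self.scale[idx]

    def lookup(self, idx):
        """Return the fp32 row for `idx`, promoting hot rows into the cache."""
        self.hits[idx] += 1
        if idx in self.cache:
            return self.cache[idx]
        row = self._dequantize(idx)
        if len(self.cache) < self.cache_capacity:
            self.cache[idx] = row
        else:
            # Evict the least-frequently-hit cached row if this one is hotter.
            coldest = min(self.cache, key=lambda i: self.hits[i])
            if self.hits[idx] > self.hits[coldest]:
                self._writeback(coldest)
                self.cache[idx] = row
        return row

    def _writeback(self, idx):
        """Flush a cached fp32 row back to int8 using stochastic rounding."""
        row = self.cache.pop(idx)
        scaled = row / self.scale[idx]
        floor = np.floor(scaled)
        # Round up with probability equal to the fractional part, so the
        # stored int8 row is an unbiased estimate of the fp32 row.
        round_up = self.rng.random(scaled.shape) < (scaled - floor)
        self.q_rows[idx] = np.clip(floor + round_up, -127, 127).astype(np.int8)

# Tiny usage example: a repeatedly accessed row stays resident in fp32.
emb = CachedMixedPrecisionEmbedding(num_rows=1000, dim=16)
for _ in range(10):
    vec = emb.lookup(42)
print(42 in emb.cache)  # True: row 42 is hot, so it lives in the fp32 cache
```

The stochastic-rounding writeback rounds up with probability equal to the fractional part of the scaled value, so the quantized row is an unbiased estimate of its full-precision counterpart; this unbiasedness is the standard reason stochastic rounding is paired with low-precision training.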


Related research

08/08/2022 · A Frequency-aware Software Cache for Large Recommendation System Embeddings
Deep learning recommendation models (DLRMs) have been widely applied in ...

08/04/2021 · Random Offset Block Embedding Array (ROBE) for CriteoTB Benchmark MLPerf DLRM Model: 1000× Compression and 2.7× Faster Inference
Deep learning for recommendation data is one of the most pervasive a...

03/01/2021 · High-Performance Training by Exploiting Hot-Embeddings in Recommendation Systems
Recommendation models are commonly used learning models that suggest rel...

02/18/2022 · iMARS: An In-Memory-Computing Architecture for Recommendation Systems
Recommendation systems (RecSys) suggest items to users by predicting the...

05/26/2023 · Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time
Large language models (LLMs) have sparked a new wave of exciting AI appli...

10/12/2022 · Clustering Embedding Tables, Without First Learning Them
To work with categorical features, machine learning systems employ embed...

07/21/2022 · The trade-offs of model size in large recommendation models: A 10000× compressed criteo-tb DLRM model (100 GB parameters to mere 10MB)
Embedding tables dominate industrial-scale recommendation model sizes, u...
