Accelerating Deep Learning Inference via Learned Caches

01/18/2021
by Arjun Balasubramanian, et al.

Deep Neural Networks (DNNs) are seeing increased adoption across domains owing to their high accuracy on real-world problems. However, this accuracy has been achieved by building ever-deeper networks, posing a fundamental challenge to the low-latency inference that user-facing applications demand. Current low-latency solutions either trade off accuracy or fail to exploit the inherent temporal locality in prediction serving workloads. We observe that caching the hidden-layer outputs of a DNN introduces a form of late binding, in which each inference request consumes only the amount of computation it needs. This enables low latencies while also exploiting temporal locality. However, traditional caching approaches incur high memory overheads and lookup latencies, leading us to design learned caches: caches that consist of simple ML models that are continuously updated. We present the design of GATI, an end-to-end prediction serving system that incorporates learned caches for low-latency DNN inference. Results show that GATI reduces inference latency by up to 7.69x on realistic workloads.
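To make the mechanism concrete, here is a minimal sketch of the idea the abstract describes: small learned predictors attached to intermediate layers that let a request exit early when they are confident. This is not GATI's actual implementation; all class, function, and parameter names below (LearnedCache, cached_inference, threshold) are hypothetical illustrations of the general technique.

```python
# Sketch only: illustrates learned caches over hidden-layer outputs,
# not GATI's real design. Assumes PyTorch and batch size 1.
import torch
import torch.nn as nn


class LearnedCache(nn.Module):
    """A cheap model mapping a hidden activation to a predicted label
    plus a confidence score. In the paper's framing, such models would
    be continuously retrained on recent requests so they track the
    temporal locality of the serving workload."""

    def __init__(self, hidden_dim: int, num_classes: int):
        super().__init__()
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, h: torch.Tensor):
        probs = torch.softmax(self.head(h), dim=-1)
        conf, label = probs.max(dim=-1)
        return label, conf


def cached_inference(layers, caches, x, threshold=0.9):
    """Run the DNN layer by layer. After any layer with a learned cache
    attached, query the cache; if it is confident enough, return early.
    This is the "late binding" behavior: the request consumes only as
    much computation as it needs."""
    h = x
    for i, layer in enumerate(layers):
        h = layer(h)
        cache = caches.get(i)  # caches: dict {layer index -> LearnedCache}
        if cache is not None:
            label, conf = cache(h.flatten(1))
            if conf.item() >= threshold:  # .item() assumes batch size 1
                return label, i  # cache hit at layer i
    # Cache miss at every checkpoint: use the full network's output.
    return h.argmax(dim=-1), len(layers) - 1
```

The confidence threshold governs the latency/accuracy trade-off in this sketch: lowering it makes early exits more frequent but riskier, while raising it pushes more requests through the full network.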

Related research:
- 02/07/2020: Accelerating Deep Learning Inference via Freezing
- 07/03/2020: CacheNet: A Model Caching Framework for Deep Learning Inference on the Edge
- 06/03/2020: Serving DNNs like Clockwork: Performance Predictability from the Bottom Up
- 09/18/2022: Improving the Performance of DNN-based Software Services using Automated Layer Caching
- 06/21/2023: Subgraph Stationary Hardware-Software Inference Co-Design
- 11/24/2018: TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep Learning Inference in Function as a Service Environments
- 08/31/2022: Orloj: Predictably Serving Unpredictable DNNs
