Subgraph Stationary Hardware-Software Inference Co-Design

06/21/2023
by Payman Behnam, et al.

A growing number of applications depend on Machine Learning (ML) functionality and benefit from both higher-quality ML predictions and better timeliness (lower latency). A growing body of research in computer architecture, ML, and systems software focuses on reaching better latency-accuracy tradeoffs for ML models. Efforts include compression, quantization, pruning, early-exit models, mixed DNN precision, and ML inference accelerator designs that minimize latency and energy while preserving delivered accuracy. All of them, however, yield improvements for a single static point in the latency-accuracy tradeoff space. We make a case for applications that operate in dynamically changing deployment scenarios, where no single static point is optimal. We draw on a recently proposed weight-shared SuperNet mechanism to serve a stream of queries, each of which uses (activates) a different SubNet within this weight-shared construct. This creates an opportunity to exploit the inherent temporal locality across queries with our proposed SubGraph Stationary (SGS) optimization. We take a hardware-software co-design approach, with a real implementation of SGS in SushiAccel and a software scheduler, SushiSched, that controls which SubNets to serve and what to cache in real time. Combined, they are vertically integrated into SUSHI, an inference serving stack. For a stream of queries, SUSHI yields up to 25% improvement in latency and a 0.98% increase in served accuracy, with up to 78.7% off-chip energy savings.
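To make the SGS idea concrete, the sketch below shows a locality-aware scheduler in the spirit of SushiSched: for each query it picks the most accurate SubNet that fits the query's latency budget, crediting subgraphs already held stationary on the accelerator, then keeps the served SubNet's subgraphs cached for the next query. This is a minimal illustration under stated assumptions; the SubNet names, the 50% reuse discount, and the whole-SubNet caching policy are invented for the example and are not the paper's actual mechanism.

```python
from dataclasses import dataclass

@dataclass
class SubNet:
    name: str
    layers: frozenset   # shared-weight subgraphs this SubNet activates
    latency_ms: float   # estimated serve latency with a cold cache
    accuracy: float     # profiled accuracy of this SubNet

class SGSScheduler:
    """Toy SubGraph Stationary scheduler (illustrative, not SushiSched)."""

    def __init__(self, subnets):
        self.subnets = subnets
        self.cached = set()  # subgraph weights currently kept stationary

    def pick(self, latency_budget_ms):
        candidates = []
        for s in self.subnets:
            # Fraction of this SubNet's subgraphs already on the accelerator.
            reuse = len(s.layers & self.cached) / len(s.layers)
            # Assumed cost model: cached subgraphs halve their latency share.
            est = s.latency_ms * (1.0 - 0.5 * reuse)
            if est <= latency_budget_ms:
                candidates.append((s, est, reuse))
        if not candidates:
            return None  # no SubNet meets the budget; caller must degrade
        # Serve the most accurate feasible SubNet, breaking ties by reuse.
        best, est, _ = max(candidates, key=lambda c: (c[0].accuracy, c[2]))
        # Keep the served SubNet's subgraphs stationary for the next query,
        # exploiting temporal locality in the query stream.
        self.cached = set(best.layers)
        return best, est

nets = [
    SubNet("small",  frozenset({"b0", "b1"}),             4.0,  0.71),
    SubNet("medium", frozenset({"b0", "b1", "b2"}),       7.0,  0.75),
    SubNet("large",  frozenset({"b0", "b1", "b2", "b3"}), 11.0, 0.78),
]
sched = SGSScheduler(nets)
for budget in (5.0, 8.0, 8.0):  # back-to-back queries benefit from reuse
    served = sched.pick(budget)
    if served:
        net, est = served
        print(f"budget={budget}ms -> {net.name} (est {est:.1f}ms)")
```

In this toy run the same 8 ms budget admits "medium" first and then "large", because subgraphs cached by earlier queries lower the estimated latency of later ones; this is the locality effect SGS exploits, though the real system's cost model and cache policy live in SushiAccel and SushiSched.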

research
05/26/2021

Low-Precision Hardware Architectures Meet Recommendation Model Inference at Scale

Tremendous success of machine learning (ML) and the unabated growth in M...
research
06/03/2019

Willump: A Statistically-Aware End-to-end Optimizer for Machine Learning Inference

Machine learning (ML) has become increasingly important and performance-...
research
02/13/2023

The Framework Tax: Disparities Between Inference Efficiency in Research and Deployment

Increased focus on the deployment of machine learning systems has led to...
research
01/18/2021

Accelerating Deep Learning Inference via Learned Caches

Deep Neural Networks (DNNs) are witnessing increased adoption in multipl...
research
10/26/2022

Desiderata for next generation of ML model serving

Inference is a significant part of ML software infrastructure. Despite t...
research
08/21/2020

Towards Designing a Self-Managed Machine Learning Inference Serving System in Public Cloud

We are witnessing an increasing trend towards using Machine Learning (ML)...
research
05/30/2019

INFaaS: Managed & Model-less Inference Serving

The number of applications relying on inference from machine learning mo...
