The OoO VLIW JIT Compiler for GPU Inference

01/28/2019
by Paras Jain, et al.

Current trends in Machine Learning (ML) inference on hardware-accelerated devices (e.g., GPUs, TPUs) point to alarmingly low utilization. As ML inference is increasingly time-bounded by tight latency SLOs, increasing data parallelism is not an option. The need for better efficiency motivates GPU multiplexing. Furthermore, existing GPU programming abstractions force programmers to micro-manage GPU resources in an early-binding, context-free fashion. We propose a VLIW-inspired Out-of-Order (OoO) Just-in-Time (JIT) compiler that coalesces and reorders execution kernels at runtime for throughput-optimal device utilization while satisfying latency SLOs. We quantify the inefficiencies of space-only and time-only multiplexing alternatives and demonstrate an achievable 7.7x opportunity gap through spatial coalescing.
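To make the coalescing idea concrete, here is a minimal, hypothetical sketch; it is not the paper's implementation, and names such as KernelCall and schedule_batch are ours. It packs pending kernels into a shared GPU occupancy budget in earliest-deadline order, letting a later-deadline kernel issue ahead of one that does not fit, which loosely mirrors the VLIW-style spatial coalescing and out-of-order reordering the abstract describes.

```python
# Toy sketch only: deadline-aware kernel coalescing, not the paper's system.
import heapq
import time
from dataclasses import dataclass, field

@dataclass(order=True)
class KernelCall:
    deadline: float                              # absolute SLO deadline (seconds); used for ordering
    occupancy: float = field(compare=False)      # fraction of GPU resources the kernel needs
    name: str = field(compare=False, default="kernel")

def schedule_batch(pending, max_occupancy=1.0):
    """Pick a set of kernels to launch together ("coalesce") this cycle.

    Kernels are considered earliest-deadline-first; a kernel that does not fit
    in the remaining occupancy budget is deferred, so later-deadline kernels
    may still be packed ahead of it (out-of-order issue).
    """
    heapq.heapify(pending)
    batch, budget, deferred = [], max_occupancy, []
    while pending:
        k = heapq.heappop(pending)
        if k.occupancy <= budget:
            batch.append(k)        # coalesce into the current "wide" issue slot
            budget -= k.occupancy
        else:
            deferred.append(k)     # defer; reconsidered next cycle
    for k in deferred:
        heapq.heappush(pending, k)
    return batch

# Example: three kernels with different SLOs; two fit together spatially.
now = time.time()
pending = [
    KernelCall(deadline=now + 0.010, occupancy=0.6, name="resnet_conv"),
    KernelCall(deadline=now + 0.050, occupancy=0.5, name="bert_ffn"),
    KernelCall(deadline=now + 0.020, occupancy=0.3, name="lstm_step"),
]
print([k.name for k in schedule_batch(pending)])   # ['resnet_conv', 'lstm_step']
```

A real runtime would dispatch the chosen batch to the device (e.g., via CUDA streams) rather than simply returning it; this sketch only illustrates the packing decision under latency deadlines.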


Related research

09/01/2021  Multi-model Machine Learning Inference Serving with GPU Spatial Partitioning
06/06/2021  Experience Report: Writing A Portable GPU Runtime with OpenMP 5.1
12/31/2018  Dynamic Space-Time Scheduling for GPU Inference
01/26/2023  GPU-based Private Information Retrieval for On-Device Machine Learning Inference
10/27/2021  JACC: An OpenACC Runtime Framework with Kernel-Level and Multi-GPU Parallelization
12/15/2022  Kernel-as-a-Service: A Serverless Interface to GPUs
05/22/2023  A Framework for Fine-Grained Synchronization of Dependent GPU Kernels
