Chrion: Optimizing Recurrent Neural Network Inference by Collaboratively Utilizing CPUs and GPUs

by   Zinuo Cai, et al.

Deploying deep learning models in cloud clusters provides efficient and prompt inference services to accommodate the widespread application of deep learning. These clusters are usually equipped with host CPUs and accelerators with distinct responsibilities to handle serving requests, i.e. generalpurpose CPUs for input preprocessing and domain-specific GPUs for forward computation. Recurrent neural networks play an essential role in handling temporal inputs and display distinctive computation characteristics because of their high inter-operator parallelism. Hence, we propose Chrion to optimize recurrent neural network inference by collaboratively utilizing CPUs and GPUs. We formulate the model deployment in the CPU-GPU cluster as an NP-hard scheduling problem of directed acyclic graphs on heterogeneous devices. Given an input model in the ONNX format and user-defined SLO requirement, Chrion firstly preprocesses the model by model parsing and profiling, and then partitions the graph to select execution devices for each operator. When an online request arrives, Chrion performs forward computation according to the graph partition by executing the operators on the CPU and GPU in parallel. Our experimental results show that the execution time can be reduced by 19.4 latency-optimal pattern and GPU memory footprint by 67.5 pattern compared with the execution on the GPU.


page 4

page 11


Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning

Deep learning (DL) frameworks take advantage of GPUs to improve the spee...

Characterizing and Understanding HGNNs on GPUs

Heterogeneous graph neural networks (HGNNs) deliver powerful capacity in...

FaaSwap: SLO-Aware, GPU-Efficient Serverless Inference via Model Swapping

The dynamic request patterns of machine learning (ML) inference workload...

Elixir: Train a Large Language Model on a Small GPU Cluster

In recent years, the number of parameters of one deep learning (DL) mode...

At-Scale Sparse Deep Neural Network Inference with Efficient GPU Implementation

This paper presents GPU performance optimization and scaling results for...

Serving DNN Models with Multi-Instance GPUs: A Case of the Reconfigurable Machine Scheduling Problem

Multi-Instance GPU (MIG) is a new feature introduced by NVIDIA A100 GPUs...

Optimizing Performance of Recurrent Neural Networks on GPUs

As recurrent neural networks become larger and deeper, training times fo...

Please sign up or login with your details

Forgot password? Click here to reset