FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

03/13/2023
by Ying Sheng, et al.

The high computational and memory requirements of large language model (LLM) inference make it feasible only with multiple high-end accelerators. Motivated by the emerging demand for latency-insensitive tasks with batched processing, this paper initiates the study of high-throughput LLM inference using limited resources, such as a single commodity GPU. We present FlexGen, a high-throughput generation engine for running LLMs with limited GPU memory. FlexGen can be flexibly configured under various hardware resource constraints by aggregating memory and computation from the GPU, CPU, and disk. By solving a linear programming problem, it searches for efficient patterns to store and access tensors. FlexGen further compresses the weights and the attention cache to 4 bits with negligible accuracy loss. These techniques enable FlexGen to have a larger space of batch size choices and thus significantly increase maximum throughput. As a result, when running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems, reaching a generation throughput of 1 token/s for the first time with an effective batch size of 144. On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours. The code is available at https://github.com/FMInference/FlexGen.
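To make the linear-programming-based placement search concrete, the sketch below sets up a toy linear program that chooses what fraction of the weights and the KV cache to keep on GPU, CPU, and disk so as to minimize estimated per-token I/O time under memory capacity limits. All sizes, bandwidths, and the cost model itself are illustrative assumptions, not FlexGen's actual formulation or API; the real optimizer also accounts for activations, compute/transfer overlap, and latency constraints.

```python
# A toy placement LP: minimize I/O time for non-GPU-resident tensors.
# Every constant below is an illustrative assumption, not FlexGen's model.
import numpy as np
from scipy.optimize import linprog

W = 325.0                        # total weight size in GB (e.g., OPT-175B fp16)
C = 180.0                        # KV-cache size in GB at the target batch size
GPU_MEM, CPU_MEM = 16.0, 200.0   # capacities in GB; disk assumed unbounded
BW_CPU_GPU, BW_DISK = 12.0, 2.0  # transfer bandwidths in GB/s

# Decision variables x = [wg, wc, wd, cg, cc, cd]: fractions of the
# weights (w*) and cache (c*) resident on GPU, CPU, and disk.
# Objective: time to stream in the tensors that do not live on the GPU.
cost = np.array([0, W / BW_CPU_GPU, W / BW_DISK,
                 0, C / BW_CPU_GPU, C / BW_DISK])

# Each tensor's placement fractions must sum to 1.
A_eq = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 0, 0, 1, 1, 1]])
b_eq = np.array([1.0, 1.0])

# Memory capacity constraints on GPU and CPU.
A_ub = np.array([[W, 0, 0, C, 0, 0],
                 [0, W, 0, 0, C, 0]])
b_ub = np.array([GPU_MEM, CPU_MEM])

res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, 1)] * 6, method="highs")
wg, wc, wd, cg, cc, cd = res.x
print(f"weights: {wg:.0%} GPU / {wc:.0%} CPU / {wd:.0%} disk")
print(f"cache:   {cg:.0%} GPU / {cc:.0%} CPU / {cd:.0%} disk")
```

With these numbers the solver fills the 16GB GPU first, then the CPU, and spills the remainder to disk, which is the qualitative behavior the policy search is after.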
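The 4-bit compression of weights and the attention cache can likewise be pictured with a minimal group-wise min/max quantizer, sketched below under assumed parameters (group size 64, asymmetric quantization, hypothetical function names). A real implementation would also pack two 4-bit values per byte for actual storage savings, which is omitted here for clarity.

```python
# A minimal sketch of group-wise asymmetric 4-bit quantization.
# Group size, names, and the scheme details are illustrative assumptions.
import numpy as np

def quantize_4bit(x: np.ndarray, group_size: int = 64):
    """Quantize a flat float array to 4-bit levels, one (scale, min) per group."""
    g = x.reshape(-1, group_size)
    mn = g.min(axis=1, keepdims=True)
    scale = (g.max(axis=1, keepdims=True) - mn) / 15.0  # 16 levels: 0..15
    scale = np.where(scale == 0, 1.0, scale)            # guard constant groups
    q = np.clip(np.round((g - mn) / scale), 0, 15).astype(np.uint8)
    return q, scale, mn

def dequantize_4bit(q: np.ndarray, scale: np.ndarray, mn: np.ndarray) -> np.ndarray:
    """Reconstruct an approximation of the original array."""
    return (q.astype(np.float32) * scale + mn).ravel()

x = np.random.randn(1024).astype(np.float32)
x_hat = dequantize_4bit(*quantize_4bit(x))
print("max abs reconstruction error:", np.abs(x - x_hat).max())
```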

Related research

06/30/2022 · DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale
The past several years have witnessed the success of transformer-based m...

06/09/2023 · S^3: Increasing GPU Utilization during Generative Inference for Higher Throughput
Generating texts with a large language model (LLM) consumes massive amou...

09/14/2022 · Typesafe Coordinate Systems in High-Throughput Sequencing Applications
High-throughput sequencing file formats and tools encode coordinate inte...

06/24/2023 · H_2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
Large Language Models (LLMs), despite their recent impressive accomplish...

05/26/2023 · Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time
Large language models (LLMs) have sparked a new wave of exciting AI appli...

08/31/2023 · SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
Large Language Model (LLM) inference consists of two distinct phases - p...

05/20/2021 · ThundeRiNG: Generating Multiple Independent Random Number Sequences on FPGAs
In this paper, we propose ThundeRiNG, a resource-efficient and high-thro...
