FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

by   Ying Sheng, et al.
ETH Zurich
Carnegie Mellon University
berkeley college
Stanford University

The high computational and memory requirements of large language model (LLM) inference make it feasible only with multiple high-end accelerators. Motivated by the emerging demand for latency-insensitive tasks with batched processing, this paper initiates the study of high-throughput LLM inference using limited resources, such as a single commodity GPU. We present FlexGen, a high-throughput generation engine for running LLMs with limited GPU memory. FlexGen can be flexibly configured under various hardware resource constraints by aggregating memory and computation from the GPU, CPU, and disk. By solving a linear programming problem, it searches for efficient patterns to store and access tensors. FlexGen further compresses the weights and the attention cache to 4 bits with negligible accuracy loss. These techniques enable FlexGen to have a larger space of batch size choices and thus significantly increase maximum throughput. As a result, when running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems, reaching a generation throughput of 1 token/s for the first time with an effective batch size of 144. On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours. The code is available at https://github.com/FMInference/FlexGen


page 1

page 2

page 3

page 4


DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale

The past several years have witnessed the success of transformer-based m...

S^3: Increasing GPU Utilization during Generative Inference for Higher Throughput

Generating texts with a large language model (LLM) consumes massive amou...

Typesafe Coordinate Systems in High-Throughput Sequencing Applications

High-throughput sequencing file formats and tools encode coordinate inte...

H_2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

Large Language Models (LLMs), despite their recent impressive accomplish...

Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time

Large language models(LLMs) have sparked a new wave of exciting AI appli...

SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

Large Language Model (LLM) inference consists of two distinct phases - p...

ThundeRiNG: Generating Multiple Independent Random Number Sequences on FPGAs

In this paper, we propose ThundeRiNG, a resource-efficient and high-thro...

Please sign up or login with your details

Forgot password? Click here to reset