Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt

by   Zhaozhuo Xu, et al.

Large Language Models (LLMs), armed with billions of parameters, exhibit exceptional performance across a wide range of Natural Language Processing (NLP) tasks. However, they present a significant computational challenge during inference, especially when deploying on common hardware such as single GPUs. As such, minimizing the latency of LLM inference by curtailing computational and memory requirements, though achieved through compression, becomes critically important. However, this process inevitably instigates a trade-off between efficiency and accuracy, as compressed LLMs typically experience a reduction in predictive precision. In this research, we introduce an innovative perspective: to optimize this trade-off, compressed LLMs require a unique input format that varies from that of the original models. Our findings indicate that the generation quality in a compressed LLM can be markedly improved for specific queries by selecting prompts with precision. Capitalizing on this insight, we introduce a prompt learning paradigm that cultivates an additive prompt over a compressed LLM to bolster their accuracy. Our empirical results imply that through our strategic prompt utilization, compressed LLMs can match, and occasionally even exceed, the accuracy of the original models. Moreover, we demonstrated that these learned prompts have a certain degree of transferability across various datasets, tasks, and compression levels. These insights shine a light on new possibilities for enhancing the balance between accuracy and efficiency in LLM inference. Specifically, they underscore the importance of judicious input editing to a compressed large model, hinting at potential advancements in scaling LLMs on common hardware.


page 10

page 11

page 18


Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers

Since hardware resources are limited, the objective of training deep lea...

Robust Partially-Compressed Least-Squares

Randomized matrix compression techniques, such as the Johnson-Lindenstra...

CPET: Effective Parameter-Efficient Tuning for Compressed Large Language Models

Parameter-efficient tuning (PET) has been widely explored in recent year...

nuQmm: Quantized MatMul for Efficient Inference of Large-Scale Generative Language Models

The recent advance of self-supervised learning associated with the Trans...

A Survey on Model Compression for Large Language Models

Large Language Models (LLMs) have revolutionized natural language proces...

Semantic Compression With Large Language Models

The rise of large language models (LLMs) is revolutionizing information ...

Structural Dropout for Model Width Compression

Existing ML models are known to be highly over-parametrized, and use sig...

Please sign up or login with your details

Forgot password? Click here to reset