Fine-Tuning Language Models with Advantage-Induced Policy Alignment

by Banghua Zhu, et al.

Reinforcement learning from human feedback (RLHF) has emerged as a reliable approach to aligning large language models (LLMs) with human preferences. Among the plethora of RLHF techniques, proximal policy optimization (PPO) is one of the most widely used. Despite its popularity, however, PPO may suffer from mode collapse, instability, and poor sample efficiency. We show that these issues can be alleviated by a novel algorithm that we refer to as Advantage-Induced Policy Alignment (APA), which leverages a squared-error loss function based on estimated advantages. We demonstrate empirically that APA consistently outperforms PPO on language tasks by a large margin when a separate reward model is employed as the evaluator. Compared with PPO, APA also offers more stable control over the deviation from the model's initial policy, ensuring that the model improves its performance without collapsing to deterministic output. Beyond these empirical results, we provide a theoretical justification supporting the design of our loss function.
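The abstract describes APA only at a high level: a squared-error loss on estimated advantages, regularized toward the initial policy. A minimal sketch of what such a loss might look like is below, assuming the regression target is the log-probability of the initial policy shifted by the scaled advantage (the function name, the `lam` temperature parameter, and the omission of the normalization constant are all illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def apa_style_loss(logp_theta, logp_init, advantages, lam=1.0):
    """Squared-error policy-alignment loss (illustrative sketch).

    Regresses the current policy's per-token log-probabilities toward
    log pi_init + A / lam, i.e. the log of an advantage-reweighted
    version of the initial policy (normalizer omitted for simplicity).

    logp_theta : log-probs of sampled tokens under the current policy
    logp_init  : log-probs of the same tokens under the initial policy
    advantages : estimated advantages for those tokens
    lam        : temperature controlling deviation from the initial policy
    """
    target = logp_init + advantages / lam
    return float(np.mean((logp_theta - target) ** 2))

# When advantages are zero, the loss is minimized by staying at the
# initial policy, which gives the stable anchoring the abstract mentions.
loss = apa_style_loss(
    logp_theta=np.array([-1.2, -0.7]),
    logp_init=np.array([-1.2, -0.7]),
    advantages=np.zeros(2),
)
```

Note how larger `lam` shrinks the advantage term, pulling the target back toward the initial policy; this is one plausible reading of the "stable form of control over the deviation from the model's initial policy" claimed above.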




