Pretraining Language Models with Human Preferences

02/16/2023
by Tomasz Korbak et al.

Language models (LMs) are pretrained to imitate internet text, including content that would violate human preferences if generated by an LM: falsehoods, offensive comments, personally identifiable information, low-quality or buggy code, and more. Here, we explore alternative objectives for pretraining LMs in a way that also guides them to generate text aligned with human preferences. We benchmark five objectives for pretraining with human feedback across three tasks and study how they affect the trade-off between alignment and capabilities of pretrained LMs. We find a Pareto-optimal and simple approach among those we explored: conditional training, or learning a distribution over tokens conditional on their human preference scores given by a reward model. Conditional training reduces the rate of undesirable content by up to an order of magnitude, both when generating without a prompt and with an adversarially-chosen prompt. Moreover, conditional training maintains the downstream task performance of standard LM pretraining, both before and after task-specific finetuning. Pretraining with human feedback results in much better preference satisfaction than standard LM pretraining followed by finetuning with feedback, i.e., learning and then unlearning undesirable behavior. Our results suggest that we should move beyond imitation learning when pretraining LMs and incorporate human preferences from the start of training.
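To make the conditional-training objective concrete, the sketch below shows one common way to realize it: score each pretraining segment with a reward model, map the score to a control token by thresholding, prepend that token to the segment, and train with ordinary next-token prediction; at inference, condition on the "good" token. This is a minimal illustration under stated assumptions, not the paper's exact recipe: the token names (<|good|>, <|bad|>), the threshold of 0.0, the GPT-2 backbone, and the helper functions are illustrative choices, and the reward-model scoring step is left as an input.

```python
# Minimal sketch of conditional training with a HuggingFace-style causal LM.
# Assumptions (not from the paper): control tokens <|good|>/<|bad|>, threshold 0.0,
# GPT-2 as the base model, and reward scores supplied by some external reward model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Register control tokens that encode the reward-model judgment.
tokenizer.add_special_tokens({"additional_special_tokens": ["<|good|>", "<|bad|>"]})
model.resize_token_embeddings(len(tokenizer))


def control_token(reward_score: float, threshold: float = 0.0) -> str:
    """Map a scalar reward-model score to a control token by thresholding."""
    return "<|good|>" if reward_score >= threshold else "<|bad|>"


def conditional_training_step(segment: str, reward_score: float, optimizer) -> float:
    """One pretraining step: prepend the control token, then do standard
    next-token prediction over the tagged segment."""
    tagged = control_token(reward_score) + segment
    batch = tokenizer(tagged, return_tensors="pt", truncation=True, max_length=512)
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()


# At inference time, condition on the "good" token to steer generation
# toward content the reward model scores highly.
prompt = tokenizer("<|good|>", return_tensors="pt")
sample = model.generate(**prompt, max_new_tokens=40, do_sample=True)
print(tokenizer.decode(sample[0], skip_special_tokens=True))
```

The key design point is that the model never unlearns anything: undesirable text stays in the corpus but is tagged, so the model learns the conditional distribution over tokens given the preference signal rather than the unconditional imitation objective.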
