We test the hypothesis that language models trained with reinforcement l...
As AI systems become more capable, we would like to enlist their help to...
"Induction heads" are attention heads that implement a simple algorithm ...
Neural networks often pack many unrelated concepts into a single neuron ...
We describe our early efforts to red team language models in order to si...
We study whether language models can evaluate the validity of their own ...
Recent large language models have been trained on vast datasets, but als...
We apply preference modeling and reinforcement learning from human feedb...
Large-scale pre-training has recently emerged as a technique for creatin...
Given the broad capabilities of large language models, it should be poss...
On April 13th, 2019, OpenAI Five became the first AI system to defeat th...
We propose a rejection sampling scheme using the discriminator of a GAN ...
We introduce a two-player contest for evaluating the safety and robustne...
We explore a new way to evaluate generative models using insights from e...
Recent work (Pennington et al., 2017) suggests that controlling the entir...