Peering Through Preferences: Unraveling Feedback Acquisition for Aligning Large Language Models

08/30/2023
by Hritik Bansal, et al.

Aligning large language models (LLMs) with human values and intents critically involves the use of human or AI feedback. While dense feedback annotations are expensive to acquire and integrate, sparse feedback presents a structural design choice between ratings (e.g., score Response A on a scale of 1-7) and rankings (e.g., is Response A better than Response B?). In this work, we analyze the effect of this design choice on the alignment and evaluation of LLMs. We uncover an inconsistency problem wherein the preferences inferred from ratings and rankings significantly disagree, 60% of the time, for both human and AI annotators. Our subsequent analysis identifies various facets of annotator bias that explain this phenomenon: for example, human annotators rate denser responses higher while prioritizing accuracy during pairwise judgments. To our surprise, we also observe that the choice of feedback protocol has a significant effect on the evaluation of aligned LLMs. In particular, we find that LLMs that leverage rankings data for alignment (say model X) are preferred over those that leverage ratings data (say model Y) under a rank-based evaluation protocol (is X/Y's response better than the reference response?) but not under a rating-based evaluation protocol (score X/Y's response on a scale of 1-7). Our findings thus shed light on critical gaps in methods for evaluating the real-world utility of language models and their strong dependence on the feedback protocol used for alignment. Our code and data are available at https://github.com/Hritikbansal/sparse_feedback.
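To make the inconsistency problem concrete, the sketch below converts a pair of 1-7 ratings into an implied pairwise preference and compares it with the annotator's explicit ranking. This is a minimal illustration, not the authors' released pipeline (see the GitHub link above); the function names, the tie handling, and the example annotations are assumptions made purely for demonstration.

```python
# Minimal sketch of a rating-vs-ranking consistency check.
# Assumed data layout: each annotation holds two 1-7 ratings for the same
# prompt's responses (A and B) plus the annotator's pairwise ranking choice.

def preference_from_ratings(rating_a: int, rating_b: int) -> str:
    """Infer a pairwise preference ("A", "B", or "tie") from two 1-7 ratings."""
    if rating_a > rating_b:
        return "A"
    if rating_b > rating_a:
        return "B"
    return "tie"

def is_consistent(rating_a: int, rating_b: int, ranking_choice: str) -> bool:
    """True if the rating-implied preference matches the annotator's ranking."""
    return preference_from_ratings(rating_a, rating_b) == ranking_choice

# Hypothetical annotations: the first annotator rates A=6, B=4 but still
# picks B in the pairwise comparison, so ratings and ranking disagree.
annotations = [
    {"rating_a": 6, "rating_b": 4, "ranking": "B"},
    {"rating_a": 3, "rating_b": 5, "ranking": "B"},
    {"rating_a": 7, "rating_b": 7, "ranking": "A"},
]

disagreements = sum(
    not is_consistent(x["rating_a"], x["rating_b"], x["ranking"])
    for x in annotations
)
print(f"disagreement rate: {disagreements / len(annotations):.0%}")
```

Aggregating this per-pair check over a full annotation set gives the kind of disagreement rate the paper reports between rating-based and ranking-based feedback.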
