Anchor Points: Benchmarking Models with Much Fewer Examples

09/14/2023
by Rajan Vivek, et al.

Modern language models often exhibit powerful but brittle behavior, which has driven the development of larger and more diverse benchmarks to assess them reliably. Here, we suggest that model performance can be benchmarked and elucidated with much smaller evaluation sets. We first show that, in six popular language classification benchmarks, model confidence in the correct class is strongly correlated across models for many pairs of points. We build on this phenomenon to propose Anchor Point Selection, a technique for selecting small subsets of a dataset that capture model behavior across the entire dataset. Anchor points reliably rank models: across 87 diverse language model-prompt pairs, evaluating models on 1-30 anchor points ranks them more accurately than uniform sampling and other baselines. Moreover, only a few anchor points suffice to estimate a model's per-class predictions on all other points in a dataset with low mean absolute error, which is sufficient for gauging where the model is likely to fail. Lastly, we present Anchor Point Maps for visualizing these insights and for comparing the performance of different models on various regions of the dataset distribution.
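The abstract does not spell out how anchor points are chosen, so the following is only a minimal sketch of one way the idea could be realized, not the authors' procedure. It assumes a matrix `conf` of shape (models, points) holding each model's confidence in the correct class, and greedily picks points whose cross-model correlation with the rest of the dataset is highest (a facility-location style heuristic swapped in for illustration). The function names `select_anchor_points` and `estimate_mean_confidence` are hypothetical.

```python
import numpy as np


def select_anchor_points(conf, k):
    """Greedily pick k anchor points (illustrative heuristic, not the paper's method).

    conf: array of shape (n_models, n_points); entry [m, i] is model m's
          confidence in the correct class of point i.
    Returns (anchors, assign): the chosen point indices, and for every point
    the position in `anchors` of its most correlated anchor.
    """
    conf = np.asarray(conf, dtype=float)
    # Correlation between points, computed across models.
    corr = np.corrcoef(conf.T)
    # Points with constant confidence yield NaN correlations; treat as 0.
    corr = np.nan_to_num(corr, nan=0.0)
    np.fill_diagonal(corr, 1.0)

    n_points = corr.shape[0]
    anchors = []
    # Best correlation of each point to any anchor chosen so far.
    best = np.full(n_points, -np.inf)
    for _ in range(k):
        # Total coverage if each candidate column were added as an anchor.
        gains = np.maximum(best[:, None], corr).sum(axis=0)
        if anchors:
            gains[anchors] = -np.inf  # never pick the same point twice
        choice = int(np.argmax(gains))
        anchors.append(choice)
        best = np.maximum(best, corr[:, choice])

    # Assign every point to its most correlated anchor.
    assign = np.argmax(corr[:, anchors], axis=1)
    return anchors, assign


def estimate_mean_confidence(anchor_conf, assign):
    """Estimate a new model's dataset-wide mean confidence from anchors only.

    anchor_conf: the new model's confidence on each anchor, in the same
                 order as the `anchors` list returned above.
    Each anchor is weighted by the number of points assigned to it.
    """
    anchor_conf = np.asarray(anchor_conf, dtype=float)
    weights = np.bincount(assign, minlength=len(anchor_conf))
    return float(weights @ anchor_conf / weights.sum())
```

Under these assumptions, a new model would be evaluated only on the `anchors` indices, and the weighted score from `estimate_mean_confidence` could then be used to rank it against other models. Whether such a score tracks full-dataset performance depends on how strongly the correlations hold for the models in question.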


