Building an Evaluation Scale using Item Response Theory

05/28/2016
by John P. Lalor, et al.

Evaluation of NLP methods requires testing against a previously vetted gold-standard test set and reporting standard metrics (accuracy/precision/recall/F1). The current assumption is that all items in a given test set are equal with regard to difficulty and discriminating power. We propose Item Response Theory (IRT) from psychometrics as an alternative means for gold-standard test-set generation and NLP system evaluation. IRT is able to describe characteristics of individual items (their difficulty and discriminating power) and can account for these characteristics in its estimation of human intelligence or ability for an NLP task. In this paper, we demonstrate IRT by generating a gold-standard test set for Recognizing Textual Entailment. By collecting a large number of human responses and fitting our IRT model, we show that our IRT model compares NLP systems against the performance of a human population and is able to provide more insight into system performance than standard evaluation metrics. We show that a high accuracy score does not always imply a high IRT score, which depends on the item characteristics and the response pattern.
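
The last claim in the abstract can be illustrated with a minimal sketch. It assumes a two-parameter logistic (2PL) IRT model and a simple grid-search maximum-likelihood ability estimate. Both are simplifications chosen for illustration, and the paper's own model and fitting procedure may differ. The item parameters, response patterns, and function names below are hypothetical and are not taken from the paper.

```python
import math

def p_correct(theta, a, b):
    """2PL item characteristic curve: P(correct | ability theta).

    a is the item's discriminating power, b is its difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, items, responses):
    """Log-likelihood of a 0/1 response pattern at ability theta."""
    total = 0.0
    for (a, b), x in zip(items, responses):
        p = p_correct(theta, a, b)
        total += math.log(p) if x else math.log(1.0 - p)
    return total

def estimate_ability(items, responses, lo=-4.0, hi=4.0, steps=801):
    """Maximum-likelihood ability estimate over an evenly spaced grid."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))

# Hypothetical items as (discrimination a, difficulty b) pairs.
ITEMS = [(0.5, -1.0), (0.5, 0.0), (2.0, 0.0), (2.0, 1.0)]

# Two hypothetical systems, each answering 2 of 4 items correctly.
pattern_a = [1, 1, 0, 0]  # correct only on the weakly discriminating items
pattern_b = [0, 0, 1, 1]  # correct only on the highly discriminating items

accuracy_a = sum(pattern_a) / len(pattern_a)
accuracy_b = sum(pattern_b) / len(pattern_b)
theta_a = estimate_ability(ITEMS, pattern_a)
theta_b = estimate_ability(ITEMS, pattern_b)

print(f"accuracy: {accuracy_a:.2f} vs {accuracy_b:.2f}")
print(f"estimated ability: {theta_a:.2f} vs {theta_b:.2f}")
```

Under these assumptions both patterns have the same accuracy, but the pattern that is correct on the highly discriminating items receives the higher ability estimate, because the 2PL likelihood weights each response by its item's discrimination.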


research
06/22/2021

Face Identification Proficiency Test Designed Using Item Response Theory

Measures of face identification proficiency are essential to ensure accu...
research
08/29/2019

Learning Latent Parameters without Human Response Patterns: Item Response Theory with Artificial Crowds

Incorporating Item Response Theory (IRT) into NLP tasks can provide valu...
research
10/20/2021

Better than Average: Paired Evaluation of NLP Systems

Evaluation in NLP is usually done by comparing the scores of competing s...
research
06/18/2023

Efficiently Measuring the Cognitive Ability of LLMs: An Adaptive Testing Perspective

Large language models (LLMs), like ChatGPT, have shown some human-like c...
research
02/27/2017

CIFT: Crowd-Informed Fine-Tuning to Improve Machine Learning Ability

Item Response Theory (IRT) allows for measuring ability of Machine Learn...
research
07/29/2023

Comprehensive Algorithm Portfolio Evaluation using Item Response Theory

Item Response Theory (IRT) has been proposed within the field of Educati...
research
11/01/2021

Using Synthetic Images To Uncover Population Biases In Facial Landmarks Detection

In order to analyze a trained model performance and identify its weak sp...
