Correlation and Prediction of Evaluation Metrics in Information Retrieval

02/01/2018
by Mucahid Kutlu, et al.

Because researchers typically do not have the time or space to present more than a few evaluation metrics in any published study, it can be difficult to assess the relative effectiveness of prior methods on unreported metrics when baselining a new method or conducting a systematic meta-review. While sharing study data would help alleviate this, recent attempts to encourage consistent sharing have been largely unsuccessful. Instead, we propose to enable relative comparisons with prior work across arbitrary metrics by predicting unreported metrics from one or more reported metrics. We also investigate predicting high-cost evaluation measures from low-cost measures as a potential strategy for reducing evaluation cost. We begin by assessing the correlation between 23 IR metrics using 8 TREC test collections. Measuring prediction error with respect to R-squared and Kendall's tau, we show that accurate prediction of MAP, P@10, and RBP can be achieved using only 2-3 other metrics. With regard to lowering evaluation cost, we show that RBP(p=0.95) can be predicted with high accuracy using measures that require an evaluation depth of only 30. Taken together, our findings provide a valuable proof of concept, which we expect to spur follow-on work by others in proposing more sophisticated models for metric prediction.
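As a rough illustration of the idea in the abstract, the sketch below fits a one-predictor least-squares line that estimates one metric (here labelled MAP) from another (here labelled P@10), then scores the predictions with R-squared and Kendall's tau. Everything in it is an assumption made for illustration: the function names are hypothetical, the scores are made-up toy values rather than TREC data, and a single-predictor linear model stands in for whatever models the paper actually uses (the abstract mentions 2-3 predictor metrics).

```python
from itertools import combinations


def fit_line(x, y):
    """Ordinary least squares for y ~ a + b*x with a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def kendall_tau(u, v):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(u)), 2):
        sign = (u[i] - u[j]) * (v[i] - v[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(u) * (len(u) - 1) // 2
    return (concordant - discordant) / n_pairs


# Toy, made-up per-system scores (NOT real TREC results).
train_p10 = [0.20, 0.30, 0.40, 0.50, 0.60]
train_map = [0.10, 0.16, 0.19, 0.26, 0.29]
test_p10 = [0.25, 0.45, 0.55]
test_map = [0.12, 0.24, 0.21]

# Fit on systems where both metrics are known, then predict the
# "unreported" metric for held-out systems.
intercept, slope = fit_line(train_p10, train_map)
pred_map = [intercept + slope * x for x in test_p10]

# R-squared measures how close the predicted scores are to the true ones;
# Kendall's tau measures how well the predicted system ranking is preserved.
r2 = r_squared(test_map, pred_map)
tau = kendall_tau(test_map, pred_map)
print(f"R^2 = {r2:.3f}, Kendall's tau = {tau:.3f}")
```

The two measures answer different questions: R-squared rewards accurate score values, while Kendall's tau only checks whether systems are ordered the same way under the predicted and true metric, which is often what matters when comparing against prior work.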

