Please, Don't Forget the Difference and the Confidence Interval when Seeking for the State-of-the-Art Status

05/23/2022
by   Yves Bestgen, et al.

This paper argues for the widest possible use of bootstrap confidence intervals for comparing NLP system performances, instead of the state-of-the-art (SOTA) status and statistical significance testing. Their main benefits are to draw attention to the difference in performance between two systems and to help assess the degree of superiority of one system over another. Two case studies, one comparing several systems and the other based on a K-fold cross-validation procedure, illustrate these benefits. A Python module for obtaining these confidence intervals, together with a second function implementing the Fisher-Pitman test for paired samples, is freely available on PyPI.
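The abstract refers to a paired bootstrap confidence interval for the difference between two systems scored on the same test items, and to a Fisher-Pitman test for paired samples. The PyPI module itself is not named here, so the following is only a minimal NumPy sketch of the two ideas, assuming per-item scores (e.g. 0/1 correctness) from both systems on a shared test set; the function names, defaults, and the Monte Carlo approximation of the permutation test are illustrative assumptions, not the published API.

import numpy as np

def paired_bootstrap_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05, seed=0):
    # Percentile bootstrap CI for the mean per-item difference between two systems
    # evaluated on the same test set (sketch, not the author's module).
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, float) - np.asarray(scores_b, float)
    n = diffs.size
    # Resample test items with replacement, keeping the pairing intact,
    # and record the mean difference for each bootstrap sample.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = diffs[idx].mean(axis=1)
    lower, upper = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return diffs.mean(), (lower, upper)

def fisher_pitman_paired(scores_a, scores_b, n_perm=10_000, seed=0):
    # Monte Carlo version of the Fisher-Pitman permutation test for paired samples:
    # under H0 the sign of each per-item difference is exchangeable, so we flip signs at random.
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, float) - np.asarray(scores_b, float)
    observed = diffs.mean()
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    perm_means = (signs * diffs).mean(axis=1)
    # Two-sided p-value: share of sign-flipped samples at least as extreme as the observed mean.
    p_value = (np.abs(perm_means) >= abs(observed)).mean()
    return observed, p_value

# Hypothetical usage with per-item correctness of two systems on ten items:
acc_a = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0])
acc_b = np.array([1, 0, 0, 1, 1, 0, 1, 0, 1, 0])
mean_diff, ci = paired_bootstrap_ci(acc_a, acc_b)
_, p = fisher_pitman_paired(acc_a, acc_b)

Resampling whole items (rather than each system's scores separately) preserves the pairing, which is what makes the interval describe the difference between the systems instead of their individual scores, and is what the paired permutation test relies on as well.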


Related research:

Cross-validation Confidence Intervals for Test Error (07/24/2020)
Resampling-free bootstrap inference for quantiles (02/22/2022)
Estimation of incidence from aggregated current status data (03/06/2023)
On the lengths of t-based confidence intervals (12/07/2018)
Homogeneity Tests and Interval Estimations of Risk Differences for Stratified Bilateral and Unilateral Correlated Data (03/31/2023)
confidence-planner: Easy-to-Use Prediction Confidence Estimation and Sample Size Planning (01/12/2023)
