Dataset vs Reality: Understanding Model Performance from the Perspective of Information Need

12/06/2022
by   Mengying Yu, et al.

Deep learning technologies have brought us many models that outperform human beings on a few benchmarks. An interesting question is: can these models solve real-world problems well when the settings (e.g., the same input/output) are similar to those of the benchmark datasets? We argue that a model is trained to answer the same information need for which its training dataset was created. Although some datasets may share high structural similarity, e.g., question-answer pairs for the question answering (QA) task and image-caption pairs for the image captioning (IC) task, not all datasets are created for the same information need. To support our argument, we conduct a comprehensive analysis of widely used benchmark datasets for both QA and IC tasks. We compare the dataset creation processes (e.g., crowdsourced, or collected from real users or content providers) from the perspective of information need in the context of information retrieval. To show the differences between datasets, we perform both word-level and sentence-level analyses. We show that data collected from real users or content providers tend to contain richer, more diverse, and more specific words than data annotated by crowdworkers. At the sentence level, data from crowdworkers share similar dependency distributions and exhibit higher structural similarity than data collected from content providers. We believe our findings could partially explain why some datasets are considered more challenging than others for similar tasks. Our findings may also help guide the construction of new datasets.
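The word-level comparison described above can be illustrated with a minimal sketch. The toy corpora and the type-token-ratio metric below are illustrative assumptions, not the paper's exact datasets or measures; they only demonstrate how crowdsourced text with repeated generic templates scores lower on lexical diversity than more specific user-generated text.

```python
def type_token_ratio(sentences):
    """Lexical diversity: number of unique words divided by total words."""
    tokens = [w.lower() for s in sentences for w in s.split()]
    return len(set(tokens)) / len(tokens)

# Toy stand-ins (hypothetical examples): crowdsourced captions often reuse
# generic templates, while text from real users or content providers tends
# to be more varied and specific.
crowdsourced = [
    "a man riding a bike on a street",
    "a man riding a horse on a beach",
    "a dog running on a beach",
]
user_generated = [
    "sunset commute past the harbor cranes",
    "my terrier sprinting through the dunes at dawn",
    "vintage roadster parked outside the bakery",
]

print(type_token_ratio(crowdsourced))   # lower: many repeated generic words
print(type_token_ratio(user_generated)) # higher: richer vocabulary
```

On these toy inputs, the crowdsourced set reuses words like "a", "man", and "beach", driving its ratio down, while nearly every word in the user-generated set is unique. The paper's actual analysis is of course far richer, covering word specificity and sentence-level dependency structure as well.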

