Relevance Assessments for Web Search Evaluation: Should We Randomise or Prioritise the Pooled Documents? (CORRECTED VERSION)

11/02/2022
by Tetsuya Sakai, et al.

In the context of depth-k pooling for constructing web search test collections, we compare two approaches to ordering pooled documents for relevance assessors: the prioritisation strategy (PRI) used widely at NTCIR, and the simple randomisation strategy (RND). In order to address research questions regarding PRI and RND, we have constructed and released the WWW3E8 data set, which contains eight independent relevance labels for 32,375 topic-document pairs, i.e., a total of 259,000 labels. Four of the eight relevance labels were obtained from PRI-based pools; the other four were obtained from RND-based pools. Using WWW3E8, we compare PRI and RND in terms of inter-assessor agreement, system ranking agreement, and robustness to new systems that did not contribute to the pools. We also utilise an assessor activity log we obtained as a byproduct of WWW3E8 to compare the two strategies in terms of assessment efficiency.
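To make the two assessment-ordering strategies concrete, here is a minimal sketch in Python. The depth-k pool is the union of the top-k documents from the participating runs; PRI presents pooled documents in some priority order while RND shuffles them. Note that the priority score below (a reciprocal-rank sum over the runs that retrieved each document) is an illustrative assumption, not the exact NTCIR prioritisation formula, and all function names are hypothetical.

```python
import random
from collections import defaultdict

def depth_k_pool(runs, k):
    """Union of the top-k documents from each run's ranked list."""
    pool = set()
    for ranking in runs:
        pool.update(ranking[:k])
    return pool

def pri_order(runs, k):
    """PRI-style ordering (illustrative): documents retrieved by more
    runs, and at better ranks, are shown to assessors first.
    The reciprocal-rank score is an assumption, not the NTCIR formula."""
    score = defaultdict(float)
    for ranking in runs:
        for rank, doc in enumerate(ranking[:k], start=1):
            score[doc] += 1.0 / rank
    # Higher score first; break ties by document ID for determinism.
    return sorted(score, key=lambda d: (-score[d], d))

def rnd_order(runs, k, seed=0):
    """RND: present the pooled documents in a random order."""
    pool = sorted(depth_k_pool(runs, k))  # canonical order before shuffling
    random.Random(seed).shuffle(pool)
    return pool
```

For example, with `runs = [["a", "b", "c"], ["b", "d", "a"]]` and `k = 2`, the pool is `{"a", "b", "d"}`; `pri_order` puts "b" first because both runs retrieved it near the top, whereas `rnd_order` returns the same three documents in a seed-dependent order.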


Related research:

- Evaluating Elements of Web-based Data Enrichment for Pseudo-Relevance Feedback Retrieval (03/10/2022)
- TripJudge: A Relevance Judgement Test Collection for TripClick Health Retrieval (08/14/2022)
- Can Old TREC Collections Reliably Evaluate Modern Neural Retrieval Models? (01/26/2022)
- One-Shot Labeling for Automatic Relevance Estimation (02/22/2023)
- Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collections Accurately and Affordably (06/03/2018)
- Learning to Rank from Relevance Judgments Distributions (02/13/2022)
- How Discriminative Are Your Qrels? How To Study the Statistical Significance of Document Adjudication Methods (08/18/2023)
