Hidden Biases in Unreliable News Detection Datasets

04/20/2021
by Xiang Zhou, et al.

Automatic unreliable news detection is a research problem with great potential impact. Recently, several papers have shown promising results on large-scale news datasets with models that only use the article itself without resorting to any fact-checking mechanism or retrieving any supporting evidence. In this work, we take a closer look at these datasets. While they all provide valuable resources for future research, we observe a number of problems that may lead to results that do not generalize in more realistic settings. Specifically, we show that selection bias during data collection leads to undesired artifacts in the datasets. In addition, while most systems train and predict at the level of individual articles, overlapping article sources in the training and evaluation data can provide a strong confounding factor that models can exploit. In the presence of this confounding factor, the models can achieve good performance by directly memorizing the site-label mapping instead of modeling the real task of unreliable news detection. We observed a significant drop (>10%) in accuracy for all models tested in a clean split with no train/test source overlap. Using the observations and experimental results, we provide practical suggestions on how to create more reliable datasets for the unreliable news detection task. We suggest future dataset creation include a simple model as a difficulty/bias probe and future model development use a clean non-overlapping site and date split.
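The clean split recommended in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the record fields (`site`, `date`, `label`), the function name `split_by_site_and_date`, and the demo data are all hypothetical, and dates are assumed to be ISO-formatted strings so that string comparison matches chronological order. The idea is to hold out whole sites for testing and to keep only later-dated articles in the test set, so a model cannot score well by memorizing a site-label mapping.

```python
import random


def split_by_site_and_date(articles, cutoff, test_fraction=0.25, seed=0):
    """Return (train, test) with disjoint sites and a date boundary.

    Hypothetical record format: dicts with "site", "date" (ISO string)
    and "label" keys. Articles that fit neither side are dropped.
    """
    # Pick the held-out sites at the site level, not the article level.
    sites = sorted({a["site"] for a in articles})
    random.Random(seed).shuffle(sites)
    n_test = max(1, round(len(sites) * test_fraction))
    test_sites = set(sites[:n_test])

    # Train: non-held-out sites, published before the cutoff.
    train = [a for a in articles
             if a["site"] not in test_sites and a["date"] < cutoff]
    # Test: held-out sites only, published on or after the cutoff.
    test = [a for a in articles
            if a["site"] in test_sites and a["date"] >= cutoff]
    return train, test


def site_overlap(train, test):
    """Sites that appear on both sides; should be empty for a clean split."""
    return {a["site"] for a in train} & {a["site"] for a in test}


# Illustrative data only: four made-up sites, one article on each side
# of the cutoff per site.
demo = [
    {"site": s, "date": d, "label": label}
    for s, label in [("a.example", 0), ("b.example", 1),
                     ("c.example", 0), ("d.example", 1)]
    for d in ("2019-03-01", "2020-09-01")
]

train, test = split_by_site_and_date(demo, cutoff="2020-01-01")
```

Dropping articles that fall on the wrong side of either boundary shrinks the dataset, which is the price of removing both the source and the time leakage. A random article-level split on the same data would serve as the contrast condition when measuring how much accuracy depends on source overlap.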

research
05/24/2023

Identifying Informational Sources in News Articles

News articles are driven by the informational sources journalists use in...
research
10/27/2018

Suspicious News Detection Using Micro Blog Text

We present a new task, suspicious news detection using micro blog text. ...
research
04/06/2023

Towards Corpus-Scale Discovery of Selection Biases in News Coverage: Comparing What Sources Say About Entities as a Start

News sources undergo the process of selecting newsworthy information whe...
research
02/28/2019

Adversarial Training for Satire Detection: Controlling for Confounding Variables

The automatic detection of satire vs. regular news is relevant for downs...
research
01/13/2023

Using the profile of publishers to predict barriers across news articles

Detection of news propagation barriers, being economical, cultural, poli...
research
02/12/2020

Detect and Correct Bias in Multi-Site Neuroimaging Datasets

The desire to train complex machine learning algorithms and to increase ...
research
04/24/2018

Semi-supervised Content-based Detection of Misinformation via Tensor Embeddings

Fake news may be intentionally created to promote economic, political an...
