ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection

03/17/2022
by Thomas Hartvigsen, et al.

Toxic language detection systems often falsely flag text that contains minority group mentions as toxic, as those groups are often the targets of online hate. Such over-reliance on spurious correlations also causes systems to struggle with detecting implicitly toxic language. To help mitigate these issues, we create ToxiGen, a new large-scale and machine-generated dataset of 274k toxic and benign statements about 13 minority groups. We develop a demonstration-based prompting framework and an adversarial classifier-in-the-loop decoding method to generate subtly toxic and benign text with a massive pretrained language model. Controlling machine generation in this way allows ToxiGen to cover implicitly toxic text at a larger scale, and about more demographic groups, than previous resources of human-written text. We conduct a human evaluation on a challenging subset of ToxiGen and find that annotators struggle to distinguish machine-generated text from human-written language. We also find that 94.5% of toxic examples are labeled as hate speech by human annotators. Using three publicly-available datasets, we show that finetuning a toxicity classifier on our data improves its performance on human-written data substantially. We also demonstrate that ToxiGen can be used to fight machine-generated toxicity as finetuning improves the classifier significantly on our evaluation subset.
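The classifier-in-the-loop idea is easy to sketch. Below is a minimal, greedy illustration in Python, not the paper's exact decoding procedure: at each step, the language model's top-k candidate tokens are rescored with an off-the-shelf toxicity classifier, steering generation toward benign (or, by flipping a flag, toxic-sounding) continuations of a demonstration-based prompt. The model names, the mixing weight `alpha`, and the assumption that classifier index 1 is the toxic class are all illustrative choices, not taken from the paper.

```python
import torch
from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

# Illustrative model choices (assumptions), not the ones used in the paper.
lm_tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()
clf_name = "s-nlp/roberta_toxicity_classifier"  # assumed binary: index 1 = toxic
clf_tok = AutoTokenizer.from_pretrained(clf_name)
clf = AutoModelForSequenceClassification.from_pretrained(clf_name).eval()

@torch.no_grad()
def guided_step(ids, k=20, alpha=2.0, toward_toxic=False):
    """Pick the next token by mixing the LM's log-probabilities with a
    toxicity score computed on each candidate continuation."""
    logits = lm(ids).logits[0, -1]
    logps, cands = torch.log_softmax(logits, -1).topk(k)
    best_id, best_score = None, float("-inf")
    for lp, tok in zip(logps, cands):
        # Decode the continuation and score it with the toxicity classifier.
        text = lm_tok.decode(torch.cat([ids[0], tok.view(1)]))
        c = clf(**clf_tok(text, return_tensors="pt", truncation=True)).logits[0]
        log_p_toxic = torch.log_softmax(c, -1)[1]
        # Reward or penalize toxicity depending on the target label.
        score = lp + alpha * (log_p_toxic if toward_toxic else -log_p_toxic)
        if score > best_score:
            best_id, best_score = tok, score
    return torch.cat([ids, best_id.view(1, 1)], dim=1)

# Demonstration-based prompting: seed with a few benign statements about a
# group, then let the guided decoder continue the list.
prompt = ("- many immigrants run small businesses in their neighborhoods\n"
          "- immigrant families often speak several languages at home\n"
          "-")
ids = lm_tok(prompt, return_tensors="pt").input_ids
for _ in range(20):
    ids = guided_step(ids)
print(lm_tok.decode(ids[0]))
```

Rescoring full candidate texts with a second model is expensive per step; it is used here only to make the token-level interaction between generator and classifier explicit.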

research
10/06/2020
RoFT: A Tool for Evaluating Human Detection of Machine-Generated Text
In recent years, large neural networks for natural language generation (...

research
04/24/2023
CHEAT: A Large-scale Dataset for Detecting ChatGPT-writtEn AbsTracts
The powerful ability of ChatGPT has caused widespread concern in the aca...

research
08/12/2021
Hatemoji: A Test Suite and Adversarially-Generated Dataset for Benchmarking and Detecting Emoji-based Hate
Detecting online hate is a complex task, and low-performing models have ...

research
02/25/2022
APEACH: Attacking Pejorative Expressions with Analysis on Crowd-Generated Hate Speech Evaluation Datasets
Detecting toxic or pejorative expressions in online communities has beco...

research
05/29/2023
Marked Personas: Using Natural Language Prompts to Measure Stereotypes in Language Models
To recognize and mitigate harms from large language models (LLMs), we ne...

research
07/10/2018
Deep-speare: A Joint Neural Model of Poetic Language, Meter and Rhyme
In this paper, we propose a joint architecture that captures language, r...

research
03/18/2023
NoisyHate: Benchmarking Content Moderation Machine Learning Models with Human-Written Perturbations Online
Online texts with toxic content are a threat in social media that might ...
