SWiPE: A Dataset for Document-Level Simplification of Wikipedia Pages

05/30/2023
by Philippe Laban, et al.

Text simplification research has mostly focused on sentence-level simplification, even though many desirable edits, such as adding relevant background information or reordering content, may require document-level context. Prior work has also predominantly framed simplification as a single-step, input-to-output task, only implicitly modeling the fine-grained, span-level edits that elucidate the simplification process. To address both gaps, we introduce the SWiPE dataset, which reconstructs the document-level editing process from English Wikipedia (EW) articles to paired Simple English Wikipedia (SEW) articles. In contrast to prior work, SWiPE leverages the entire revision history when pairing pages in order to better identify simplification edits. We work with Wikipedia editors to annotate 5,000 EW-SEW document pairs, labeling more than 40,000 edits with 19 proposed categories. To scale our efforts, we propose several models to automatically label edits, achieving an F-1 score of up to 70.6, indicating that this is a tractable but challenging NLU task. Finally, we categorize the edits produced by several simplification models and find that SWiPE-trained models generate more complex edits while reducing unwanted edits.
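The abstract describes aligned EW-SEW document pairs annotated with span-level, category-labeled edits, and models evaluated with F-1 on the edit-labeling task. The sketch below is a minimal illustration of how such data could be represented and scored; the field names (ew_text, sew_text, Edit), the category strings, and the scoring function are assumptions for illustration, not SWiPE's actual schema or evaluation code.

```python
# Illustrative sketch (not the SWiPE release format): a document pair with
# span-level, category-labeled simplification edits, and an F-1 score over
# predicted edits treated as (span, span, category) triples.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Edit:
    ew_span: Tuple[int, int]   # character span in the original (EW) text
    sew_span: Tuple[int, int]  # character span in the simplified (SEW) text
    category: str              # hypothetical label, e.g. "background_addition"


@dataclass
class DocumentPair:
    ew_text: str
    sew_text: str
    edits: List[Edit] = field(default_factory=list)


def edit_f1(gold: List[Edit], pred: List[Edit]) -> float:
    """F-1 over exact-match (ew_span, sew_span, category) triples."""
    gold_set = {(e.ew_span, e.sew_span, e.category) for e in gold}
    pred_set = {(e.ew_span, e.sew_span, e.category) for e in pred}
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Minimal usage example with made-up texts, spans, and labels.
pair = DocumentPair(
    ew_text="The mitochondrion is an organelle found in most eukaryotic cells.",
    sew_text="Mitochondria are small parts of cells. They make energy for the cell.",
    edits=[
        Edit(ew_span=(0, 66), sew_span=(0, 38), category="lexical_simplification"),
        Edit(ew_span=(66, 66), sew_span=(39, 70), category="background_addition"),
    ],
)
predicted = [
    Edit(ew_span=(0, 66), sew_span=(0, 38), category="lexical_simplification"),
    Edit(ew_span=(66, 66), sew_span=(39, 70), category="reordering"),
]
print(f"edit F-1: {edit_f1(pair.edits, predicted):.2f}")  # 0.50 on this toy pair
```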

