On the State of German (Abstractive) Text Summarization

01/17/2023
by Dennis Aumiller, et al.

With recent advancements in the area of Natural Language Processing, the focus is slowly shifting from a purely English-centric view towards more language-specific solutions, including German. Text summarization systems, which transform long input documents into compressed and more digestible summary texts, are especially practical for businesses analyzing their growing amount of textual data. In this work, we assess the particular landscape of German abstractive text summarization and investigate the reasons why practically useful solutions for abstractive text summarization are still absent in industry. Our focus is two-fold, analyzing a) training resources, and b) publicly available summarization systems. We show that popular existing datasets exhibit crucial flaws in their assumptions about the original sources, which frequently lead to detrimental effects on system generalization and evaluation biases. We confirm that for the most popular training dataset, MLSUM, over 50% of the training set is unsuitable for abstractive summarization purposes. Furthermore, available systems frequently fail to compare to simple baselines, and ignore more effective and efficient extractive summarization approaches. We attribute poor evaluation quality to a variety of different factors, which are investigated in more detail in this work: a lack of high-quality (and diverse) gold data considered for training, understudied (and untreated) positional biases in some of the existing datasets, and the lack of easily accessible and streamlined pre-processing strategies or analysis tools. We provide a comprehensive assessment of available models on the cleaned datasets, and find that this can lead to a reduction of more than 20 ROUGE-1 points during evaluation. The code for dataset filtering and reproducing results can be found online at https://github.com/dennlinger/summaries
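
To make the notions of a "simple baseline" and "ROUGE-1 points" more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a lead-k extractive baseline (the first k sentences of the article, which also exposes positional bias) and a simplified ROUGE-1 F1 based on plain unigram overlap, without the stemming or language-specific tokenization that established ROUGE packages may apply. The function names `tokenize`, `rouge1_f1`, and `lead_k` are hypothetical and do not come from the linked repository.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercased word tokens; \w is Unicode-aware in Python 3,
    # so German umlauts and ß stay inside their tokens.
    return re.findall(r"\w+", text.lower())


def rouge1_f1(candidate, reference):
    # Simplified ROUGE-1: clipped unigram overlap between candidate
    # and reference, combined into an F1 score (assumption: no stemming).
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def lead_k(document, k=3):
    # Naive sentence split on ., ! or ? followed by whitespace;
    # a real system would use a proper German sentence splitter.
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    return " ".join(sentences[:k])


if __name__ == "__main__":
    article = "Der Hund bellt laut. Die Katze schläft. Es regnet. Niemand geht hinaus."
    gold = "Der Hund bellt, die Katze schläft."
    baseline = lead_k(article, k=2)
    print(baseline)
    print(round(100 * rouge1_f1(baseline, gold), 1), "ROUGE-1 F1 (simplified)")
```

Scores are commonly reported multiplied by 100, which is the sense in which a difference of "20 ROUGE-1 points" is meant. A system that cannot beat `lead_k` on a news dataset is likely exploiting little beyond the position of the opening sentences.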


