Synthetic Data Generation for Grammatical Error Correction with Tagged Corruption Models

05/27/2021
by Felix Stahlberg, et al.

Synthetic data generation is widely known to boost the accuracy of neural grammatical error correction (GEC) systems, but existing methods often lack diversity or are too simplistic to generate the broad range of grammatical errors made by human writers. In this work, we use error type tags from automatic annotation tools such as ERRANT to guide synthetic data generation. We compare several models that can produce an ungrammatical sentence given a clean sentence and an error type tag. We use these models to build a new, large synthetic pre-training data set whose error tag frequency distribution matches that of a given development set. Our synthetic data set yields large and consistent gains, improving the state of the art on the BEA-19 and CoNLL-14 test sets. We also show that our approach is particularly effective in adapting a GEC system, trained on mixed native and non-native English, to a native English test set, even surpassing real training data consisting of high-quality sentence pairs.
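
The tag-matching step described in the abstract can be illustrated with a minimal sketch. It estimates an error tag frequency distribution from development-set annotations and samples one tag per clean sentence, so the tags requested from a corruption model follow roughly the same distribution. The function names (`tag_distribution`, `assign_tags`) and the input format (the tag prepended to the sentence as a leading token) are assumptions made for illustration, not details taken from the paper. The corruption model itself is not shown.

```python
import random
from collections import Counter


def tag_distribution(dev_tags):
    """Estimate relative error tag frequencies from a list of tags
    (for example, ERRANT-style tags extracted from a development set)."""
    counts = Counter(dev_tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


def assign_tags(clean_sentences, distribution, seed=0):
    """Sample one error tag per clean sentence according to `distribution`.

    Returns strings of the form "<tag> <sentence>". This input format is a
    hypothetical choice; a real tagged corruption model may expect another.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    tags = list(distribution)
    weights = [distribution[t] for t in tags]
    sampled = rng.choices(tags, weights=weights, k=len(clean_sentences))
    return [f"{tag} {sentence}" for tag, sentence in zip(sampled, clean_sentences)]


if __name__ == "__main__":
    # Toy development-set tags, invented for this example.
    dev_tags = ["M:DET", "M:DET", "R:PREP", "R:VERB:SVA"]
    dist = tag_distribution(dev_tags)
    for line in assign_tags(["She walks home .", "I like tea ."], dist):
        print(line)
```

Each output line would then be fed to a corruption model that produces an ungrammatical version of the sentence containing an error of the requested type. Pairing that output with the original clean sentence gives one synthetic training example.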

research
08/20/2022

Judge a Sentence by Its Content to Generate Grammatical Errors

Data sparsity is a well-known problem for grammatical error correction (...
research
10/31/2022

Evaluation of large-scale synthetic data for Grammar Error Correction

Grammar Error Correction (GEC) mainly relies on the availability of high ...
research
04/20/2021

Grammatical Error Generation Based on Translated Fragments

We perform neural machine translation of sentence fragments in order to ...
research
09/20/2023

GECTurk: Grammatical Error Correction and Detection Dataset for Turkish

Grammatical Error Detection and Correction (GEC) tools have proven usefu...
research
10/25/2022

Towards standardizing Korean Grammatical Error Correction: Datasets and Annotation

Research on Korean grammatical error correction (GEC) is limited compare...
research
09/26/2018

Wronging a Right: Generating Better Errors to Improve Grammatical Error Detection

Grammatical error correction, like other machine learning tasks, greatly...
research
10/19/2022

Linguistic Rules-Based Corpus Generation for Native Chinese Grammatical Error Correction

Chinese Grammatical Error Correction (CGEC) is both a challenging NLP ta...
