Few-shot learning through contextual data augmentation

03/31/2021
by Farid Arthaud, et al.

Machine translation (MT) models used in industries with constantly changing topics, such as translation or news agencies, need to adapt to new data to maintain their performance over time. Our aim is to teach a pre-trained MT model to translate previously unseen words accurately, based on very few examples. We propose (i) an experimental setup allowing us to simulate novel vocabulary appearing in human-submitted translations, and (ii) corresponding evaluation metrics to compare our approaches. We extend a data augmentation approach using a pre-trained language model to create training examples with similar contexts for novel words. We compare different fine-tuning and data augmentation approaches and show that adaptation on the scale of one to five examples is possible. Combining data augmentation with randomly selected training sentences leads to the highest BLEU score and accuracy improvements. Impressively, with only one to five examples, our model reports better accuracy scores than a reference system trained on an average of 313 parallel examples.
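The abstract describes combining few-shot examples and their augmentations with randomly selected training sentences, then scoring how often the novel word is translated correctly. The following is a minimal sketch of that idea, not the authors' implementation. The function names, the mixing size `n_random`, the token-match definition of accuracy, and the toy word "zorgon" are all assumptions made for illustration. The augmented pairs are taken as given, since the abstract does not specify how the language model produces them.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def build_finetuning_set(
    novel_pairs: List[Pair],
    augmented_pairs: List[Pair],
    training_pool: List[Pair],
    n_random: int,
    seed: int = 0,
) -> List[Pair]:
    """Mix few-shot pairs, their augmentations, and random training pairs.

    `n_random` (how many original training sentences to add) is a
    hypothetical knob; the abstract gives no mixing ratio.
    """
    rng = random.Random(seed)
    # Sample without replacement, capped at the pool size.
    sampled = rng.sample(training_pool, min(n_random, len(training_pool)))
    mixed = list(novel_pairs) + list(augmented_pairs) + sampled
    rng.shuffle(mixed)
    return mixed


def novel_word_accuracy(hypotheses: List[str], expected_words: List[str]) -> float:
    """Fraction of hypotheses containing the expected translation as a token.

    This whitespace-token match is an assumed stand-in for the paper's
    accuracy metric, which the abstract does not define.
    """
    if not hypotheses:
        return 0.0
    hits = sum(
        1
        for hyp, word in zip(hypotheses, expected_words)
        if word.lower() in hyp.lower().split()
    )
    return hits / len(hypotheses)


if __name__ == "__main__":
    novel = [("le zorgon arrive", "the zorgon arrives")]      # invented toy data
    augmented = [("un zorgon arrive", "a zorgon arrives")]    # assumed LM output
    pool = [("bonjour", "hello"), ("merci", "thank you"), ("au revoir", "goodbye")]
    print(build_finetuning_set(novel, augmented, pool, n_random=2))
    print(novel_word_accuracy(["the zorgon arrives", "a thing arrives"],
                              ["zorgon", "zorgon"]))
```

Mixing in sentences from the original training data is a common way to limit forgetting during fine-tuning on very few examples. Whether that is the mechanism behind the reported gains is not stated in the abstract.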
