Examining the Tip of the Iceberg: A Data Set for Idiom Translation

02/13/2018
by Marzieh Fadaee, et al.

Neural Machine Translation (NMT) has been widely used in recent years with significant improvements for many language pairs. Although state-of-the-art NMT systems are generating progressively better translations, idiom translation remains one of the open challenges in this field. Idioms, a category of multiword expressions, are an interesting language phenomenon where the overall meaning of the expression cannot be composed from the meanings of its parts. A first important challenge is the lack of dedicated data sets for learning and evaluating idiom translation. In this paper we address this problem by creating the first large-scale data set for idiom translation. Our data set is automatically extracted from a widely used German-English translation corpus and includes, for each language direction, a targeted evaluation set where all sentences contain idioms and a regular training corpus where sentences including idioms are marked. We release this data set and use it to perform preliminary NMT experiments as the first step towards better idiom translation.

research
10/17/2017

Paying Attention to Multi-Word Expressions in Neural Machine Translation

Processing of multi-word expressions (MWEs) is a known problem for any n...
research
12/14/2016

How Grammatical is Character-level Neural Machine Translation? Assessing MT Quality with Contrastive Translation Pairs

Analysing translation quality in regards to specific linguistic phenomen...
research
12/19/2016

Neural Machine Translation from Simplified Translations

Text simplification aims at reducing the lexical, grammatical and struct...
research
10/17/2020

A Corpus for English-Japanese Multimodal Neural Machine Translation with Comparable Sentences

Multimodal neural machine translation (NMT) has become an increasingly i...
research
10/10/2022

Automatic Evaluation and Analysis of Idioms in Neural Machine Translation

A major open problem in neural machine translation (NMT) is the translat...
research
08/11/2020

A parallel evaluation data set of software documentation with document structure annotation

This paper accompanies the software documentation data set for machine t...
research
09/06/2019

Self Learning from Large Scale Code Corpus to Infer Structure of Method Invocations

Automatically generating code from a textual description of method invoc...
