Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation

05/03/2020
by   Xuanli He, et al.
0

This paper introduces Dynamic Programming Encoding (DPE), a new segmentation algorithm for tokenizing sentences into subword units. We view the subword segmentation of output sentences as a latent variable that should be marginalized out for learning and inference. A mixed character-subword transformer is proposed, which enables exact log marginal likelihood estimation and exact MAP inference to find target segmentations with maximum posterior probability. DPE uses a lightweight mixed character-subword transformer as a means of pre-processing parallel data to segment output sentences using dynamic programming. Empirical results on machine translation suggest that DPE is effective for segmenting output sentences and can be combined with BPE dropout for stochastic segmentation of source sentences. DPE achieves an average improvement of 0.9 BLEU over BPE (Sennrich et al., 2016) and an average improvement of 0.55 BLEU over BPE dropout (Provilkov et al., 2019) on several WMT datasets including English <=> (German, Romanian, Estonian, Finnish, Hungarian).

READ FULL TEXT
research
07/31/2023

SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation

Sub-word segmentation is an essential pre-processing step for Neural Mac...
research
06/08/2020

Wat zei je? Detecting Out-of-Distribution Translations with Variational Transformers

We detect out-of-training-distribution sentences in Neural Machine Trans...
research
09/03/2014

Overcoming the Curse of Sentence Length for Neural Machine Translation using Automatic Segmentation

The authors of (Cho et al., 2014a) have shown that the recently introduc...
research
09/20/2020

Energy-Based Reranking: Improving Neural Machine Translation Using Energy-Based Models

The discrepancy between maximum likelihood estimation (MLE) and task mea...
research
09/15/2020

Iterative Refinement in the Continuous Space for Non-Autoregressive Neural Machine Translation

We propose an efficient inference procedure for non-autoregressive machi...
research
06/28/2021

R-Drop: Regularized Dropout for Neural Networks

Dropout is a powerful and widely used technique to regularize the traini...
research
10/29/2019

BPE-Dropout: Simple and Effective Subword Regularization

Subword segmentation is widely used to address the open vocabulary probl...

Please sign up or login with your details

Forgot password? Click here to reset