On the Impact of Noises in Crowd-Sourced Data for Speech Translation

06/28/2022
by Siqi Ouyang, et al.

Training speech translation (ST) models requires large, high-quality datasets. MuST-C is one of the most widely used ST benchmark datasets. It contains around 400 hours of speech-transcript-translation data for each of its eight translation directions. The dataset passes several quality-control filters during creation. However, we find that MuST-C still suffers from three major quality issues: audio-text misalignment, inaccurate translations, and unnecessary speaker names. What impact do these data quality issues have on model development and evaluation? In this paper, we propose an automatic method to fix or filter the above quality issues, using English-German (En-De) translation as an example. Our experiments show that ST models perform better on clean test sets, and that the ranking of the proposed models remains consistent across different test sets. In addition, simply removing misaligned data points from the training set does not lead to a better ST model.

research
08/22/2023

SeamlessM4T-Massively Multilingual Multimodal Machine Translation

What does it take to create the Babel Fish, a tool that can help individ...
research
04/08/2022

GigaST: A 10,000-hour Pseudo Speech Translation Corpus

This paper introduces GigaST, a large-scale pseudo speech translation (S...
research
04/22/2022

LibriS2S: A German-English Speech-to-Speech Translation Corpus

Recently, we have seen an increasing interest in the area of speech-to-t...
research
06/17/2021

Lost in Interpreting: Speech Translation from Source or Interpreter?

Interpreters facilitate multi-lingual meetings but the affordable set of...
research
11/08/2022

SpeechMatrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations

We present SpeechMatrix, a large-scale multilingual corpus of speech-to-...
research
09/20/2023

SpeechAlign: a Framework for Speech Translation Alignment Evaluation

Speech-to-Speech and Speech-to-Text translation are currently dynamic ar...
research
06/11/2021

HUI-Audio-Corpus-German: A high quality TTS dataset

The increasing availability of audio data on the internet lead to a mult...
