Challenging the Boundaries of Speech Recognition: The MALACH Corpus

by Michael Picheny, et al.

Speech recognition has made huge progress over the last several years. Tasks once thought extremely difficult, such as SWITCHBOARD, now approach levels of human performance. The MALACH corpus (LDC catalog LDC2012S05), a 375-hour subset of a large archive of Holocaust testimonies collected by the Survivors of the Shoah Visual History Foundation, presents significant challenges to the speech community. The collection consists of unconstrained, natural speech filled with disfluencies, heavy accents, age-related coarticulations, un-cued speaker and language switching, and emotional speech, all of which remain open problems for speech recognition systems. Transcription is challenging even for skilled human annotators. This paper proposes that the community focus on the MALACH corpus to develop speech recognition systems that are more robust to accents, disfluencies, and emotional speech. To lower the barrier to entry, a lexicon and training and testing setups have been created, and baseline results using current deep learning technologies are presented. The metadata has just been released by LDC (LDC2019S11). It is hoped that this resource will enable the community to build on these baselines so that the extremely important information in these and related oral histories becomes accessible to a wider audience.




