CORAA: a large corpus of spontaneous and prepared speech manually validated for speech recognition in Brazilian Portuguese

by Arnaldo Candido Junior et al.

Automatic speech recognition (ASR) is a complex and challenging task that has seen significant advances in recent years. For Brazilian Portuguese (BP) in particular, only about 376 hours of audio were publicly available for the ASR task until the second half of 2020. With the release of new datasets in early 2021, this number increased to 574 hours. The existing resources, however, consist of audio containing only read and prepared speech. Datasets that include spontaneous speech, which is essential in many ASR applications, are lacking. This paper presents CORAA (Corpus of Annotated Audios) v1, a publicly available dataset for ASR in BP with 290.77 hours of validated audio-transcription pairs. CORAA also contains European Portuguese audio (4.69 hours). We also present a public ASR model based on Wav2Vec 2.0 XLSR-53 and fine-tuned on CORAA. Our model achieved a Word Error Rate of 24.18 set. When measuring the Character Error Rate, we obtained 11.02 CORAA and Common Voice, respectively. The CORAA corpora were assembled both to improve ASR models in BP with phenomena from spontaneous speech and to motivate young researchers to start their studies on ASR for Portuguese. All the corpora are publicly available under the CC BY-NC-ND 4.0 license.
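The abstract reports results as Word Error Rate (WER) and Character Error Rate (CER). As a reminder of what these metrics measure, the following is a minimal sketch of the standard edit-distance definition, not the authors' evaluation script. The function names are hypothetical, and the choices to split words on whitespace and to count spaces as characters in CER are assumptions; the paper's own text normalization may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists or strings)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1       # drop a reference token
            insertion = curr[j - 1] + 1  # extra hypothesis token
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()  # assumption: whitespace tokenization
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must contain at least one character")
    # assumption: spaces are counted as ordinary characters
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Illustrative sentences only; they are not taken from CORAA.
    reference = "o gato preto dorme"
    hypothesis = "o gato branco dorme"
    print(f"WER: {wer(reference, hypothesis):.2%}")
    print(f"CER: {cer(reference, hypothesis):.2%}")
```

Both metrics are normalized by the reference length, so a hypothesis with many insertions can score above 100%. CER is typically lower than WER on the same output because a single wrong character makes the whole word count as an error at the word level.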


