Speech Wikimedia: A 77 Language Multilingual Speech Dataset

08/30/2023
by Rafael Mosquera Gómez, et al.

The Speech Wikimedia Dataset is a publicly available compilation of audio with transcriptions extracted from Wikimedia Commons. It includes 1780 hours (195 GB) of CC-BY-SA licensed transcribed speech from a diverse set of scenarios and speakers, in 77 different languages. Each audio file has one or more transcriptions in different languages, making this dataset suitable for training speech recognition, speech translation, and machine translation models.
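
To make the last point concrete, the sketch below shows how a single audio file with transcriptions in several languages could supply training pairs for all three tasks. It is only an illustration under stated assumptions: the `Record` layout, its field names, the helper functions, and the example data are hypothetical and do not describe the dataset's actual file format or schema.

```python
from dataclasses import dataclass, field
from itertools import permutations
from typing import Dict, List, Tuple


@dataclass
class Record:
    """Hypothetical record: one audio file plus transcriptions keyed by language code."""
    audio_path: str
    audio_lang: str  # language spoken in the audio (assumed to be known)
    transcriptions: Dict[str, str] = field(default_factory=dict)


def asr_pairs(rec: Record) -> List[Tuple[str, str]]:
    """Speech recognition: audio paired with text in the spoken language."""
    if rec.audio_lang in rec.transcriptions:
        return [(rec.audio_path, rec.transcriptions[rec.audio_lang])]
    return []


def speech_translation_pairs(rec: Record) -> List[Tuple[str, str, str]]:
    """Speech translation: audio paired with text in any other language."""
    return [
        (rec.audio_path, lang, text)
        for lang, text in rec.transcriptions.items()
        if lang != rec.audio_lang
    ]


def text_translation_pairs(rec: Record) -> List[Tuple[str, str, str, str]]:
    """Machine translation: every ordered pair of transcriptions of the same audio."""
    return [
        (src, rec.transcriptions[src], tgt, rec.transcriptions[tgt])
        for src, tgt in permutations(rec.transcriptions, 2)
    ]


# Invented example data, not taken from the dataset.
example = Record(
    audio_path="example.ogg",
    audio_lang="en",
    transcriptions={"en": "Hello world.", "es": "Hola mundo."},
)

print(asr_pairs(example))
print(speech_translation_pairs(example))
print(text_translation_pairs(example))
```

A record with transcriptions in n languages yields at most one recognition pair, up to n - 1 speech translation pairs, and n(n - 1) ordered text translation pairs, which is why multiple transcriptions per file make the same audio reusable across tasks.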
