RoDia: A New Dataset for Romanian Dialect Identification from Speech

09/06/2023
by   Codrut Rotaru, et al.

Dialect identification is a critical task in speech processing and language technology, enhancing various applications such as speech recognition, speaker verification, and many others. While most research studies have been dedicated to dialect identification in widely spoken languages, limited attention has been given to dialect identification in low-resource languages, such as Romanian. To address this research gap, we introduce RoDia, the first dataset for Romanian dialect identification from speech. The RoDia dataset includes a varied compilation of speech samples from five distinct regions of Romania, covering both urban and rural environments, totaling 2 hours of manually annotated speech data. Along with our dataset, we introduce a set of competitive models to be used as baselines for future research. The top scoring model achieves a macro F1 score of 59.83, indicating that the task is challenging. We thus believe that RoDia is a valuable resource that will stimulate research aiming to address the challenges of Romanian dialect identification. We publicly release our dataset and code at https://github.com/codrut2/RoDia.
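
For readers unfamiliar with the metric reported above, macro F1 is the unweighted mean of the per-class F1 scores, so each dialect region counts equally regardless of how many samples it has. The following is a minimal sketch of that computation, not the authors' evaluation code: the function name `macro_f1` and the placeholder labels `r0`, `r1`, `r2` are assumptions made for illustration, and the toy predictions are not RoDia data.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels (placeholders, not actual RoDia regions or results).
y_true = ["r0", "r0", "r1", "r1", "r2"]
y_pred = ["r0", "r1", "r1", "r1", "r2"]
print(round(macro_f1(y_true, y_pred), 4))
```

Because every class contributes equally to the average, a model that performs poorly on a less represented region is penalized as much as one that performs poorly on a well represented region.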
