Foundation Model Assisted Automatic Speech Emotion Recognition: Transcribing, Annotating, and Augmenting

by   Tiantian Feng, et al.

Significant advances are being made in speech emotion recognition (SER) using deep learning models. Nonetheless, training SER systems remains challenging, requiring both time and costly resources. Like many other machine learning tasks, acquiring datasets for SER requires substantial data annotation efforts, including transcription and labeling. These annotation processes present challenges when attempting to scale up conventional SER systems. Recent developments in foundational models have had a tremendous impact, giving rise to applications such as ChatGPT. These models have enhanced human-computer interactions including bringing unique possibilities for streamlining data collection in fields like SER. In this research, we explore the use of foundational models to assist in automating SER from transcription and annotation to augmentation. Our study demonstrates that these models can generate transcriptions to enhance the performance of SER systems that rely solely on speech data. Furthermore, we note that annotating emotions from transcribed speech remains a challenging task. However, combining outputs from multiple LLMs enhances the quality of annotations. Lastly, our findings suggest the feasibility of augmenting existing speech emotion datasets by annotating unlabeled speech samples.


page 1

page 2

page 3

page 4


Generative Emotional AI for Speech Emotion Recognition: The Case for Synthetic Emotional Speech Augmentation

Despite advances in deep learning, current state-of-the-art speech emoti...

A Dataset for Speech Emotion Recognition in Greek Theatrical Plays

Machine learning methodologies can be adopted in cultural applications a...

Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation

Automatic speech emotion recognition (SER) is a challenging task that pl...

LanSER: Language-Model Supported Speech Emotion Recognition

Speech emotion recognition (SER) models typically rely on costly human-l...

Implicit Design Choices and Their Impact on Emotion Recognition Model Development and Evaluation

Emotion recognition is a complex task due to the inherent subjectivity i...

The Origin and Value of Disagreement Among Data Labelers: A Case Study of the Individual Difference in Hate Speech Annotation

Human annotated data is the cornerstone of today's artificial intelligen...

Please sign up or login with your details

Forgot password? Click here to reset