Data Augmentation with Unsupervised Speaking Style Transfer for Speech Emotion Recognition

by Leyuan Qu, et al.

Currently, the performance of Speech Emotion Recognition (SER) systems is mainly constrained by the absence of large-scale labelled corpora. Data augmentation is regarded as a promising remedy, either by borrowing methods from Automatic Speech Recognition (ASR), such as speed and pitch perturbation, or by generating emotional speech with generative adversarial networks. In this paper, we propose EmoAug, a novel style transfer model to augment emotion expressions, in which a semantic encoder and a paralinguistic encoder represent verbal and non-verbal information, respectively. A decoder then reconstructs speech signals by conditioning on these two information flows in an unsupervised fashion. Once training is completed, EmoAug enriches expressions of emotional speech in different prosodic attributes, such as stress, rhythm and intensity, by feeding different styles into the paralinguistic encoder. We can also generate similar numbers of samples for each class to tackle the data imbalance issue. Experimental results on the IEMOCAP dataset demonstrate that EmoAug can successfully transfer different speaking styles while retaining the speaker identity and semantic content. Furthermore, we train an SER model with data augmented by EmoAug and show that it not only surpasses state-of-the-art supervised and self-supervised methods but also overcomes overfitting problems caused by data imbalance. Some audio samples can be found on our demo website.
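The class-balancing idea mentioned in the abstract can be illustrated with a minimal sketch. It only computes how many synthetic utterances each emotion class would need to reach the size of the largest class; how those utterances are actually generated is left to the style transfer model. The function name `augmentation_plan` and the toy label counts are assumptions made for illustration and are not taken from the paper or from IEMOCAP statistics.

```python
from collections import Counter


def augmentation_plan(labels):
    """Return, per class, how many synthetic samples are needed
    so that every class matches the size of the largest class."""
    counts = Counter(labels)
    if not counts:
        return {}
    target = max(counts.values())
    return {label: target - n for label, n in counts.items()}


# Toy label distribution (made-up numbers, for illustration only).
labels = ["neutral"] * 5 + ["happy"] * 2 + ["sad"] * 3 + ["angry"] * 4
plan = augmentation_plan(labels)
print(plan)
```

Under this sketch, the majority class needs no extra samples, and each minority class receives exactly its deficit, so all classes end up with similar counts after augmentation.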




Disentangling Prosody Representations with Unsupervised Speech Reconstruction

Human speech can be characterized by different components, including sem...

Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset

Emotional voice conversion aims to transform emotional prosody in speech...

Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition

In this paper, we explored how to boost speech emotion recognition (SER)...

Augmenting Generative Adversarial Networks for Speech Emotion Recognition

Generative adversarial networks (GANs) have shown potential in learning ...

A Preliminary Study on Augmenting Speech Emotion Recognition using a Diffusion Model

In this paper, we propose to utilise diffusion models for data augmentat...

Read it to me: An emotionally aware Speech Narration Application

In this work we try to perform emotional style transfer on audios. In pa...

Speech Emotion Recognition with Multiscale Area Attention and Data Augmentation

In Speech Emotion Recognition (SER), emotional characteristics often app...
