Direct Modelling of Speech Emotion from Raw Speech

by Siddique Latif, et al.

Speech emotion recognition is a challenging task that depends heavily on hand-engineered acoustic features, which are typically crafted to echo human perception of speech signals. However, a filter bank designed from perceptual evidence is not guaranteed to be the best choice in a statistical modelling framework where the end goal is, for example, emotion classification. This has fuelled the emerging trend of learning representations from raw speech, especially using deep neural networks. In particular, combinations of Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks have gained traction in this field: LSTMs for their intrinsic ability to learn the contextual information crucial for emotion recognition, and CNNs for their ability to overcome the scalability problem of regular neural networks. In this paper, we show that there are still opportunities to improve the performance of emotion recognition from raw speech by exploiting the properties of CNNs in modelling contextual information. We propose the use of parallel convolutional layers in the feature extraction block, jointly trained with the LSTM-based classification network for the emotion recognition task. Our results suggest that the proposed model can reach the performance of a CNN with hand-engineered features on the IEMOCAP and MSP-IMPROV datasets.
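
As a rough illustration of what a feature extraction block with parallel convolutional layers might look like, the sketch below runs several 1-D filters of different widths over the same raw waveform and stacks their outputs into one feature vector per time step, which is the kind of sequence a recurrent classifier such as an LSTM would consume. It is not the authors' implementation: the kernel widths, the stride, the toy sine-wave input, and the function names `conv1d` and `parallel_conv_features` are all assumptions made for the example, the filter weights are random rather than learned, and the LSTM and the joint training are omitted.

```python
import math
import random


def conv1d(signal, kernel, stride):
    """Valid 1-D cross-correlation of `signal` with `kernel`, followed by ReLU."""
    width = len(kernel)
    out = []
    for start in range(0, len(signal) - width + 1, stride):
        acc = sum(signal[start + i] * kernel[i] for i in range(width))
        out.append(max(0.0, acc))  # ReLU non-linearity
    return out


def parallel_conv_features(signal, kernels, stride):
    """Apply every kernel to the same raw signal and concatenate per time step.

    Wider kernels produce fewer output frames, so all branches are
    truncated to the length of the shortest one before stacking.
    """
    branches = [conv1d(signal, k, stride) for k in kernels]
    steps = min(len(b) for b in branches)
    return [[b[t] for b in branches] for t in range(steps)]


random.seed(0)

# Toy "raw speech": 400 samples of a sine wave (assumption, not real audio).
signal = [math.sin(2 * math.pi * 5 * n / 400) for n in range(400)]

# Hypothetical kernel widths: short filters see fine detail, long ones see context.
kernel_widths = [8, 32, 128]
kernels = [[random.uniform(-0.1, 0.1) for _ in range(w)] for w in kernel_widths]

# One feature vector per frame; a recurrent classifier would read this sequence.
features = parallel_conv_features(signal, kernels, stride=16)
print(len(features), "frames x", len(features[0]), "parallel branches")
```

In a trainable version, the filter weights would be parameters updated by backpropagation together with the classifier, rather than fixed random values as here.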




Emotion Recognition from Speech

In this work, we conduct an extensive comparison of various approaches t...

Efficient Arabic emotion recognition using deep neural networks

Emotion recognition from speech signal based on deep learning is an acti...

Audio-video Emotion Recognition in the Wild using Deep Hybrid Networks

This paper presents an audiovisual-based emotion recognition hybrid netw...

Learning spectro-temporal features with 3D CNNs for speech emotion recognition

In this paper, we propose to use deep 3-dimensional convolutional networ...

Novel Dual-Channel Long Short-Term Memory Compressed Capsule Networks for Emotion Recognition

Recent analysis on speech emotion recognition has made considerable adva...

Speech Emotion Recognition with Dual-Sequence LSTM Architecture

Speech Emotion Recognition (SER) has emerged as a critical component of ...

Improving Speech Emotion Recognition Performance using Differentiable Architecture Search

Speech Emotion Recognition (SER) is a critical enabler of emotion-aware ...
