Learning Alignment for Multimodal Emotion Recognition from Speech

09/06/2019
by   Haiyang Xu, et al.

Speech emotion recognition is a challenging problem because humans convey emotions in subtle and complex ways. For emotion recognition on human speech, one can either extract emotion-related features from audio signals, or employ speech recognition techniques to generate text from speech and then apply natural language processing to analyze the sentiment. Although emotion recognition can benefit from audio-textual multimodal information, it is not trivial to build a system that learns from multiple modalities. One can build models for the two input sources separately and combine them at the decision level, but this method ignores the interaction between speech and text in the temporal domain. In this paper, we propose to use an attention mechanism to learn the alignment between speech frames and text words, aiming to produce more accurate multimodal feature representations. The aligned multimodal features are fed into a sequential model for emotion recognition. We evaluate the approach on the IEMOCAP dataset, and the experimental results show that the proposed approach achieves state-of-the-art performance on the dataset.
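The alignment idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the scoring function, feature dimensions, or fusion step, so everything below is an assumption made for illustration: scaled dot-product attention is swapped in as the scoring function, the names `align_speech_to_text`, `word_feats`, and `frame_feats` are hypothetical, and the inputs are random placeholders rather than real acoustic or text features. Each word attends over all speech frames, and the resulting word-level speech summary is concatenated with the word feature.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def align_speech_to_text(word_feats, frame_feats):
    """Soft-align speech frames to text words with dot-product attention.

    word_feats:  array of shape (num_words, d), one vector per word.
    frame_feats: array of shape (num_frames, d), one vector per speech frame.
    Returns (aligned_speech, weights) with shapes
    (num_words, d) and (num_words, num_frames).

    Assumption: scaled dot-product scoring; the paper may use a
    different attention function.
    """
    d = word_feats.shape[1]
    # Similarity between every word and every frame.
    scores = word_feats @ frame_feats.T / np.sqrt(d)
    # Each row is a distribution over frames for one word.
    weights = softmax(scores, axis=1)
    # Weighted average of frame features per word.
    aligned_speech = weights @ frame_feats
    return aligned_speech, weights


# Placeholder inputs: 5 words, 40 speech frames, 16-dim features.
rng = np.random.default_rng(0)
word_feats = rng.normal(size=(5, 16))
frame_feats = rng.normal(size=(40, 16))

aligned_speech, weights = align_speech_to_text(word_feats, frame_feats)

# One simple fusion choice (an assumption): concatenate per word.
fused = np.concatenate([word_feats, aligned_speech], axis=1)
```

In this sketch, `fused` is a word-level sequence of multimodal features that a sequential model such as an LSTM could consume for emotion classification; that downstream model is omitted here.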

research
04/08/2023

An Empirical Study and Improvement for Speech Emotion Recognition

Multimodal speech emotion recognition aims to detect speakers' emotions ...
research
07/26/2022

Multimodal Speech Emotion Recognition using Cross Attention with Aligned Audio and Text

In this paper, we propose a novel speech emotion recognition model calle...
research
10/24/2020

Learning Fine-Grained Multimodal Alignment for Speech Emotion Recognition

Speech emotion recognition is a challenging task because the emotion exp...
research
08/04/2023

Capturing Spectral and Long-term Contextual Information for Speech Emotion Recognition Using Deep Learning Techniques

Traditional approaches in speech emotion recognition, such as LSTM, CNN,...
research
10/20/2019

Speech Emotion Recognition with Dual-Sequence LSTM Architecture

Speech Emotion Recognition (SER) has emerged as a critical component of ...
research
05/07/2023

Learning Robust Self-attention Features for Speech Emotion Recognition with Label-adaptive Mixup

Speech Emotion Recognition (SER) is to recognize human emotions in a nat...
research
05/17/2018

Convolutional Attention Networks for Multimodal Emotion Recognition from Speech and Text Data

Emotion recognition has become a popular topic of interest, especially i...
