Attention Based Fully Convolutional Network for Speech Emotion Recognition

06/05/2018
by   Yuanyuan Zhang, et al.
0

Speech emotion recognition is a challenging task for three main reasons: 1) human emotion is abstract, which means it is hard to distinguish; 2) in general, human emotion can only be detected in some specific moments during a long utterance; 3) speech data with emotional labeling is usually limited. In this paper, we present a novel attention based fully convolutional network for speech emotion recognition. We employ fully convolutional network as it is able to handle variable-length speech, free of the demand of segmentation to keep critical information not lost. The proposed attention mechanism can make our model be aware of which time-frequency region of speech spectrogram is more emotion-relevant. Considering limited data, the transfer learning is also adapted to improve the accuracy. Especially, it's interesting to observe obvious improvement obtained with natural scene image based pre-trained model. Validated on the publicly available IEMOCAP corpus, the proposed model outperformed the state-of-the-art methods with a weighted accuracy of 70.4 an unweighted accuracy of 63.9

READ FULL TEXT

Please sign up or login with your details

Forgot password? Click here to reset