Deep Representation Learning in Speech Processing: Challenges, Recent Advances, and Future Trends

by   Siddique Latif, et al.

Research on speech processing has traditionally considered the task of designing hand-engineered acoustic features (feature engineering) as a separate distinct problem from the task of designing efficient machine learning (ML) models to make prediction and classification decisions. There are two main drawbacks to this approach: firstly, the feature engineering being manual is cumbersome and requires human knowledge; and secondly, the designed features might not be best for the objective at hand. This has motivated the adoption of a recent trend in speech community towards utilisation of representation learning techniques, which can learn an intermediate representation of the input signal automatically that better suits the task at hand and hence lead to improved performance. The significance of representation learning has increased with advances in deep learning (DL), where the representations are more useful and less dependent on human knowledge, making it very conducive for tasks like classification, prediction, etc. The main contribution of this paper is to present an up-to-date and comprehensive survey on different techniques of speech representation learning by bringing together the scattered research across three distinct research areas including Automatic Speech Recognition (ASR), Speaker Recognition (SR), and Speaker Emotion Recognition (SER). Recent reviews in speech have been conducted for ASR, SR, and SER, however, none of these has focused on the representation learning from speech—a gap that our survey aims to bridge.


page 1

page 3


Do We Still Need Automatic Speech Recognition for Spoken Language Understanding?

Spoken language understanding (SLU) tasks are usually solved by first tr...

Integrating Emotion Recognition with Speech Recognition and Speaker Diarisation for Conversations

Although automatic emotion recognition (AER) has recently drawn signific...

Towards adversarial learning of speaker-invariant representation for speech emotion recognition

Speech emotion recognition (SER) has attracted great attention in recent...

Speaker-invariant Affective Representation Learning via Adversarial Training

Representation learning for speech emotion recognition is challenging du...

Knowing What to Listen to: Early Attention for Deep Speech Representation Learning

Deep learning techniques have considerably improved speech processing in...

Estimating Phoneme Class Conditional Probabilities from Raw Speech Signal using Convolutional Neural Networks

In hybrid hidden Markov model/artificial neural networks (HMM/ANN) autom...

Deep Transfer Learning for Automatic Speech Recognition: Towards Better Generalization

Automatic speech recognition (ASR) has recently become an important chal...

Please sign up or login with your details

Forgot password? Click here to reset