Multi-view Frequency LSTM: An Efficient Frontend for Automatic Speech Recognition

06/30/2020
by Maarten Van Segbroeck, et al.

Acoustic models in real-time speech recognition systems typically stack multiple unidirectional LSTM layers to process the acoustic frames over time. Performance improvements over vanilla LSTM architectures have been reported by prepending a stack of frequency-LSTM (FLSTM) layers to the time LSTM. These FLSTM layers can learn a more robust input feature for the time LSTM layers by modeling time-frequency correlations in the acoustic input signals. A drawback of FLSTM-based architectures, however, is that they operate at a predefined, tuned window size and stride, referred to as the 'view' in this paper. We present a simple and efficient modification that combines the outputs of multiple FLSTM stacks with different views into a dimensionality-reduced feature representation. The proposed multi-view FLSTM architecture allows modeling a wider range of time-frequency correlations than an FLSTM model with a single view. When trained on 50K hours of English far-field speech data with CTC loss followed by sMBR sequence training, we show that the multi-view FLSTM acoustic model provides relative Word Error Rate (WER) improvements of 3-7% across different speaker and acoustic environment scenarios over an optimized single-view FLSTM model, while retaining a similar computational footprint.
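To make the frontend concrete, below is a minimal PyTorch sketch of the multi-view FLSTM idea described in the abstract: each view runs a frequency LSTM over overlapping filterbank chunks with its own window size and stride, and the per-view outputs are concatenated and projected to a reduced dimension before being passed to the time-LSTM stack. This is illustrative only, not the authors' implementation; the module name MultiViewFLSTMFrontend, the example (window, stride) views, hidden sizes, and projection dimension are assumptions chosen for readability.

```python
import torch
import torch.nn as nn


class MultiViewFLSTMFrontend(nn.Module):
    """Hypothetical multi-view FLSTM frontend sketch (not the paper's code)."""

    def __init__(self, num_freq_bins=64, views=((8, 4), (16, 8), (24, 12)),
                 flstm_hidden=16, out_dim=128):
        super().__init__()
        self.views = views
        # One FLSTM per view, scanning chunks along the frequency axis.
        self.flstms = nn.ModuleList(
            [nn.LSTM(input_size=win, hidden_size=flstm_hidden, batch_first=True)
             for win, _ in views]
        )
        # Concatenated width = hidden size * number of frequency chunks per view.
        total = 0
        for win, stride in views:
            n_chunks = (num_freq_bins - win) // stride + 1
            total += flstm_hidden * n_chunks
        # Linear projection reduces the multi-view output to a compact feature.
        self.proj = nn.Linear(total, out_dim)

    def forward(self, feats):
        # feats: (batch, time, num_freq_bins) log-filterbank features
        b, t, f = feats.shape
        x = feats.reshape(b * t, f)  # treat every frame independently for the FLSTMs
        view_outputs = []
        for (win, stride), flstm in zip(self.views, self.flstms):
            # Split the frequency axis into overlapping chunks: (b*t, n_chunks, win)
            chunks = x.unfold(1, win, stride).contiguous()
            out, _ = flstm(chunks)               # (b*t, n_chunks, flstm_hidden)
            view_outputs.append(out.reshape(b * t, -1))
        fused = torch.cat(view_outputs, dim=1)   # concatenate all views
        fused = self.proj(fused)                 # dimensionality reduction
        return fused.reshape(b, t, -1)           # ready for the time-LSTM stack


# Example: 80 frames of 64-dim filterbanks -> 128-dim multi-view features per frame.
frontend = MultiViewFLSTMFrontend()
features = torch.randn(2, 80, 64)
print(frontend(features).shape)  # torch.Size([2, 80, 128])
```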


Related research:

10/09/2014 - Spatial Diffuseness Features for DNN-Based Speech Recognition in Noisy and Reverberant Environments
We propose a spatial diffuseness feature for deep neural network (DNN)-b...

11/13/2019 - 3-D Feature and Acoustic Modeling for Far-Field Speech Recognition
Automatic speech recognition in multi-channel reverberant conditions is ...

02/16/2018 - Articulatory information and Multiview Features for Large Vocabulary Continuous Speech Recognition
This paper explores the use of multi-view features and their discriminat...

02/07/2018 - Joint Modeling of Accents and Acoustics for Multi-Accent Speech Recognition
The performance of automatic speech recognition systems degrades with in...

12/02/2020 - Enhancement of Spatial Clustering-Based Time-Frequency Masks using LSTM Neural Networks
Recent works have shown that Deep Recurrent Neural Networks using the LS...

06/12/2023 - Multi-View Frequency-Attention Alternative to CNN Frontends for Automatic Speech Recognition
Convolutional frontends are a typical choice for Transformer-based autom...

09/05/2020 - A multi-view approach for Mandarin non-native mispronunciation verification
Traditionally, the performance of non-native mispronunciation verificati...
