A neural attention model for speech command recognition

This paper introduces a convolutional recurrent network with attention for speech command recognition. Attention models are powerful tools for improving performance on natural language, image captioning, and speech tasks. The proposed model establishes a new state-of-the-art accuracy of 94.1% on the Speech Commands dataset V1 and 94.5% on V2 (for the 20-commands recognition task), while still keeping a small footprint of only 202K trainable parameters. Results are compared with previous convolutional implementations on 5 different tasks: 20-commands recognition (V1 and V2), 12-commands recognition (V1), 35-word recognition (V1), and left-right (V1). We show detailed performance results and demonstrate that the proposed attention mechanism not only improves performance but also allows inspecting which regions of the audio the network took into consideration when outputting a given category.
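To illustrate the kind of attention pooling the abstract describes, here is a minimal numpy sketch of dot-product attention over recurrent hidden states. The function name, shapes, and query construction are illustrative assumptions, not the paper's exact formulation; the point is that the resulting weights form a distribution over time frames that can be inspected to see which audio regions influenced the prediction.

```python
import numpy as np

def attention_pool(hidden, query):
    """Dot-product attention over time steps (illustrative sketch).

    hidden: (T, D) recurrent outputs, one row per audio frame.
    query:  (D,)   query vector (learned in a real model).
    Returns the attention-weighted context vector and the weights;
    the weights can be plotted to see which frames were attended to.
    """
    scores = hidden @ query                # (T,) similarity per frame
    scores = scores - scores.max()         # stabilize the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    context = weights @ hidden             # (D,) weighted summary vector
    return context, weights

# Toy demo: frame 20 is made identical to the query, so the
# attention distribution should peak at that frame.
rng = np.random.default_rng(0)
hidden = rng.standard_normal((50, 64)) * 0.1
query = rng.standard_normal(64)
hidden[20] = query                         # inject a salient frame
context, weights = attention_pool(hidden, query)
print(weights.argmax())  # 20
```

In a full model the context vector would feed a small dense classifier over the command categories; here the demo only shows that the weights form a valid distribution that concentrates on the salient frame.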

