On Modular Training of Neural Acoustics-to-Word Model for LVCSR

03/03/2018
by Zhehuai Chen, et al.

End-to-end (E2E) automatic speech recognition (ASR) systems directly map acoustics to words using a unified model. Previous work mostly focuses on E2E training of a single model that integrates the acoustic and language models into a whole. Although E2E training benefits from sequence modeling and a simplified decoding pipeline, it usually requires a large amount of transcribed acoustic data, and traditional acoustic and language modeling techniques cannot be utilized. In this paper, a novel modular training framework for E2E ASR is proposed that trains the neural acoustic and language models separately during the training stage, while still performing end-to-end inference in the decoding stage. Here, an acoustics-to-phoneme (A2P) model and a phoneme-to-word (P2W) model are trained using acoustic data and text data, respectively. A phone synchronous decoding (PSD) module is inserted between A2P and P2W to reduce sequence lengths without precision loss. Finally, the modules are integrated into an acoustics-to-word (A2W) model and jointly optimized using acoustic data to retain the advantage of sequence modeling. Experiments on a 300-hour Switchboard task show significant improvement over the direct A2W model. Training and decoding efficiency also benefit from the proposed method.
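As a rough illustration of how a PSD-style module can shorten the sequence passed from A2P to P2W, the sketch below drops frames whose posterior mass sits almost entirely on a blank symbol. It assumes the A2P model emits per-frame posteriors with a CTC-style blank; the abstract does not state this, and the function name, blank index, and threshold are hypothetical choices for illustration, not the authors' implementation.

```python
from typing import List, Sequence


def drop_blank_frames(
    posteriors: Sequence[Sequence[float]],
    blank_index: int = 0,          # assumed position of the blank symbol
    blank_threshold: float = 0.9,  # assumed cut-off, chosen for illustration
) -> List[List[float]]:
    """Keep only frames whose blank posterior is below the threshold."""
    return [
        list(frame)
        for frame in posteriors
        if frame[blank_index] < blank_threshold
    ]


# Toy A2P output: 5 frames over the symbols [blank, phoneme_a, phoneme_b].
frames = [
    [0.98, 0.01, 0.01],  # blank-dominated, dropped
    [0.10, 0.85, 0.05],  # phoneme_a, kept
    [0.95, 0.03, 0.02],  # blank-dominated, dropped
    [0.97, 0.02, 0.01],  # blank-dominated, dropped
    [0.05, 0.05, 0.90],  # phoneme_b, kept
]

reduced = drop_blank_frames(frames)
print(len(frames), "->", len(reduced))
```

In this toy case the P2W input shrinks from five frames to two, while the frames carrying phoneme evidence are passed through unchanged.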


research
07/31/2020

Modular End-to-end Automatic Speech Recognition Framework for Acoustic-to-word Model

End-to-end (E2E) systems have played a more and more important role in a...
research
09/22/2019

Improving OOV Detection and Resolution with External Language Models in Acoustic-to-Word ASR

Acoustic-to-word (A2W) end-to-end automatic speech recognition (ASR) sys...
research
04/19/2021

Acoustic Data-Driven Subword Modeling for End-to-End Speech Recognition

Subword units are commonly used for end-to-end automatic speech recognit...
research
03/30/2017

Simplified End-to-End MMI Training and Voting for ASR

A simplified speech recognition system that uses the maximum mutual info...
research
08/15/2017

Comparison of Decoding Strategies for CTC Acoustic Models

Connectionist Temporal Classification has recently attracted a lot of in...
research
05/15/2020

Context-Dependent Acoustic Modeling without Explicit Phone Clustering

Phoneme-based acoustic modeling of large vocabulary automatic speech rec...
research
12/01/2021

Predicting lexical skills from oral reading with acoustic measures

Literacy assessment is an important activity for education administrator...
