End to End ASR System with Automatic Punctuation Insertion

by Yushi Guan, et al.

Recent Automatic Speech Recognition (ASR) systems have been moving towards end-to-end architectures whose components can be trained jointly. Several recently proposed techniques have enabled this trend, including feature extraction with CNNs, context capturing and acoustic feature modeling with RNNs, automatic alignment of input and output sequences using Connectionist Temporal Classification (CTC), and the replacement of traditional n-gram language models with RNN language models. Historically, there has been considerable interest in automatic punctuation of text and of speech-to-text output. However, there seems to be little interest in incorporating automatic punctuation into the emerging neural-network-based end-to-end speech recognition systems, partly due to the lack of English speech corpora with punctuated transcripts. In this study, we propose a method to generate punctuated transcripts for the TEDLIUM dataset using transcripts available from ted.com. We also propose an end-to-end ASR system that outputs words and punctuation marks concurrently from speech signals. By combining Damerau-Levenshtein distance and slot error rate into DLev-SER, we enable measurement of punctuation error rate when the hypothesis text is not perfectly aligned with the reference. Compared with previous methods, our model reduces slot error rate from 0.497 to 0.341.
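The abstract does not spell out how DLev-SER is defined, so the following is only a minimal sketch of the general idea: align the hypothesis with the reference using an edit-distance alignment, then count punctuation slot errors over the aligned pairs. It assumes that words and punctuation marks are separate tokens, that the punctuation set is the small fixed set `PUNCT`, that the restricted (optimal string alignment) variant of Damerau-Levenshtein distance is used, and that slot error rate is (substitutions + deletions + insertions) divided by the number of reference punctuation slots. The function names `align` and `punctuation_slot_error_rate` are hypothetical and are not taken from the paper.

```python
# Assumption: an illustrative punctuation inventory, not the paper's.
PUNCT = {",", ".", "?", "!"}


def align(ref, hyp):
    """Align two token lists with restricted Damerau-Levenshtein distance.

    Returns (distance, pairs), where each pair is (ref_token, hyp_token)
    and None marks a gap (a deletion or an insertion).
    """
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
            if (i > 1 and j > 1 and ref[i - 1] == hyp[j - 2]
                    and ref[i - 2] == hyp[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition

    # Backtrace from the bottom-right cell to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1]
                + (0 if ref[i - 1] == hyp[j - 1] else 1)):
            pairs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif (i > 1 and j > 1 and ref[i - 1] == hyp[j - 2]
                and ref[i - 2] == hyp[j - 1]
                and d[i][j] == d[i - 2][j - 2] + 1):
            # Swapped neighbours: pair each token with its equal counterpart.
            pairs.append((ref[i - 1], hyp[j - 2]))
            pairs.append((ref[i - 2], hyp[j - 1]))
            i, j = i - 2, j - 2
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            pairs.append((ref[i - 1], None))
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))
            j -= 1
    pairs.reverse()
    return d[n][m], pairs


def punctuation_slot_error_rate(ref, hyp):
    """Slot error rate over punctuation tokens after alignment."""
    _, pairs = align(ref, hyp)
    sub = dele = ins = 0
    for r, h in pairs:
        r_is_punct, h_is_punct = r in PUNCT, h in PUNCT
        if r_is_punct and h_is_punct:
            sub += r != h   # wrong punctuation mark in the right slot
        elif r_is_punct:
            dele += 1       # reference punctuation missing from hypothesis
        elif h_is_punct:
            ins += 1        # hypothesis punctuation with no reference slot
    n_slots = sum(t in PUNCT for t in ref)
    # Assumption: return 0.0 when the reference has no punctuation slots.
    return (sub + dele + ins) / n_slots if n_slots else 0.0


ref = "hello , how are you ?".split()
hyp = "hello how are you .".split()
print(punctuation_slot_error_rate(ref, hyp))  # 1.0: one deletion, one substitution, two slots
```

Aligning first is what lets the punctuation errors be counted even when the recognized words differ from the reference; without the alignment step, slot positions in the two sequences could not be compared directly.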




