Personalization of CTC Speech Recognition Models

10/18/2022
by Saket Dingliwal, et al.

End-to-end speech recognition models trained with a joint Connectionist Temporal Classification (CTC)-attention loss have gained popularity recently. In these models, a non-autoregressive CTC decoder is often used at inference time because of its speed and simplicity. However, such models are hard to personalize: their conditional independence assumption prevents output tokens from previous time steps from influencing future predictions. To address this, we propose a novel two-way approach that first biases the encoder with attention over a predefined list of rare long-tail and out-of-vocabulary (OOV) words, and then uses dynamic boosting and a phone alignment network during decoding to further bias the subword predictions. We evaluate our approach on the open-source VoxPopuli dataset and an in-house medical dataset, showing a 60% improvement in F1 score on domain-specific rare words over a strong CTC baseline.
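To give a feel for the decoding-time half of the approach, the following is a minimal, hypothetical sketch of dynamic boosting: bias phrases (as subword-id sequences) are stored in a trie, and during CTC decoding any token that extends a phrase from the bias list receives a log-probability bonus. All names and the bonus value are illustrative; the paper's full method additionally biases the encoder with attention and uses a phone alignment network, neither of which is shown here.

```python
# Illustrative sketch of trie-based dynamic boosting for CTC decoding.
# Not the authors' implementation; a simplified stand-in for the idea of
# boosting subword tokens that continue a bias phrase.
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    children: dict = field(default_factory=dict)
    is_end: bool = False  # marks the end of a complete bias phrase


def build_trie(phrases):
    """Build a subword trie from bias phrases (each a list of subword ids)."""
    root = TrieNode()
    for phrase in phrases:
        node = root
        for tok in phrase:
            node = node.children.setdefault(tok, TrieNode())
        node.is_end = True
    return root


def boost_scores(log_probs, trie_state, bonus=2.0):
    """Return a copy of token log-probs where tokens that continue a bias
    phrase from the current trie state get an additive `bonus`."""
    boosted = dict(log_probs)
    for tok in trie_state.children:
        if tok in boosted:
            boosted[tok] += bonus
    return boosted


def advance(trie_state, root, tok):
    """Advance the trie state after emitting `tok`; restart a match from the
    root (or reset entirely) when `tok` does not continue any phrase."""
    if tok in trie_state.children:
        return trie_state.children[tok]
    return root.children[tok] if tok in root.children else root
```

In a beam-search decoder, each hypothesis would carry its own trie state, call `boost_scores` before extending the beam, and `advance` after committing a token; subtracting the bonus when a partial match dies out is a common refinement omitted here for brevity.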

