BA-SOT: Boundary-Aware Serialized Output Training for Multi-Talker ASR

by   Yuhao Liang, et al.

The recently proposed serialized output training (SOT) simplifies multi-talker automatic speech recognition (ASR) by generating speaker transcriptions separated by a special token. However, frequent speaker changes can make speaker change prediction difficult. To address this, we propose boundary-aware serialized output training (BA-SOT), which explicitly incorporates boundary knowledge into the decoder via a speaker change detection task and boundary constraint loss. We also introduce a two-stage connectionist temporal classification (CTC) strategy that incorporates token-level SOT CTC to restore temporal context information. Besides typical character error rate (CER), we introduce utterance-dependent character error rate (UD-CER) to further measure the precision of speaker change prediction. Compared to original SOT, BA-SOT reduces CER/UD-CER by 5.1 pre-trained ASR model for BA-SOT model initialization further reduces CER/UD-CER by 8.4


page 2

page 4


Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings

This paper presents a streaming speaker-attributed automatic speech reco...

Augmenting Transformer-Transducer Based Speaker Change Detection With Token-Level Training Loss

In this work we propose a novel token-based training strategy that impro...

ASR-Aware End-to-end Neural Diarization

We present a Conformer-based end-to-end neural diarization (EEND) model ...

Speaker and Language Change Detection using Wav2vec2 and Whisper

We investigate recent transformer networks pre-trained for automatic spe...

A Comparative Study on Speaker-attributed Automatic Speech Recognition in Multi-party Meetings

In this paper, we conduct a comparative study on speaker-attributed auto...

On Word Error Rate Definitions and their Efficient Computation for Multi-Speaker Speech Recognition Systems

We present a general framework to compute the word error rate (WER) of A...

Token-level Speaker Change Detection Using Speaker Difference and Speech Content via Continuous Integrate-and-fire

In multi-talker scenarios such as meetings and conversations, speech pro...

Please sign up or login with your details

Forgot password? Click here to reset