VarArray Meets t-SOT: Advancing the State of the Art of Streaming Distant Conversational Speech Recognition

09/12/2022
by   Naoyuki Kanda, et al.
6

This paper presents a novel streaming automatic speech recognition (ASR) framework for multi-talker overlapping speech captured by a distant microphone array with an arbitrary geometry. Our framework, named t-SOT-VA, capitalizes on independently developed two recent technologies; array-geometry-agnostic continuous speech separation, or VarArray, and streaming multi-talker ASR based on token-level serialized output training (t-SOT). To combine the best of both technologies, we newly design a t-SOT-based ASR model that generates a serialized multi-talker transcription based on two separated speech signals from VarArray. We also propose a pre-training scheme for such an ASR model where we simulate VarArray's output signals based on monaural single-talker ASR training data. Conversation transcription experiments using the AMI meeting corpus show that the system based on the proposed framework significantly outperforms conventional ones. Our system achieves the state-of-the-art word error rates of 13.7 respectively, in the multiple-distant-microphone setting while retaining the streaming inference capability.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
02/02/2022

Streaming Multi-Talker ASR with Token-Level Serialized Output Training

This paper proposes a token-level serialized output training (t-SOT), a ...
research
03/31/2021

Large-Scale Pre-Training of End-to-End Multi-Talker ASR for Meeting Transcription with Single Distant Microphone

Transcribing meetings containing overlapped speech with only a single di...
research
04/07/2022

Leveraging Real Conversational Data for Multi-Channel Continuous Speech Separation

Existing multi-channel continuous speech separation (CSS) models are hea...
research
10/27/2022

Simulating realistic speech overlaps improves multi-talker ASR

Multi-talker automatic speech recognition (ASR) has been studied to gene...
research
06/14/2020

The JHU Multi-Microphone Multi-Speaker ASR System for the CHiME-6 Challenge

This paper summarizes the JHU team's efforts in tracks 1 and 2 of the CH...
research
06/23/2023

The CHiME-7 DASR Challenge: Distant Meeting Transcription with Multiple Devices in Diverse Scenarios

The CHiME challenges have played a significant role in the development a...
research
09/14/2023

DiariST: Streaming Speech Translation with Speaker Diarization

End-to-end speech translation (ST) for conversation recordings involves ...

Please sign up or login with your details

Forgot password? Click here to reset