End-to-end speech translation (ST) for conversation recordings involves ...
The convergence of text, visual, and audio data is a key step towards hu...
Several trade-offs need to be balanced when employing monaural speech se...
Self-supervised learning (SSL) methods such as WavLM have shown promisin...
Multi-talker automatic speech recognition (ASR) has been studied to gene...
This paper presents a novel streaming automatic speech recognition (ASR)...
This paper describes a speaker diarization model based on target speaker...
Human intelligence is multimodal; we integrate visual, linguistic, and a...
Existing multi-channel continuous speech separation (CSS) models are hea...
This paper presents a streaming speaker-attributed automatic speech reco...
This paper proposes token-level serialized output training (t-SOT), a ...
Multi-talker conversational speech processing has drawn much interest f...
Self-supervised learning (SSL) achieves great success in speech recognit...
Continuous speech separation (CSS) aims to separate overlapping voices f...
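As a rough illustration of the CSS idea mentioned above (not the specific model in that paper), the sketch below processes a long recording in overlapping windows with a placeholder chunk-level separator, aligns the channel order of adjacent windows on their overlapped region, and stitches the results by overlap-add. The window sizes, the two-output-channel assumption, and separate_fn are all hypothetical.

    import numpy as np

    def css_sliding_window(wav, separate_fn, sr=16000, win_s=2.4, hop_s=1.2):
        # Process a long waveform in overlapping windows; separate_fn is a
        # placeholder for any chunk-level separator returning (2, chunk_len).
        win, hop = int(win_s * sr), int(hop_s * sr)
        out = np.zeros((2, len(wav)))
        prev = None
        for start in range(0, len(wav), hop):
            chunk = wav[start:start + win]
            est = separate_fn(chunk)                      # shape (2, len(chunk))
            if prev is not None:
                ov = min(win - hop, est.shape[1], prev.shape[1] - hop)
                if ov > 0:
                    # Keep or swap channels, whichever better matches the
                    # previous window on the overlapped samples.
                    keep = np.sum(prev[:, hop:hop + ov] * est[:, :ov])
                    swap = np.sum(prev[::-1, hop:hop + ov] * est[:, :ov])
                    if swap > keep:
                        est = est[::-1]
            # Simple overlap-add stitching (window normalization omitted).
            out[:, start:start + est.shape[1]] += est
            prev = est
        return out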
Continuous speech separation using a microphone array was shown to be pr...
This paper presents Transcribe-to-Diarize, a new approach for neural spe...
Text-only adaptation of an end-to-end (E2E) model remains a challenging ...
Speaker-attributed automatic speech recognition (SA-ASR) is a task to re...
Speech separation has been successfully applied as a frontend processing...
Integrating external language models (LMs) into end-to-end (E2E) models ...
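For context on what LM integration with an E2E model typically looks like, the minimal sketch below shows log-linear score fusion for one candidate token during beam search: shallow fusion adds a weighted external LM score, and internal-LM-style corrections additionally subtract an estimated internal LM score. The weights are placeholder values, and this is the generic recipe rather than the specific method of the paper above.

    def fuse_scores(asr_logp, ext_lm_logp, ilm_logp=0.0,
                    lm_weight=0.3, ilm_weight=0.0):
        # Shallow fusion: score = log P_ASR + lm_weight * log P_LM.
        # With ilm_weight > 0, an estimate of the E2E model's internal LM
        # is subtracted as well (density-ratio / internal-LM-style correction).
        return asr_logp + lm_weight * ext_lm_logp - ilm_weight * ilm_logp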
This paper presents our recent effort on end-to-end speaker-attributed a...
In multi-talker scenarios such as meetings and conversations, speech pro...
Transcribing meetings containing overlapped speech with only a single di...
End-to-end (E2E) spoken language understanding (SLU) can infer semantics...
The efficacy of external language model (LM) integration with existing e...
Speaker diarization is a task to label audio or video recordings with cl...
An end-to-end (E2E) speaker-attributed automatic speech recognition (SA-...
End-to-end multi-talker speech recognition is an emerging research trend...
Joint optimization of multi-channel front-end and automatic speech recog...
Recently, an end-to-end speaker-attributed automatic speech recognition ...
Multi-speaker speech recognition of unsegmented recordings has diverse a...
External language model (LM) integration remains a challenging task...
Hybrid Autoregressive Transducer (HAT) is a recently proposed end-to-end...
This paper describes the Microsoft speaker diarization system for monaur...
Recently, an end-to-end (E2E) speaker-attributed automatic speech recogn...
We propose an end-to-end speaker-attributed automatic speech recognition...
This paper proposes serialized output training (SOT), a novel framework ...
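To make the serialization idea concrete, the sketch below builds a single SOT-style training target from per-utterance references by sorting them by start time (first-in-first-out) and joining them with a speaker-change token; the token spelling and the untokenized strings are illustrative simplifications.

    SC = "<sc>"  # speaker-change symbol separating utterances in the target

    def serialize_references(utterances):
        # utterances: list of (start_time, transcript) pairs for one mixture.
        ordered = sorted(utterances, key=lambda u: u[0])
        return f" {SC} ".join(text for _, text in ordered)

    # Two overlapping utterances collapse into one serialized target:
    serialize_references([(0.0, "how are you"), (1.3, "fine thank you")])
    # -> "how are you <sc> fine thank you"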
This paper investigates the use of target-speaker automatic speech recog...
Speaker diarization has been mainly developed based on the clustering of...
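As a point of reference for the clustering-based pipeline mentioned above (typically clustering of per-segment speaker embeddings), the sketch below groups pre-computed embeddings such as d-vectors or x-vectors with agglomerative clustering; the cosine distance, the threshold, and the use of scikit-learn (>= 1.2 for the metric argument) are assumptions for illustration.

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    def cluster_segments(embeddings, n_speakers=None, threshold=0.7):
        # embeddings: (N, D) array of per-segment speaker embeddings.
        # Each resulting cluster label is treated as one speaker.
        if n_speakers is not None:
            clusterer = AgglomerativeClustering(
                n_clusters=n_speakers, metric="cosine", linkage="average")
        else:  # unknown speaker count: cut the dendrogram at a distance threshold
            clusterer = AgglomerativeClustering(
                n_clusters=None, distance_threshold=threshold,
                metric="cosine", linkage="average")
        return clusterer.fit_predict(np.asarray(embeddings))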
In this paper, we propose a novel end-to-end neural-network-based speake...
In this paper, we propose a novel auxiliary loss function for target-spe...
In this paper, we present Hitachi and Paderborn University's joint effor...