Speaker-aware speech-transformer

01/02/2020
by   Zhiyun Fan, et al.
0

Recently, end-to-end (E2E) models become a competitive alternative to the conventional hybrid automatic speech recognition (ASR) systems. However, they still suffer from speaker mismatch in training and testing condition. In this paper, we use Speech-Transformer (ST) as the study platform to investigate speaker aware training of E2E models. We propose a model called Speaker-Aware Speech-Transformer (SAST), which is a standard ST equipped with a speaker attention module (SAM). The SAM has a static speaker knowledge block (SKB) that is made of i-vectors. At each time step, the encoder output attends to the i-vectors in the block, and generates a weighted combined speaker embedding vector, which helps the model to normalize the speaker variations. The SAST model trained in this way becomes independent of specific training speakers and thus generalizes better to unseen testing speakers. We investigate different factors of SAM. Experimental results on the AISHELL-1 task show that SAST achieves a relative 6.5 baseline. Moreover, we demonstrate that SAST still works quite well even if the i-vectors in SKB all come from a different data source other than the acoustic training set.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
08/07/2020

Investigation of Speaker-adaptation methods in Transformer based ASR

End-to-end models are fast replacing conventional hybrid models in autom...
research
10/16/2021

A Unified Speaker Adaptation Approach for ASR

Transformer models have been used in automatic speech recognition (ASR) ...
research
06/26/2022

Improving the Training Recipe for a Robust Conformer-based Hybrid Model

Speaker adaptation is important to build robust automatic speech recogni...
research
03/30/2022

Generation of Speaker Representations Using Heterogeneous Training Batch Assembly

In traditional speaker diarization systems, a well-trained speaker model...
research
06/14/2021

End-to-end Neural Diarization: From Transformer to Conformer

We propose a new end-to-end neural diarization (EEND) system that is bas...
research
04/07/2017

Joint Probabilistic Linear Discriminant Analysis

Standard probabilistic discriminant analysis (PLDA) for speaker recognit...
research
12/14/2021

End-to-end speaker diarization with transformer

Speaker diarization is connected to semantic segmentation in computer vi...

Please sign up or login with your details

Forgot password? Click here to reset