Learning Speaker Embedding with Momentum Contrast

01/07/2020
by Ke Ding, et al.

Speaker verification can be formulated as a representation learning task, in which speaker-discriminative embeddings are extracted from utterances of variable length. Momentum Contrast (MoCo) is a recently proposed unsupervised representation learning framework that has proven effective at learning feature representations for downstream vision tasks. In this work, we apply MoCo to learn speaker embeddings from speech segments. We explore MoCo in both unsupervised learning and pretraining settings. In the unsupervised scenario, embeddings are learned by MoCo from audio data without using any speaker-specific information. On a large-scale dataset of 2,500 speakers, MoCo achieves an equal error rate (EER) of 4.275% when trained without supervision, and the EER decreases further to 3.58% when extra unlabelled data are used. In the pretraining scenario, an encoder trained by MoCo is used to initialize the downstream supervised training. With finetuning of the MoCo-pretrained model, the EER is reduced by 13.7% relative (from 1.44% to 1.242%) compared to a carefully tuned baseline trained from scratch. A comparative study confirms the effectiveness of MoCo in learning good speaker embeddings.
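The two ingredients of MoCo that the abstract relies on are a momentum-updated key encoder and a contrastive (InfoNCE) loss computed against a queue of negative keys. The following is a minimal NumPy sketch of those two pieces, not the paper's implementation: function names, shapes, and the default momentum/temperature values are illustrative assumptions.

```python
import numpy as np

def momentum_update(key_params, query_params, m=0.999):
    """MoCo key-encoder update: the key encoder's parameters are an
    exponential moving average of the query encoder's parameters.
    m=0.999 is a commonly used momentum value, assumed here."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

def info_nce_loss(q, k_pos, queue, temperature=0.07):
    """Contrastive loss for one query.

    q       : (d,)  query embedding (e.g. from one speech segment)
    k_pos   : (d,)  positive key (another view of the same utterance)
    queue   : (K, d) negative keys held in the MoCo queue
    All embeddings are assumed L2-normalised."""
    l_pos = q @ k_pos                         # scalar positive logit
    l_neg = queue @ q                         # (K,) negative logits
    logits = np.concatenate(([l_pos], l_neg)) / temperature
    logits -= logits.max()                    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[0]                      # positive key is class 0
```

In the pretraining scenario the abstract describes, an encoder trained this way would simply provide the initial weights for the supervised speaker-classification training, which is then finetuned as usual.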
