Leveraging In-the-Wild Data for Effective Self-Supervised Pretraining in Speaker Recognition

09/21/2023
by Shuai Wang, et al.

Current speaker recognition systems primarily rely on supervised approaches and are therefore constrained by the scale of available labeled datasets. To boost performance, researchers leverage large pretrained models such as WavLM, transferring the learned high-level features to the downstream speaker recognition task. However, this approach introduces extra parameters, since the pretrained model must be retained at inference time. Another line of work applies self-supervised methods such as DINO directly to speaker embedding learning, but its potential on large-scale in-the-wild datasets has not been explored. In this paper, we demonstrate the effectiveness of DINO training on the large-scale WenetSpeech dataset and its transferability in enhancing supervised system performance on the CNCeleb dataset. Additionally, we introduce a confidence-based data filtering algorithm that removes unreliable data from the pretraining set, yielding better performance with less training data. The associated pretrained models, confidence files, and pretraining and fine-tuning scripts will be made available in the Wespeaker toolkit.
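Two techniques mentioned in the abstract can be made concrete with short sketches. First, DINO-style self-supervised training: a student network is trained to match the centered, sharpened output distribution of an exponential-moving-average teacher across different augmented crops of the same utterance. The PyTorch sketch below is a minimal, hypothetical rendition of that loss; the function name, temperature values, and crop handling are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between sharpened teacher targets and student outputs.

    student_out / teacher_out: lists of (batch, dim) logit tensors,
    one per augmented crop (speech segment) of the same utterance.
    `center` is a running mean of teacher outputs, used to avoid collapse.
    """
    # Teacher targets: centered, sharpened, and detached from the graph.
    targets = [F.softmax((t - center) / teacher_temp, dim=-1).detach()
               for t in teacher_out]
    total, n_terms = 0.0, 0
    for ti, t in enumerate(targets):
        for si, s in enumerate(student_out):
            if si == ti:  # skip comparing a crop with itself
                continue
            log_p = F.log_softmax(s / student_temp, dim=-1)
            total = total + torch.sum(-t * log_p, dim=-1).mean()
            n_terms += 1
    return total / n_terms
```

Second, the confidence-based data filtering step: each pretraining utterance is assigned a confidence score and the least reliable portion is discarded. The abstract does not specify how scores are computed, so the following is only a sketch under the assumption that higher scores indicate more reliable data; `filter_by_confidence` and `keep_ratio` are hypothetical names.

```python
def filter_by_confidence(utterances, confidences, keep_ratio=0.8):
    """Keep the keep_ratio fraction of utterances with the highest scores."""
    ranked = sorted(zip(utterances, confidences),
                    key=lambda pair: pair[1], reverse=True)
    n_keep = int(len(ranked) * keep_ratio)
    return [utt for utt, _ in ranked[:n_keep]]
```

In the paper's setting, the released confidence files would presumably supply the per-utterance scores; the generic part is the rank-and-threshold step above.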


