Self-Distillation Network with Ensemble Prototypes: Learning Robust Speaker Representations without Supervision

08/05/2023

Training speaker-discriminative and robust speaker verification systems without speaker labels remains challenging and worthwhile to explore. Previous studies have noted a substantial performance gap between self-supervised and fully supervised approaches. In this paper, we propose an effective Self-Distillation network with Ensemble Prototypes (SDEP) to facilitate self-supervised speaker representation learning. A range of experiments conducted on the VoxCeleb datasets demonstrates the superiority of the SDEP framework in speaker verification. SDEP achieves a new state-of-the-art (SOTA) result on the VoxCeleb1 speaker verification evaluation benchmark (i.e., equal error rates of 1.94%, 1.99%, and 3.77% on the Vox1-O, Vox1-E, and Vox1-H trials, respectively) without using any speaker labels in the training phase. Code will be publicly available at https://github.com/alibaba-damo-academy/3D-Speaker.
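
For readers unfamiliar with the reported metric, the following is a minimal sketch of how an equal error rate (EER) can be computed from verification trial scores. It is not taken from the SDEP code base: the function name `equal_error_rate`, the threshold sweep, and the toy scores are all assumptions chosen for illustration, and real evaluation toolkits typically interpolate between operating points rather than picking the nearest one.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Approximate the EER of a set of verification trials.

    scores: similarity score per trial (higher means "same speaker").
    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    Hypothetical helper for illustration only.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)

    # Sweep every observed score as a decision threshold;
    # a trial is accepted when its score is >= the threshold.
    thresholds = np.unique(scores)
    # False acceptance rate: non-target trials that are accepted.
    far = np.array([(scores[~labels] >= t).mean() for t in thresholds])
    # False rejection rate: target trials that are rejected.
    frr = np.array([(scores[labels] < t).mean() for t in thresholds])

    # The EER is the operating point where the two error rates meet;
    # here we take the closest threshold and average the two rates.
    idx = np.argmin(np.abs(far - frr))
    return float((far[idx] + frr[idx]) / 2.0)


if __name__ == "__main__":
    # Toy trial list (made-up scores, not from the paper).
    toy_scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
    toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
    print(f"EER = {equal_error_rate(toy_scores, toy_labels):.2%}")
```

Under this definition, an EER of 1.94% on Vox1-O means that at the threshold where the two error types balance, roughly 1.94% of same-speaker trials are rejected and roughly 1.94% of different-speaker trials are accepted.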
