C3-DINO: Joint Contrastive and Non-contrastive Self-Supervised Learning for Speaker Verification

by   Chunlei Zhang, et al.

Self-supervised learning (SSL) has drawn an increased attention in the field of speech processing. Recent studies have demonstrated that contrastive learning is able to learn discriminative speaker embeddings in a self-supervised manner. However, base contrastive self-supervised learning (CSSL) assumes that the pairs generated from a view of anchor instance and any view of other instances are all negative, which introduces many false negative pairs in constructing the loss function. The problem is referred as class-collision, which remains as one major issue that impedes the CSSL based speaker verification (SV) systems from achieving better performances. In the meanwhile, studies reveal that negative sample free SSL frameworks perform well in learning speaker or image representations. In this study, we investigate SSL techniques that lead to an improved SV performance. We first analyse the impact of false negative pairs in the CSSL systems. Then, a multi-stage Class-Collision Correction (C3) method is proposed, which leads to the state-of-the-art CSSL based speaker embedding system. On the basis of the pretrained CSSL model, we further propose to employ a negative sample free SSL objective (i.e., DINO) to fine-tune the speaker embedding network. The resulting speaker embedding system (C3-DINO) achieves 2.5 Cosine Distance Scoring method on Voxceleb1 test set, which outperforms the previous SOTA SSL system (4.86 With speaker clustering and pseudo labeling on Voxceleb2 training set, a LDA/CDS back-end applying on the C3-DINO speaker embeddings is able to further push the EER to 2.2 benchmarks and our internal dataset demonstrate the effectiveness of our proposed methods, and the performance gap between the SSL SV and the supervised counterpart narrows further.


Non-Contrastive Self-Supervised Learning of Utterance-Level Speech Representations

Considering the abundance of unlabeled speech data and the high labeling...

Self-supervised Text-independent Speaker Verification using Prototypical Momentum Contrastive Learning

In this study, we investigate self-supervised representation learning fo...

Pushing the limits of self-supervised speaker verification using regularized distillation framework

Training robust speaker verification systems without speaker labels has ...

Self-Supervised Training of Speaker Encoder with Multi-Modal Diverse Positive Pairs

We study a novel neural architecture and its training strategies of spea...

Label-Efficient Self-Supervised Speaker Verification With Information Maximization and Contrastive Learning

State-of-the-art speaker verification systems are inherently dependent o...

The JHU submission to VoxSRC-21: Track 3

This technical report describes Johns Hopkins University speaker recogni...

Curriculum Learning Meets Weakly Supervised Modality Correlation Learning

In the field of multimodal sentiment analysis (MSA), a few studies have ...

Please sign up or login with your details

Forgot password? Click here to reset