Improving Multi-Scale Aggregation Using Feature Pyramid Module for Robust Speaker Verification of Variable-Duration Utterances

04/07/2020
by Youngmoon Jung, et al.

Deep speaker embedding learning is currently the most widely used approach to speaker verification. In this approach, a speaker embedding vector is obtained by pooling single-scale features extracted from the last layer of a speaker feature extractor. Multi-scale aggregation (MSA), which uses multi-scale features from different layers of the feature extractor, was recently introduced and shows superior performance on variable-duration utterances. To increase robustness to utterances of arbitrary duration, this paper improves MSA with a feature pyramid module. The module enhances the speaker-discriminative information in features from multiple layers via a top-down pathway and lateral connections. Speaker embeddings are then extracted from the enhanced features, which contain rich speaker information at different time scales. Experiments on the VoxCeleb dataset show that the proposed module improves on previous MSA methods while using fewer parameters. It also achieves better performance than state-of-the-art approaches for both short and long utterances.
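
The top-down pathway and lateral connections described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It is a simplified, pure-Python toy that treats each layer's output as a (time, channels) matrix, whereas the paper's extractor presumably produces 2-D feature maps. The function names (`lateral`, `upsample`, `top_down`, `embed`), the nearest-neighbour upsampling, the element-wise addition, and the mean pooling are all assumptions made for illustration, and the smoothing convolution used in image feature pyramid networks is omitted.

```python
from typing import List

Matrix = List[List[float]]  # assumed layout: (time, channels)


def lateral(x: Matrix, w: Matrix) -> Matrix:
    """Lateral connection: per-frame linear map (a 1x1 convolution)
    that projects every level to a shared channel size."""
    return [[sum(f[i] * w[i][j] for i in range(len(f)))
             for j in range(len(w[0]))] for f in x]


def upsample(x: Matrix, target_len: int) -> Matrix:
    """Nearest-neighbour upsampling along the time axis."""
    return [x[t * len(x) // target_len] for t in range(target_len)]


def top_down(features: List[Matrix], weights: List[Matrix]) -> List[Matrix]:
    """Enhance each level with information from the coarser level above it.
    `features` is ordered from shallow (many frames) to deep (few frames)."""
    laterals = [lateral(x, w) for x, w in zip(features, weights)]
    enhanced = [laterals[-1]]  # the deepest level has nothing above it
    for lat in reversed(laterals[:-1]):
        up = upsample(enhanced[0], len(lat))
        merged = [[a + b for a, b in zip(fl, fu)] for fl, fu in zip(lat, up)]
        enhanced.insert(0, merged)
    return enhanced


def embed(levels: List[Matrix]) -> List[float]:
    """Mean-pool every level over time and concatenate the results."""
    out: List[float] = []
    for x in levels:
        out.extend(sum(col) / len(x) for col in zip(*x))
    return out


if __name__ == "__main__":
    shallow = [[1.0], [2.0], [3.0], [4.0]]   # 4 frames, 1 channel
    deep = [[1.0, 0.0], [0.0, 1.0]]          # 2 frames, 2 channels
    w_shallow = [[1.0, 0.0]]                 # project 1 -> 2 channels
    w_deep = [[1.0, 0.0], [0.0, 1.0]]        # identity
    print(embed(top_down([shallow, deep], [w_shallow, w_deep])))
```

Because every level is pooled over time separately, the resulting vector has a fixed size regardless of how many frames the utterance contains, which is what makes this kind of aggregation applicable to variable-duration input.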

research
04/07/2020

Multi-Scale Aggregation Using Feature Pyramid Module for Text-Independent Speaker Verification

Currently, the most widely used approach for speaker verification is the...
research
05/16/2021

X-Vectors with Multi-Scale Aggregation for Speaker Diarization

Speaker diarization is the process of labeling different speakers in a s...
research
05/07/2020

Segment Aggregation for short utterances speaker verification using raw waveforms

Most studies on speaker verification systems focus on long-duration utte...
research
06/28/2023

MC-SpEx: Towards Effective Speaker Extraction with Multi-Scale Interfusion and Conditional Speaker Modulation

The previous SpEx+ has yielded outstanding performance in speaker extrac...
research
08/30/2021

RSKNet-MTSP: Effective and Portable Deep Architecture for Speaker Verification

The convolutional neural network (CNN) based approaches have shown great...
research
07/06/2020

ResNeXt and Res2Net Structure for Speaker Verification

ResNet-based architecture has been widely adopted as the speaker embeddi...
research
05/22/2023

An Enhanced Res2Net with Local and Global Feature Fusion for Speaker Verification

Effective fusion of multi-scale features is crucial for improving speake...
