MACCIF-TDNN: Multi aspect aggregation of channel and context interdependence features in TDNN-based speaker verification

by Fangyuan Wang, et al.

Most recent state-of-the-art results in speaker verification are achieved by the X-vector and its subsequent variants. In this paper, we propose a new network architecture, based on the Time Delay Neural Network (TDNN), that aggregates channel and context interdependence features from multiple aspects. First, we use SE-Res2Blocks, as in ECAPA-TDNN, to explicitly model channel interdependence for adaptive calibration of channel features, and to process local context features in a multi-scale way at a more granular level than conventional TDNN-based methods. Second, we explore using the encoder structure of the Transformer to model global context interdependence features at the utterance level, which can better capture long-term temporal characteristics. Before the pooling layer, we aggregate the outputs of the SE-Res2Blocks and the Transformer encoder to leverage the complementary channel and context interdependence features that each of them learns. Finally, instead of performing a single attentive statistics pooling, we find it beneficial to extend the pooling method in a multi-head way, which can discriminate features from multiple aspects. The proposed MACCIF-TDNN architecture can outperform most state-of-the-art TDNN-based systems on the VoxCeleb1 test sets.
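The multi-head extension of attentive statistics pooling mentioned in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. It assumes one common formulation in which the channels are split into equal groups, each group gets its own attention weights over time, and the per-group weighted means and standard deviations are concatenated. The function names and the linear scoring vectors in `head_weights` are hypothetical stand-ins for a learned attention network.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_stats(frames, weights):
    """Attention-weighted mean and std over time for one head.

    frames:  T x C nested list of frame-level features.
    weights: length-C scoring vector (hypothetical; a trained model
             would use a small learned network to score each frame).
    Returns a list of 2*C values: C weighted means, then C weighted stds.
    """
    # One scalar attention score per frame, normalised over time.
    scores = [sum(w * x for w, x in zip(weights, frame)) for frame in frames]
    alpha = softmax(scores)
    n_ch = len(frames[0])
    mean = [sum(a * f[c] for a, f in zip(alpha, frames)) for c in range(n_ch)]
    var = [
        sum(a * f[c] ** 2 for a, f in zip(alpha, frames)) - mean[c] ** 2
        for c in range(n_ch)
    ]
    # Clamp to a small positive floor so rounding cannot give sqrt(<0).
    std = [math.sqrt(max(v, 1e-12)) for v in var]
    return mean + std


def multi_head_attentive_stats(frames, head_weights):
    """Assumed multi-head variant: split channels into equal groups,
    pool each group with its own attention, and concatenate the results."""
    n_heads = len(head_weights)
    n_ch = len(frames[0])
    if n_ch % n_heads != 0:
        raise ValueError("channel count must be divisible by the head count")
    width = n_ch // n_heads
    pooled = []
    for h, weights in enumerate(head_weights):
        group = [f[h * width:(h + 1) * width] for f in frames]
        pooled.extend(attentive_stats(group, weights))
    return pooled


# Two frames, four channels, two heads with all-zero scoring vectors,
# so both heads fall back to uniform attention over time.
frames = [[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 0.0]]
embedding = multi_head_attentive_stats(frames, [[0.0, 0.0], [0.0, 0.0]])
print(len(embedding))
```

Because each head computes its own attention distribution, different channel groups can emphasise different frames, which is the intuition behind pooling "from multiple aspects". With a single head the sketch reduces to ordinary attentive statistics pooling.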




ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification

Current speaker verification techniques rely on a neural network to extr...

S-vectors: Speaker Embeddings based on Transformer's Encoder for Text-Independent Speaker Verification

X-vectors have become the standard for speaker-embeddings in automatic s...

Poformer: A simple pooling transformer for speaker verification

Most recent speaker verification systems are based on extracting speaker...

Phoneme-aware and Channel-wise Attentive Learning for Text Dependent Speaker Verification

This paper proposes a multi-task learning network with phoneme-aware and...

TitaNet: Neural Model for speaker representation with 1D Depth-wise separable convolutions and global context

In this paper, we propose TitaNet, a novel neural network architecture f...

P-vectors: A Parallel-Coupled TDNN/Transformer Network for Speaker Verification

Typically, the Time-Delay Neural Network (TDNN) and Transformer can serv...

CAM++: A Fast and Efficient Network for Speaker Verification Using Context-Aware Masking

Time delay neural network (TDNN) has been proven to be efficient for spe...
