Frame-level multi-channel speaker verification with large-scale ad-hoc microphone arrays

10/12/2021
by   Chengdong Liang, et al.
0

Ad-hoc microphone arrays has recieved attention, in which the number and arrangement of microphones are unknown. Traditional multi-channel processing methods can not be directly used in ad-hoc. Recently, to solve this problem, an utterance-level ASV with ad-hoc microphone arrays has been proposed, which first extracts utterance-level speaker embeddings from each channel of an ad-hoc microphone array, and then fuses the embeddings for the final verification. However, this method cannot make full use of the cross-channel information. In this paper, we present a novel multi-channel ASV model at the frame-level. Specifically, we add spatio-temporal processing blocks (STB) before the pooling layer, which models the contextual relationship within and between channels and across time, respectively. The channel-attended outputs from STB are sent to the pooling layer to obtain an utterance-level speaker representation. Experimental results demonstrate the effectiveness of the proposed method.

READ FULL TEXT

Please sign up or login with your details

Forgot password? Click here to reset

Sign in with Google

×

Use your Google Account to sign in to DeepAI

×

Consider DeepAI Pro