Detection of Cross-Dataset Fake Audio Based on Prosodic and Pronunciation Features

by Chenglong Wang, et al.

Existing fake audio detection systems perform well in in-domain testing but still face many challenges in out-of-domain testing. This stems from the mismatch between training and test data and from the poor generalizability of features extracted from limited views. To address this, we propose multi-view features for fake audio detection, which aim to capture more generalized features from the prosodic, pronunciation, and wav2vec views. Specifically, for the prosodic features, phoneme duration features are extracted from a model pre-trained on a large amount of speech data. For the pronunciation features, a Conformer-based phoneme recognition model is first trained, and its acoustic encoder is then kept as a deep embedding feature extractor. Furthermore, the prosodic and pronunciation features are fused with wav2vec features through an attention mechanism to improve the generalization of fake audio detection models. Results show that the proposed approach achieves significant performance gains in several cross-dataset experiments.
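
The abstract does not specify the form of the attention mechanism, so the following is only a minimal sketch of one common choice, additive attention over per-utterance view embeddings, and not the authors' implementation. The function name `attention_fuse`, the parameters `w`, `b`, `v`, the embedding sizes, and the random stand-in vectors for the prosodic, pronunciation, and wav2vec features are all assumptions made for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / np.sum(e)


def attention_fuse(views, w, b, v):
    """Fuse view embeddings with additive attention (illustrative only).

    views: (num_views, dim) array, one embedding per feature view
    w:     (dim, hidden) projection matrix (would be learned in practice)
    b:     (hidden,) bias
    v:     (hidden,) scoring vector
    Returns the fused (dim,) embedding and the (num_views,) attention weights.
    """
    scores = np.tanh(views @ w + b) @ v  # one scalar score per view
    weights = softmax(scores)            # weights are positive and sum to 1
    fused = weights @ views              # weighted sum of the view embeddings
    return fused, weights


# Random stand-ins for the three views; real features would come from the
# duration model, the Conformer encoder, and wav2vec respectively.
rng = np.random.default_rng(0)
dim, hidden = 8, 4
prosodic = rng.normal(size=dim)
pronunciation = rng.normal(size=dim)
wav2vec = rng.normal(size=dim)
views = np.stack([prosodic, pronunciation, wav2vec])

# Hypothetical parameters, randomly initialised here instead of trained.
w = rng.normal(size=(dim, hidden))
b = np.zeros(hidden)
v = rng.normal(size=hidden)

fused, weights = attention_fuse(views, w, b, v)
```

In this sketch the fused vector is a convex combination of the three views, so the weights can be read as how much each view contributes for a given utterance. A trained system would learn `w`, `b`, and `v` jointly with the detection classifier and would typically project the views to a shared dimension first.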




An Audio-Visual Attention Based Multimodal Network for Fake Talking Face Videos Detection

DeepFake based digital facial forgery is threatening the public media se...

An Initial Investigation for Detecting Vocoder Fingerprints of Fake Audio

Many effective attempts have been made for fake audio detection. However...

Fully Automated End-to-End Fake Audio Detection

The existing fake audio detection systems often rely on expert experienc...

Estimation and Model Misspecification: Fake and Missing Features

We consider estimation under model misspecification where there is a mod...

Audio Deepfake Detection Based on a Combination of F0 Information and Real Plus Imaginary Spectrogram Features

Recently, pioneer research works have proposed a large number of acousti...

TO-Rawnet: Improving RawNet with TCN and Orthogonal Regularization for Fake Audio Detection

Current fake audio detection relies on hand-crafted features, which lose...

Spec-ResNet: A General Audio Steganalysis scheme based on Deep Residual Network of Spectrogram

The widespread application of audio and video communication technology m...
