Utilizing Self-supervised Representations for MOS Prediction

by Wei-Cheng Tseng, et al.

Speech quality assessment has been a critical issue in speech processing for decades. Existing automatic evaluations usually require clean references or parallel ground truth data, which becomes infeasible as the amount of data grows. Subjective tests, on the other hand, do not need any additional clean or parallel data and correlate better with human perception. However, such tests are expensive and time-consuming because crowd work is necessary. It is thus highly desirable to develop an automatic evaluation approach that correlates well with human perception while not requiring ground truth data. In this paper, we use self-supervised pre-trained models for MOS prediction. We show that their representations can distinguish between clean and noisy audio. Then, we fine-tune these pre-trained models, followed by simple linear layers, in an end-to-end manner. The experimental results show that our framework outperforms the two previous state-of-the-art models by a significant margin on Voice Conversion Challenge 2018 and achieves comparable or superior performance on Voice Conversion Challenge 2016. We also conduct an ablation study to further investigate how each module benefits the task. The experiments are implemented with publicly available toolkits and are reproducible.
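To make "correlates well with human perception" concrete, here is a minimal sketch of how agreement between predicted and human MOS could be measured. It assumes a Pearson linear correlation coefficient and a mean squared error; the abstract does not say which metrics the authors report, and the function names and sample scores below are hypothetical.

```python
import math


def pearson_correlation(predicted, human):
    """Pearson linear correlation between two equal-length score lists."""
    n = len(predicted)
    if n != len(human) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_p = sum(predicted) / n
    mean_h = sum(human) / n
    # Sums of centered products and centered squares.
    cov = sum((p - mean_p) * (h - mean_h) for p, h in zip(predicted, human))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_h = sum((h - mean_h) ** 2 for h in human)
    if var_p == 0 or var_h == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_p * var_h)


def mean_squared_error(predicted, human):
    """Average squared difference between predicted and human scores."""
    if len(predicted) != len(human) or not predicted:
        raise ValueError("need two equal-length, non-empty lists")
    return sum((p - h) ** 2 for p, h in zip(predicted, human)) / len(predicted)


if __name__ == "__main__":
    # Hypothetical scores on the 1-5 MOS scale, for illustration only.
    human_mos = [4.5, 3.0, 2.0, 3.5]
    predicted_mos = [4.2, 3.1, 2.4, 3.3]
    print("correlation:", pearson_correlation(predicted_mos, human_mos))
    print("mse:", mean_squared_error(predicted_mos, human_mos))
```

A higher correlation and a lower error would indicate that the automatic predictor tracks listener judgments more closely, without needing a clean reference signal for each utterance.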



Investigating self-supervised front ends for speech spoofing countermeasures

Self-supervised speech model is a rapid progressing research topic, and ...

Predicting within and across language phoneme recognition performance of self-supervised learning speech pre-trained models

In this work, we analyzed and compared speech representations extracted ...

Efficient Adapters for Giant Speech Models

Large pre-trained speech models are widely used as the de-facto paradigm...

Rep2wav: Noise Robust text-to-speech Using self-supervised representations

Benefiting from the development of deep learning, text-to-speech (TTS) t...

Improving Self-Supervised Learning-based MOS Prediction Networks

MOS (Mean Opinion Score) is a subjective method used for the evaluation ...

Comparison of Speech Representations for the MOS Prediction System

Automatic methods to predict Mean Opinion Score (MOS) of listeners have ...
