Masked Modeling Duo for Speech: Specializing General-Purpose Audio Representation to Speech using Denoising Distillation

05/23/2023
by Daisuke Niizumi, et al.

General-purpose audio representations learned through self-supervised learning have demonstrated high performance in a variety of tasks. Although they can be optimized for an application by fine-tuning, even higher performance can be expected if they are specialized for the application during pre-training. This paper explores the challenges and solutions in specializing general-purpose audio representations for a specific application, using speech, a highly demanding field, as an example. We enhance Masked Modeling Duo (M2D), a general-purpose model, to close the performance gap with state-of-the-art (SOTA) speech models. To do so, we propose a new task, denoising distillation, which learns from fine-grained clustered features, and M2D for Speech (M2D-S), which jointly learns the denoising distillation task and the M2D masked prediction task. Experimental results show that M2D-S performs comparably to or outperforms SOTA speech models on the SUPERB benchmark, demonstrating that M2D can specialize in a demanding field. Our code is available at: https://github.com/nttcslab/m2d/tree/master/speech
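The abstract gives no formulas, so the following is only a minimal sketch of what jointly learning the two tasks could look like, not the authors' implementation. It assumes that the noisy input is made by mixing background noise into clean speech at a chosen signal-to-noise ratio, that both tasks use a mean squared error between L2-normalized vectors, and that the two losses are combined as a weighted sum. The names `mix_at_snr`, `normalized_mse`, `joint_loss`, and `distill_weight` are all hypothetical.

```python
import numpy as np


def mix_at_snr(clean, noise, snr_db):
    """Add noise to clean audio so the result has the requested SNR in dB.

    Assumption: this is one plausible way to build the noisy input for a
    denoising task; the abstract does not describe the mixing procedure.
    """
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that clean_power / scaled_noise_power == 10^(snr_db / 10).
    scale = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise


def normalized_mse(pred, target, eps=1e-8):
    """Squared distance between L2-normalized vectors, averaged over rows.

    Assumption: used here for both the masked prediction loss and the
    distillation loss; the actual loss functions may differ.
    """
    pred = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    target = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    return float(np.mean(np.sum((pred - target) ** 2, axis=-1)))


def joint_loss(masked_prediction_loss, distillation_loss, distill_weight=1.0):
    """Weighted sum of the two task losses (hypothetical weighting scheme)."""
    return masked_prediction_loss + distill_weight * distillation_loss
```

In this sketch, the distillation loss would be `normalized_mse(student_output, teacher_features)`, where `student_output` is computed from the noisy input and `teacher_features` come from the clean speech, so the model has to ignore the noise to match its target.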

research
09/19/2023

AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models

Audio-visual representation learning aims to develop systems with human-...
research
03/27/2018

Mittens: An Extension of GloVe for Learning Domain-Specialized Representations

We present a simple extension of the GloVe representation learning model...
research
04/26/2022

Masked Spectrogram Modeling using Masked Autoencoders for Learning General-purpose Audio Representation

Recent general-purpose audio representations show state-of-the-art perfo...
research
09/26/2022

The Ability of Self-Supervised Speech Models for Audio Representations

Self-supervised learning (SSL) speech models have achieved unprecedented...
research
11/08/2019

DZip: improved general-purpose lossless compression based on novel neural network modeling

We consider lossless compression based on statistical data modeling foll...
research
03/06/2022

HEAR 2021: Holistic Evaluation of Audio Representations

What audio embedding approach generalizes best to a wide range of downst...
research
10/28/2022

SG-VAD: Stochastic Gates Based Speech Activity Detection

We propose a novel voice activity detection (VAD) model in a low-resourc...
