SLICER: Learning universal audio representations using low-resource self-supervised pre-training

11/02/2022
by   Ashish Seth, et al.

We present a new Self-Supervised Learning (SSL) approach to pre-train encoders on unlabeled audio data, reducing the need for large amounts of labeled data in audio and speech classification. Our primary aim is to learn audio representations that generalize across a wide variety of speech and non-speech tasks in a low-resource unlabeled audio pre-training setting. Inspired by the recent success of clustering and contrastive learning paradigms for SSL-based speech representation learning, we propose SLICER (Symmetrical Learning of Instance and Cluster-level Efficient Representations), which brings together the best of both paradigms. We use a symmetric loss between latent representations from student and teacher encoders and simultaneously solve instance-level and cluster-level contrastive learning tasks. We obtain cluster representations online simply by projecting the input spectrogram into an output subspace whose dimensionality equals the number of clusters. In addition, we propose a novel mel-spectrogram augmentation procedure, k-mix, based on mixup, which does not require labels and aids unsupervised representation learning for audio. Overall, SLICER achieves state-of-the-art results on the LAPE Benchmark <cit.>, significantly outperforming DeLoRes-M and other prior approaches that are pre-trained on 10× more unsupervised data. We will make all our code available on GitHub.


Related research

10/14/2021 · Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks
Representation learning from unlabeled data has been of major interest i...

03/10/2023 · UnFuSeD: UNsupervised Finetuning Using SElf supervised Distillation
In this paper, we introduce UnFuSeD, a novel approach to leverage self-s...

05/02/2023 · Self-supervised learning for infant cry analysis
In this paper, we explore self-supervised learning (SSL) for analyzing a...

05/28/2023 · Investigating Pre-trained Audio Encoders in the Low-Resource Condition
Pre-trained speech encoders have been central to pushing state-of-the-ar...

02/12/2022 · Wav2Vec2.0 on the Edge: Performance Evaluation
Wav2Vec2.0 is a state-of-the-art model which learns speech representatio...

03/25/2022 · DeLoRes: Decorrelating Latent Spaces for Low-Resource Audio Representation Learning
Inspired by the recent progress in self-supervised learning for computer...

07/14/2023 · Representation Learning With Hidden Unit Clustering For Low Resource Speech Applications
The representation learning of speech, without textual resources, is an ...
