On The Robustness of Self-Supervised Representations for Spoken Language Modeling

09/30/2022
by Itai Gat, et al.

Self-supervised representations have been extensively studied for discriminative and generative tasks. However, their robustness has not been thoroughly investigated. This work focuses on self-supervised representations for generative spoken language models. First, we empirically demonstrate that current state-of-the-art speech representation models lack robustness to basic signal variations that do not alter the spoken information. To overcome this, we propose an effective and efficient method for learning robust self-supervised speech representations for generative spoken language modeling. The proposed approach applies a set of signal transformations to the speech signal and optimizes the model using an iterative pseudo-labeling scheme. Our method significantly improves over the evaluated baselines on encoding metrics. We additionally evaluate our method on the speech-to-speech translation task, considering Spanish-English and French-English translation, and empirically demonstrate the benefits of the proposed approach.
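As an illustration of what robustness to signal variations could mean for discrete speech units, the sketch below compares the unit sequence of a clean utterance with that of a perturbed copy. The abstract does not specify its encoding metrics, so everything here is an assumption made for illustration: the function names (`dedup`, `edit_distance`, `unit_mismatch_rate`), the choice of a normalized edit distance over de-duplicated units, and the toy unit sequences. It is not presented as the authors' implementation.

```python
from typing import List, Sequence


def dedup(units: Sequence[int]) -> List[int]:
    """Collapse consecutive repeats, e.g. [3, 3, 7, 7, 2] -> [3, 7, 2]."""
    out: List[int] = []
    for u in units:
        if not out or out[-1] != u:
            out.append(u)
    return out


def edit_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def unit_mismatch_rate(clean_units: Sequence[int],
                       perturbed_units: Sequence[int]) -> float:
    """Edit distance between de-duplicated unit sequences, normalized by the
    length of the clean sequence. Hypothetical metric: 0.0 means the perturbed
    signal was encoded to the same units as the clean one."""
    ref = dedup(clean_units)
    hyp = dedup(perturbed_units)
    return edit_distance(ref, hyp) / max(len(ref), 1)


# Toy unit sequences (made up): the perturbation changes one unit out of four.
clean = [1, 1, 2, 3, 3, 4]
perturbed = [1, 2, 2, 9, 4]
print(unit_mismatch_rate(clean, perturbed))  # 0.25
```

Under this assumed measure, a representation that is robust to a transformation which preserves the spoken content would yield a rate close to zero, since only the unit identities, not their durations, are compared.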

research
01/02/2023

Analysing Discrete Self Supervised Speech Representation for Spoken Language Modeling

This work profoundly analyzes discrete self-supervised speech representa...
research
11/23/2020

The Zero Resource Speech Benchmark 2021: Metrics and baselines for unsupervised spoken language modeling

We introduce a new unsupervised task, spoken language modeling: the lear...
research
06/02/2023

BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models

Self-supervised techniques for learning speech representations have been...
research
07/17/2021

Learning De-identified Representations of Prosody from Raw Audio

We propose a method for learning de-identified prosody representations f...
research
05/19/2023

North Sámi Dialect Identification with Self-supervised Speech Models

The North Sámi (NS) language encapsulates four primary dialectal variant...
research
02/07/2022

Self-Supervised Representation Learning for Speech Using Visual Grounding and Masked Language Modeling

In this paper, we describe our submissions to the ZeroSpeech 2021 Challe...
