Learning De-identified Representations of Prosody from Raw Audio

07/17/2021
by Jack Weston, et al.

We propose a method for learning de-identified prosody representations from raw audio using a contrastive self-supervised signal. Whereas prior work has relied on conditioning models on bottlenecks, we introduce a set of inductive biases that exploit the natural structure of prosody to minimize timbral information and decouple prosody from speaker representations. Despite aggressive downsampling of the input and having no access to linguistic information, our model performs comparably to state-of-the-art speech representations on DAMMP, a new benchmark we introduce for spoken language understanding. We use minimum description length probing to show that our representations have selectively learned the subcomponents of non-timbral prosody, and that the product quantizer naturally disentangles them without using bottlenecks. We derive an information-theoretic definition of speech de-identifiability and use it to demonstrate that our prosody representations are less identifiable than other speech representations.
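The abstract names a contrastive self-supervised signal computed over heavily downsampled, timbre-poor input, but the model and loss are not reproduced on this page. The following is a minimal sketch only: an InfoNCE-style contrastive objective over a toy encoder of prosodic tracks, assuming PyTorch. The names ProsodyEncoder and info_nce, the choice of F0/energy input channels, and all dimensions are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a contrastive (InfoNCE-style) objective over prosody
# features, assuming PyTorch. All names (ProsodyEncoder, info_nce), the
# F0/energy inputs, and the dimensions are illustrative assumptions,
# not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProsodyEncoder(nn.Module):
    """Encodes heavily downsampled prosodic tracks (e.g. F0 and energy)
    into fixed-size embeddings; timbre-rich spectral detail is
    deliberately absent from the input."""
    def __init__(self, in_dim=2, hidden=128, out_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.proj = nn.Linear(hidden, out_dim)

    def forward(self, x):              # x: (batch, in_dim, frames)
        h = self.net(x).squeeze(-1)    # (batch, hidden)
        return F.normalize(self.proj(h), dim=-1)

def info_nce(anchors, positives, temperature=0.1):
    """Anchors and positives are paired views (e.g. two segments of the
    same utterance); every other item in the batch acts as a negative."""
    logits = anchors @ positives.t() / temperature   # (batch, batch)
    targets = torch.arange(anchors.size(0))
    return F.cross_entropy(logits, targets)

# Toy usage: two "views" of a batch of 16 utterances, each a 2-channel
# (F0, energy) track with 200 frames.
enc = ProsodyEncoder()
view_a = torch.randn(16, 2, 200)
view_b = torch.randn(16, 2, 200)
loss = info_nce(enc(view_a), enc(view_b))
loss.backward()
```

In this sketch the positive pair pulls together two views of the same utterance while the rest of the batch serves as negatives. The point echoed from the abstract is that the input itself already excludes most timbral information, rather than relying on a conditioning bottleneck to remove it.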
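The abstract also mentions an information-theoretic definition of speech de-identifiability without stating it here. One hedged, illustrative formulation, under the assumption that de-identifiability is measured by how little mutual information a representation Z retains about speaker identity S, might read:

```latex
% Illustrative assumption, not the paper's stated definition:
% a representation Z is fully de-identified when it carries no
% information about the speaker identity S.
\mathrm{DeID}(Z) \;=\; 1 - \frac{I(Z;\, S)}{H(S)},
\qquad 0 \le \mathrm{DeID}(Z) \le 1 .
```

Under this reading, DeID(Z) = 1 when I(Z; S) = 0 (the representation reveals nothing about the speaker) and DeID(Z) = 0 when Z fully determines identity, which is consistent with the claim that the prosody representations are "less identifiable" than other speech representations.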
