Understanding Shared Speech-Text Representations

04/27/2023
by Gary Wang, et al.

Recently, a number of approaches to train speech models by incorporating text into end-to-end models have been developed, with Maestro advancing state-of-the-art automatic speech recognition (ASR) and Speech Translation (ST) performance. In this paper, we expand our understanding of the resulting shared speech-text representations with two types of analyses. First, we examine the limits of speech-free domain adaptation, finding that a corpus-specific duration model for speech-text alignment is the most important component for learning a shared speech-text representation. Second, we inspect the similarities between activations of unimodal (speech or text) encoders as compared to the activations of a shared encoder. We find that the shared encoder learns a more compact and overlapping speech-text representation than the unimodal encoders. We hypothesize that this partially explains the effectiveness of the Maestro shared speech-text representations.
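The abstract does not say which similarity measure is used to compare encoder activations. As a rough illustration only, the sketch below uses linear centered kernel alignment (CKA), a common choice for comparing activation matrices; it is an assumption swapped in here, not the authors' method. The function name `linear_cka` and the synthetic arrays are hypothetical stand-ins for real speech and text encoder activations.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices.

    x has shape (n_examples, dim_x) and y has shape (n_examples, dim_y);
    row i of each matrix must correspond to the same example.
    Returns a value in [0, 1]; higher means more similar representations.
    """
    # Center each feature so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(x.T @ y, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Synthetic stand-ins (assumption): real inputs would be pooled
# activations from a speech encoder and a text encoder on paired data.
rng = np.random.default_rng(0)
speech_acts = rng.normal(size=(200, 16))
rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))
rotated_acts = speech_acts @ rotation      # same geometry, rotated basis
unrelated_acts = rng.normal(size=(200, 16))

same_score = linear_cka(speech_acts, rotated_acts)
unrelated_score = linear_cka(speech_acts, unrelated_acts)
```

Linear CKA is invariant to rotations of the feature space, so two encoders that learn the same geometry in different bases still score close to 1, while unrelated activations score much lower. That property makes it one plausible way to ask whether speech and text activations overlap more in a shared encoder than in unimodal ones.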

