mSLAM: Massively multilingual joint pre-training for speech and text

02/03/2022
by Ankur Bapna, et al.

We present mSLAM, a multilingual Speech and LAnguage Model that learns cross-lingual cross-modal representations of speech and text by pre-training jointly on large amounts of unlabeled speech and text in multiple languages. mSLAM combines w2v-BERT pre-training on speech with SpanBERT pre-training on character-level text, along with Connectionist Temporal Classification (CTC) losses on paired speech and transcript data, to learn a single model capable of learning from and representing both speech and text signals in a shared representation space. We evaluate mSLAM on several downstream speech understanding tasks and find that joint pre-training with text improves quality on speech translation, speech intent classification and speech language-ID while being competitive on multilingual ASR, when compared against speech-only pre-training. Our speech translation model demonstrates zero-shot text translation without seeing any text translation data, providing evidence for cross-modal alignment of representations. mSLAM also benefits from multi-modal fine-tuning, further improving the quality of speech translation by directly leveraging text translation data during the fine-tuning process. Our empirical analysis highlights several opportunities and challenges arising from large-scale multimodal pre-training, suggesting directions for future research.
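The abstract describes a multi-task objective: w2v-BERT masked prediction on unlabeled speech, SpanBERT-style span masking on character-level text, and a CTC loss on paired speech/transcript data. The sketch below is a hedged illustration of how such per-batch loss routing might look; the function names, routing logic, and `ctc_weight` value are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch (not the paper's code): route each batch to the
# losses its modality supports and sum them into one training objective.
# speech_loss_fn  ~ w2v-BERT masked-speech loss (assumed interface)
# text_loss_fn    ~ SpanBERT character-level span-masking loss (assumed)
# ctc_loss_fn     ~ CTC loss on paired speech/transcript data (assumed)

def batch_loss(batch, speech_loss_fn, text_loss_fn, ctc_loss_fn,
               ctc_weight=0.3):
    """Combine the three pre-training objectives for one batch.

    `batch` is a dict that may contain "speech", "text", and/or
    "transcript" keys; `ctc_weight` is a hypothetical mixing weight.
    """
    total = 0.0
    if "speech" in batch:                  # unlabeled or paired audio
        total += speech_loss_fn(batch["speech"])
    if "text" in batch:                    # unlabeled or paired text
        total += text_loss_fn(batch["text"])
    if "speech" in batch and "transcript" in batch:   # paired data only
        total += ctc_weight * ctc_loss_fn(batch["speech"],
                                          batch["transcript"])
    return total
```

With dummy loss functions, an unpaired speech batch contributes only the speech term, while a paired batch adds all three, which matches the abstract's description of learning a shared representation space from both unpaired and paired data.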


Related research

11/09/2022  ERNIE-UniX2: A Unified Cross-lingual Cross-modal Framework for Understanding and Generation
Recent cross-lingual cross-modal works attempt to extend Vision-Language...

10/20/2021  SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training
Unsupervised pre-training is now the predominant approach for both text ...

10/31/2022  Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation
Direct speech-to-speech translation (S2ST) is an attractive research top...

10/12/2022  SQuId: Measuring Speech Naturalness in Many Languages
Much of text-to-speech research relies on human evaluation, which incurs...

08/11/2023  Improving Joint Speech-Text Representations Without Alignment
The last year has seen astonishing progress in text-prompted image gener...

01/07/2023  Transferring Pre-trained Multimodal Representations with Cross-modal Similarity Matching
Despite surprising performance on zero-shot transfer, pre-training a lar...

10/13/2022  JOIST: A Joint Speech and Text Streaming Model For ASR
We present JOIST, an algorithm to train a streaming, cascaded, encoder e...
