Tokenwise Contrastive Pretraining for Finer Speech-to-BERT Alignment in End-to-End Speech-to-Intent Systems

04/11/2022
by   Vishal Sunder, et al.
0

Recent advances in End-to-End (E2E) Spoken Language Understanding (SLU) have been primarily due to effective pretraining of speech representations. One such pretraining paradigm is the distillation of semantic knowledge from state-of-the-art text-based models like BERT to speech encoder neural networks. This work is a step towards doing the same in a much more efficient and fine-grained manner where we align speech embeddings and BERT embeddings on a token-by-token basis. We introduce a simple yet novel technique that uses a cross-modal attention mechanism to extract token-level contextual embeddings from a speech encoder such that these can be directly compared and aligned with BERT based contextual embeddings. This alignment is performed using a novel tokenwise contrastive loss. Fine-tuning such a pretrained model to perform intent recognition using speech directly yields state-of-the-art performance on two widely used SLU datasets. Our model improves further when fine-tuned with additional regularization using SpecAugment especially when speech is noisy, giving an absolute improvement as high as 8

READ FULL TEXT

page 1

page 2

page 3

page 4

research
07/03/2020

Pretrained Semantic Speech Embeddings for End-to-End Spoken Language Understanding via Cross-Modal Teacher-Student Learning

Spoken language understanding is typically based on pipeline architectur...
research
12/16/2022

Investigation of Japanese PnG BERT language model in text-to-speech synthesis for pitch accent language

End-to-end text-to-speech synthesis (TTS) can generate highly natural sy...
research
12/29/2020

CMV-BERT: Contrastive multi-vocab pretraining of BERT

In this work, we represent CMV-BERT, which improves the pretraining of a...
research
10/08/2020

Leveraging Unpaired Text Data for Training End-to-End Speech-to-Intent Systems

Training an end-to-end (E2E) neural network speech-to-intent (S2I) syste...
research
06/13/2023

Contrastive Learning-Based Audio to Lyrics Alignment for Multiple Languages

Lyrics alignment gained considerable attention in recent years. State-of...
research
11/30/2022

Topological Data Analysis for Speech Processing

We apply topological data analysis (TDA) to speech classification proble...
research
09/14/2023

Revisiting Supertagging for HPSG

We present new supertaggers trained on HPSG-based treebanks. These treeb...

Please sign up or login with your details

Forgot password? Click here to reset