Pre-training for Spoken Language Understanding with Joint Textual and Phonetic Representation Learning

by   Qian Chen, et al.

In the traditional cascading architecture for spoken language understanding (SLU), it has been observed that automatic speech recognition errors could be detrimental to the performance of natural language understanding. End-to-end (E2E) SLU models have been proposed to directly map speech input to desired semantic frame with a single model, hence mitigating ASR error propagation. Recently, pre-training technologies have been explored for these E2E models. In this paper, we propose a novel joint textual-phonetic pre-training approach for learning spoken language representations, aiming at exploring the full potentials of phonetic information to improve SLU robustness to ASR errors. We explore phoneme labels as high-level speech features, and design and compare pre-training tasks based on conditional masked language model objectives and inter-sentence relation objectives. We also investigate the efficacy of combining textual and phonetic information during fine-tuning. Experimental results on spoken language understanding benchmarks, Fluent Speech Commands and SNIPS, show that the proposed approach significantly outperforms strong baseline models and improves robustness of spoken language understanding to ASR errors.


page 1

page 2

page 3

page 4


Speech-language Pre-training for End-to-end Spoken Language Understanding

End-to-end (E2E) spoken language understanding (SLU) can infer semantics...

Pre-Training for Query Rewriting in A Spoken Language Understanding System

Query rewriting (QR) is an increasingly important technique to reduce cu...

Semi-Supervised Speech-Language Joint Pre-Training for Spoken Language Understanding

Spoken language understanding (SLU) requires a model to analyze input ac...

End-to-End Spoken Language Understanding: Performance analyses of a voice command task in a low resource setting

Spoken Language Understanding (SLU) is a core task in most human-machine...

Improving Medical Speech-to-Text Accuracy with Vision-Language Pre-training Model

Automatic Speech Recognition (ASR) is a technology that converts spoken ...

ASR error management for improving spoken language understanding

This paper addresses the problem of automatic speech recognition (ASR) e...

Two-stage Textual Knowledge Distillation to Speech Encoder for Spoken Language Understanding

End-to-end approaches open a new way for more accurate and efficient spo...

Please sign up or login with your details

Forgot password? Click here to reset