Semantic VAD: Low-Latency Voice Activity Detection for Speech Interaction

by Mohan Shi, et al.

For speech interaction, voice activity detection (VAD) is often used as a front-end. However, traditional VAD algorithms usually need to wait for a continuous tail silence to reach a preset maximum duration before segmentation, resulting in high latency that degrades the user experience. In this paper, we propose a novel semantic VAD for low-latency segmentation. Different from existing methods, a frame-level punctuation prediction task is added to the semantic VAD, and an artificial endpoint class is included in the classification categories in addition to the commonly used speech presence and absence. To enhance the semantic information of the model, we also incorporate an automatic speech recognition (ASR) related semantic loss. Evaluations on an internal dataset show that the proposed method can reduce the average latency by 53.3% without significant deterioration of the character error rate in the back-end ASR compared to the traditional VAD approach.
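The latency gap between the two approaches comes down to the endpointing rule: a traditional VAD must accumulate a full run of tail-silence frames before it segments, while a semantic VAD can fire the moment its frame-level classifier predicts the endpoint class. The following toy sketch illustrates this difference; the label names, the 30-frame silence threshold, and the example utterance are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch (not the paper's code): contrasting tail-silence
# endpointing with a semantic VAD that emits a frame-level "endpoint" class.

SILENCE, SPEECH, ENDPOINT = 0, 1, 2  # assumed per-frame class labels

def traditional_endpoint(frame_labels, max_tail_silence=30):
    """Segment once a run of trailing non-speech frames reaches the preset
    maximum duration (here, an assumed 30 frames)."""
    run = 0
    for t, label in enumerate(frame_labels):
        run = run + 1 if label != SPEECH else 0
        if run >= max_tail_silence:
            return t  # endpoint fires at frame t
    return None

def semantic_endpoint(frame_labels):
    """Segment as soon as the classifier predicts the ENDPOINT class,
    without waiting for the silence run to complete."""
    for t, label in enumerate(frame_labels):
        if label == ENDPOINT:
            return t
    return None

# Toy utterance: 50 speech frames, then a pause in which the semantic model
# flags the endpoint after only a few silence frames.
utt = [SPEECH] * 50 + [SILENCE] * 5 + [ENDPOINT] + [SILENCE] * 40

print(traditional_endpoint(utt))  # fires only after 30 non-speech frames
print(semantic_endpoint(utt))     # fires at the predicted endpoint frame
```

In this toy example the semantic rule segments 24 frames earlier than the tail-silence rule, which is the kind of latency reduction the abstract reports (with the classifier trained jointly on punctuation prediction and an ASR-related semantic loss so that endpoint predictions reflect sentence completeness, not just acoustics).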

