Sequence Model with Self-Adaptive Sliding Window for Efficient Spoken Document Segmentation

by   Qinglin Zhang, et al.
Renmin University of China
Alibaba Group

Transcripts generated by automatic speech recognition (ASR) systems for spoken documents lack structural annotations such as paragraphs, significantly reducing their readability. Automatically predicting paragraph segmentation for spoken documents may both improve readability and downstream NLP performance such as summarization and machine reading comprehension. We propose a sequence model with self-adaptive sliding window for accurate and efficient paragraph segmentation. We also propose an approach to exploit phonetic information, which significantly improves robustness of spoken document segmentation to ASR errors. Evaluations are conducted on the English Wiki-727K document segmentation benchmark, a Chinese Wikipedia-based document segmentation dataset we created, and an in-house Chinese spoken document dataset. Our proposed model outperforms the state-of-the-art (SOTA) model based on the same BERT-Base, increasing segmentation F1 on the English benchmark by 4.2 points and on Chinese datasets by 4.3-10.1 points, while reducing inference time to less than 1/6 of inference time of the current SOTA.


page 1

page 2

page 3

page 4


Discriminative Self-training for Punctuation Prediction

Punctuation prediction for automatic speech recognition (ASR) output tra...

Spoken SQuAD: A Study of Mitigating the Impact of Speech Recognition Errors on Listening Comprehension

Reading comprehension has been widely studied. One of the most represent...

Knowledge Distillation for Improved Accuracy in Spoken Question Answering

Spoken question answering (SQA) is a challenging task that requires the ...

OLISIA: a Cascade System for Spoken Dialogue State Tracking

Though Dialogue State Tracking (DST) is a core component of spoken dialo...

Toward Unifying Text Segmentation and Long Document Summarization

Text segmentation is important for signaling a document's structure. Wit...

Identifying Introductions in Podcast Episodes from Automatically Generated Transcripts

As the volume of long-form spoken-word content such as podcasts explodes...

MUG: A General Meeting Understanding and Generation Benchmark

Listening to long video/audio recordings from video conferencing and onl...

Please sign up or login with your details

Forgot password? Click here to reset