Advancing Topic Segmentation and Outline Generation in Chinese Texts: The Paragraph-level Topic Representation, Corpus, and Benchmark

by Feng Jiang, et al.

Topic segmentation and outline generation aim to divide a document into coherent topic sections and to generate a subheading for each section. This process reveals the discourse topic structure of a document, which helps readers quickly grasp the overall context of the document at a higher level. However, research and applications in this field have been held back in Chinese by the lack of proper paragraph-level topic representations and of large-scale, high-quality corpora, in contrast to the success achieved in English. To address these issues, we introduce a hierarchical paragraph-level topic structure representation with three layers (title, subheading, and paragraph) that comprehensively models the discourse topic structure of a document. In addition, we represent sub-topics with sentences instead of keywords, which gives a more holistic representation of the topic distribution within the document. Following this representation, we construct the largest Chinese Paragraph-level Topic Structure corpus (CPTS), four times larger than the previously largest one. We also employ a two-stage man-machine collaborative annotation method to ensure the high quality of the corpus in both form and semantics. Finally, we validate the computability of CPTS on two fundamental tasks (topic segmentation and outline generation) with several strong baselines, and we preliminarily confirm its efficacy on a downstream task, discourse parsing. The representation, corpus, and benchmark we establish will provide a solid foundation for future studies.
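As an illustration only, the three-layer hierarchy (title, subheading, paragraph) can be sketched as a small data structure from which both task targets are derived. The class names `Document` and `Section`, the helper names, and the convention of labeling the last paragraph of each section with 1 are assumptions made for this sketch; they are not the CPTS file format or the authors' code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Section:
    """One topic section: a sentence-level subheading plus its paragraphs."""
    subheading: str
    paragraphs: List[str]


@dataclass
class Document:
    """Hypothetical container for the title / subheading / paragraph hierarchy."""
    title: str
    sections: List[Section] = field(default_factory=list)


def boundary_labels(doc: Document) -> List[int]:
    """Topic segmentation target: one label per paragraph.

    Assumed convention: 1 marks the last paragraph of a topic section,
    0 marks every other paragraph.
    """
    labels: List[int] = []
    for section in doc.sections:
        if not section.paragraphs:
            continue  # an empty section contributes no paragraphs
        labels.extend([0] * (len(section.paragraphs) - 1) + [1])
    return labels


def outline(doc: Document) -> List[str]:
    """Outline generation target: the title followed by each subheading."""
    return [doc.title] + [section.subheading for section in doc.sections]
```

Under these assumptions, a segmentation model would predict the label sequence from the flat list of paragraphs, and an outline generator would produce one subheading for each predicted section.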




