Topic Segmentation in the Wild: Towards Segmentation of Semi-structured & Unstructured Chats

11/27/2022
by Reshmi Ghosh, et al.

Breaking down a document or a conversation into multiple contiguous segments based on its semantic structure is an important and challenging problem in NLP, and one that can assist many downstream tasks. However, current work on topic segmentation often focuses on structured texts. In this paper, we comprehensively analyze the generalization capabilities of state-of-the-art topic segmentation models on unstructured texts. We find that: (a) the current strategy of pre-training on a large corpus of structured text such as Wiki-727K does not help models transfer to unstructured texts; (b) training from scratch on only a relatively small dataset from the target unstructured domain improves segmentation results by a significant margin.


