A Framework for Authorial Clustering of Shorter Texts in Latent Semantic Spaces

11/30/2020
by Rafi Trad, et al.

Authorial clustering groups documents written by the same author or team of authors without any prior positive examples of an author's writing style or thematic preferences. For authorial clustering of shorter texts (paragraph-length texts that are typically shorter than conventional documents), the document representation is particularly important: very high-dimensional feature spaces lead to data sparsity and suffer from the curse of dimensionality, while feature selection may lead to information loss. We propose a high-level framework that uses a compact data representation in a latent feature space derived with non-parametric topic modeling. Authorial clusters are then identified in two scenarios: (a) fully unsupervised and (b) semi-supervised, where a small number of shorter texts are known to belong to the same author (must-link constraints) or to different authors (cannot-link constraints). We report on experiments with 120 collections in three languages and two genres and show that the topic-based latent feature space provides a promising level of performance while reducing the dimensionality by a factor of 1500 compared to state-of-the-art approaches. We also demonstrate that, while prior knowledge of the precise number of authors (i.e., authorial clusters) contributes little additional quality, even a small amount of knowledge about constraints on authorial cluster membership leads to clear performance improvements on this difficult task. Thorough experimentation with standard metrics indicates that there remains ample room for improvement in authorial clustering, especially with shorter texts.
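
To make the semi-supervised scenario more concrete, the following is a minimal sketch of how must-link and cannot-link constraints can steer a clustering of texts that are already represented as low-dimensional topic-proportion vectors. It is not the authors' method: the abstract does not name a clustering algorithm, so a simple COP-KMeans-style assignment step is swapped in here for illustration. The function names (`constrained_kmeans`, `violates`), the toy three-topic vectors, and the initial centroids are all hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def violates(i, cluster, labels, must_link, cannot_link):
    """Return True if placing text i in `cluster` breaks a constraint
    against a text that has already been assigned in this pass."""
    for a, b in must_link:
        if i in (a, b):
            other = b if a == i else a
            # A must-link partner already sits in a different cluster.
            if labels[other] is not None and labels[other] != cluster:
                return True
    for a, b in cannot_link:
        if i in (a, b):
            other = b if a == i else a
            # A cannot-link partner already sits in this cluster.
            if labels[other] == cluster:
                return True
    return False


def constrained_kmeans(points, init_centroids, must_link=(), cannot_link=(), iters=10):
    """COP-KMeans-style sketch: each text goes to the nearest centroid
    that does not violate a constraint, then centroids are recomputed."""
    centroids = [list(c) for c in init_centroids]
    labels = [None] * len(points)
    for _ in range(iters):
        labels = [None] * len(points)
        for i, p in enumerate(points):
            # Try clusters from nearest to farthest.
            order = sorted(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for k in order:
                if not violates(i, k, labels, must_link, cannot_link):
                    labels[i] = k
                    break
            else:
                raise ValueError(f"no feasible cluster for text {i}")
        for k in range(len(centroids)):
            members = [points[i] for i, lab in enumerate(labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical topic proportions (3 latent topics) for five short texts.
texts = [
    (0.8, 0.1, 0.1),
    (0.7, 0.2, 0.1),
    (0.1, 0.8, 0.1),
    (0.2, 0.7, 0.1),
    (0.5, 0.4, 0.1),  # borderline text
]
seeds = [texts[0], texts[2]]  # assumed initial centroids

unsupervised = constrained_kmeans(texts, seeds)
with_must_link = constrained_kmeans(texts, seeds, must_link=[(4, 2)])
with_cannot_link = constrained_kmeans(texts, seeds, cannot_link=[(0, 1)])
```

In this toy setting the borderline text 4 joins the cluster of texts 0 and 1 when no constraints are given, and moves to the cluster of text 2 once the pair (4, 2) is declared must-link. Note that this greedy assignment depends on the order in which texts are visited, which is one reason more careful constrained-clustering schemes exist.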

research
06/15/2021

Author Clustering and Topic Estimation for Short Texts

Analysis of short text, such as social media posts, is extremely difficu...
research
09/22/2016

Bibliographic Analysis with the Citation Network Topic Model

Bibliographic analysis considers author's research areas, the citation n...
research
05/16/2022

Quantitative Discourse Cohesion Analysis of Scientific Scholarly Texts using Multilayer Networks

Discourse cohesion facilitates text comprehension and helps the reader f...
research
12/02/2020

Analyzing Stylistic Variation across Different Political Regimes

In this article we propose a stylistic analysis of texts written across ...
research
02/01/2022

A Semi-Supervised Deep Clustering Pipeline for Mining Intentions From Texts

Mining the latent intentions from large volumes of natural language inpu...
research
11/27/2012

A simple non-parametric Topic Mixture for Authors and Documents

This article reviews the Author-Topic Model and presents a new non-param...
research
09/02/2023

MPTopic: Improving topic modeling via Masked Permuted pre-training

Topic modeling is pivotal in discerning hidden semantic structures withi...
