Towards Evaluation of Cultural-scale Claims in Light of Topic Model Sampling Effects

12/15/2015
by   Jaimie Murdock, et al.

Cultural-scale models of full-text documents are prone to over-interpretation by researchers who make unintentionally strong socio-linguistic claims (Pechenick et al., 2015) without recognizing that even large digital libraries are merely samples of all the books ever produced. In this study, we test the sensitivity of topic models to the sampling process by taking random samples of books in the Hathi Trust Digital Library from different areas of the Library of Congress Classification Outline. For each classification area, we train several topic models over the entire class with different random seeds, generating a set of spanning models. Then, we train topic models on random samples of books from the classification area, generating a set of sample models. Finally, we perform a topic alignment between each pair of models by computing the Jensen-Shannon distance (JSD) between the word probability distributions for each topic. We take two measures on each model alignment: alignment distance and topic overlap. We find that sample models with a large sample size typically have an alignment distance that falls within the range of alignment distances between spanning models. Unsurprisingly, as sample size increases, alignment distance decreases. We also find that topic overlap increases as sample size increases. However, the decomposition of these measures by sample size differs by the number of topics and by classification area. We speculate that these measures could be used to find classes that have a common "canon" discussed among all books in the area, as shown by high topic overlap and low alignment distance even at small sample sizes.
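The alignment step can be illustrated with a minimal sketch. The abstract does not define the two measures precisely, so the definitions here are assumptions: each topic in one model is matched to its nearest topic in the other model by JSD, alignment distance is taken to be the mean of those nearest distances, and topic overlap is taken to be the fraction of topics in the second model that are selected at least once. The function names are hypothetical, topics are represented as plain lists of word probabilities over a shared vocabulary, and the base-2 logarithm is a choice made for this example.

```python
import math


def js_distance(p, q):
    """Jensen-Shannon distance: square root of the base-2 JS divergence."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = (kl(p, m) + kl(q, m)) / 2
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negatives


def align(model_a, model_b):
    """Assumed alignment: match each topic in model_a to its nearest
    topic in model_b. Returns a list of (index_in_b, distance) pairs."""
    pairs = []
    for topic in model_a:
        dists = [js_distance(topic, other) for other in model_b]
        nearest = min(range(len(dists)), key=dists.__getitem__)
        pairs.append((nearest, dists[nearest]))
    return pairs


def alignment_distance(pairs):
    """Assumed definition: mean distance over the aligned topic pairs."""
    return sum(d for _, d in pairs) / len(pairs)


def topic_overlap(pairs, n_topics_b):
    """Assumed definition: share of model_b topics chosen at least once."""
    return len({j for j, _ in pairs}) / n_topics_b


# Toy models: two topics over a three-word vocabulary (illustrative values).
sample_model = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
spanning_model = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]

pairs = align(sample_model, spanning_model)
print(alignment_distance(pairs), topic_overlap(pairs, len(spanning_model)))
```

In this toy case both sample topics align to the same spanning topic, so the overlap is one half: under these assumed definitions, a small sample that covers only part of a class would show low overlap even when its alignment distance is small.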
