Robustness of sentence length measures in written texts

05/02/2018
by Denner S. Vieira, et al.

Hidden structural patterns in written texts have been the subject of considerable research in recent decades. In particular, mapping a text into a time series of sentence lengths is a natural way to investigate text structure. Typically, sentence length has been quantified using measures based on the number of words and the number of characters, but other variations are possible. To quantify the robustness of different sentence length measures, we analyzed a database containing about five hundred books in English. For each book, we extracted six distinct measures of sentence length, including number of words and number of characters (taking into account lemmatization and stop-word removal). We compared these six measures for each book by using i) Pearson's coefficient to investigate linear correlations; ii) the Kolmogorov–Smirnov test to compare distributions; and iii) detrended fluctuation analysis (DFA) to quantify auto-correlations. We found that all six measures exhibit very similar behavior, suggesting that sentence length is a robust measure related to text structure.
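
The three comparisons named in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' code. The sentence splitter, the function names (`sentence_lengths`, `pearson`, `ks_statistic`, `dfa_exponent`), the choice of linear detrending, and the window sizes are all assumptions made for illustration. Only two of the six measures (words and characters per sentence) are shown, and lemmatization and stop-word removal are left out.

```python
import math
import re
from bisect import bisect_right

def sentence_lengths(text):
    """Return (words per sentence, characters per sentence).
    Assumption: a sentence ends at '.', '!' or '?' followed by whitespace."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = [len(s.split()) for s in sentences]
    chars = [len("".join(s.split())) for s in sentences]  # whitespace excluded
    return words, chars

def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the two empirical cumulative distribution functions."""
    xs, ys = sorted(x), sorted(y)
    gap = 0.0
    for v in sorted(set(xs + ys)):
        fx = bisect_right(xs, v) / len(xs)
        fy = bisect_right(ys, v) / len(ys)
        gap = max(gap, abs(fx - fy))
    return gap

def dfa_exponent(series, scales):
    """Estimate the DFA exponent with linear detrending (an assumption).
    Every scale must be at least 2 and no larger than len(series)."""
    mean = sum(series) / len(series)
    profile, total = [], 0.0
    for v in series:  # integrated, mean-subtracted series
        total += v - mean
        profile.append(total)
    log_n, log_f = [], []
    for n in scales:
        t = list(range(n))
        squared, count = 0.0, 0
        # Non-overlapping windows of size n; a leftover tail is ignored.
        for start in range(0, len(profile) - n + 1, n):
            window = profile[start:start + n]
            slope, intercept = linear_fit(t, window)
            squared += sum((w - (slope * i + intercept)) ** 2
                           for i, w in enumerate(window))
            count += n
        log_n.append(math.log(n))
        log_f.append(math.log(math.sqrt(squared / count)))
    # The exponent is the slope of log F(n) against log n.
    return linear_fit(log_n, log_f)[0]
```

For a given book, `words, chars = sentence_lengths(text)` yields two series that can be passed to `pearson(words, chars)`, and each series can be passed separately to `dfa_exponent`. Because word counts and character counts live on different scales, the series would presumably need rescaling (for example, dividing by the mean) before `ks_statistic` gives a meaningful comparison. That rescaling step is an assumption here, not something the abstract specifies.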
