Relating Zipf's law to textual information

09/22/2018
by Weibing Deng, et al.

Zipf's law is the main regularity of quantitative linguistics. Despite many works devoted to the foundations of this law, it is still unclear whether it is only a statistical regularity or has deeper relations with the information-carrying structures of a text. This question relates to that of distinguishing a meaningful text (written in an unknown system) from a meaningless set of symbols that mimics the statistical features of a text. Here we contribute to resolving these questions by comparing features of the first half of a text (from the beginning to the middle) with those of its second half. This comparison can uncover hidden effects, because the two halves share the values of many parameters (style, genre, author's vocabulary, etc.). In all studied texts we saw that Zipf's law applies from smaller ranks in the first half than in the second half, i.e. the law applies better to the first half. Also, words that obey Zipf's law in the first half are distributed more homogeneously over the text. These features make it possible to distinguish a meaningful text from a random sequence of words. Our findings correlate with a number of textual characteristics that hold in most cases we studied: the first half is lexically richer and has longer and less repetitive words, more and shorter sentences, more punctuation marks, and more paragraphs. These differences between the halves indicate a higher hierarchical level of text organization that has so far gone unnoticed in text linguistics. They relate the validity of Zipf's law to textual information. A complete description of this effect requires new models, though one existing model can account for some of its aspects.
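The half-versus-half comparison described above can be illustrated with a minimal Python sketch. It is not the authors' procedure: the regex tokenization, the least-squares fit of log frequency against log rank, the type-token ratio as a stand-in for lexical richness, and all function names are assumptions made for illustration. In particular, the abstract does not say how the rank range where Zipf's law holds is determined, so the fit range here is left as a free parameter.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens (a simplistic, assumed tokenization)."""
    return re.findall(r"[a-z']+", text.lower())


def split_halves(tokens):
    """Split a token list at its midpoint: (first half, second half)."""
    mid = len(tokens) // 2
    return tokens[:mid], tokens[mid:]


def rank_frequency(tokens):
    """Relative word frequencies in descending order (rank 1 first)."""
    counts = Counter(tokens)
    total = len(tokens)
    return [count / total for _, count in counts.most_common()]


def zipf_slope(freqs, r_min=1, r_max=None):
    """Least-squares slope of log f versus log r over ranks r_min..r_max.

    A slope near -1 is what the classic form of Zipf's law predicts.
    The choice of fit range is an assumption left to the caller.
    """
    r_max = r_max or len(freqs)
    ranks = range(r_min, r_max + 1)
    if len(ranks) < 2:
        raise ValueError("need at least two ranks to fit a slope")
    xs = [math.log(r) for r in ranks]
    ys = [math.log(freqs[r - 1]) for r in ranks]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def type_token_ratio(tokens):
    """Distinct words divided by total words (a crude richness proxy)."""
    return len(set(tokens)) / len(tokens)
```

To compare the halves of a text, one would call `split_halves(tokenize(text))`, then apply `rank_frequency`, `zipf_slope`, and `type_token_ratio` to each half separately and inspect the differences. Note that the type-token ratio depends on text length, which is one reason equal-length halves are a convenient unit of comparison.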

