Universal and non-universal text statistics: Clustering coefficient for language identification

11/18/2019
by Diego Espitia, et al.

In this work we analyze the statistical properties of 91 relatively small texts in 7 different languages (Spanish, English, French, German, Turkish, Russian, Icelandic), as well as texts with randomly inserted spaces. Despite the small size of the texts (around 11260 different words), the well-known universal statistical laws, namely Zipf's and Herdan-Heap's laws, are confirmed and are in close agreement with results obtained elsewhere. We also construct a word co-occurrence network for each text. While the degree distribution is again universal, we note that the distribution of clustering coefficients, which depend strongly on the local structure of the networks, can be used to differentiate between languages, as well as to distinguish natural languages from random texts.
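
The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the tokenizer, the rule that two words are linked when they appear next to each other, and all function names are assumptions made here for illustration. It computes the rank-frequency list used to check Zipf's law, the vocabulary growth curve used to check the Herdan-Heap's law, and the local clustering coefficient of every word in an undirected co-occurrence network.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: words are maximal runs of letters/digits, lowercased.
    return re.findall(r"\w+", text.lower())


def rank_frequency(tokens):
    # Zipf: word frequencies in decreasing order; list index + 1 is the rank.
    return sorted(Counter(tokens).values(), reverse=True)


def vocabulary_growth(tokens):
    # Herdan-Heap: number of distinct words seen after each token of the text.
    seen, growth = set(), []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth


def cooccurrence_network(tokens):
    # Assumption: two different words are linked if they are adjacent in the text.
    neighbors = defaultdict(set)
    for a, b in zip(tokens, tokens[1:]):
        if a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)
    return dict(neighbors)


def clustering_coefficients(neighbors):
    # Local clustering coefficient: fraction of pairs of a node's neighbors
    # that are themselves linked; defined as 0 for nodes of degree below 2.
    coeffs = {}
    for node, nbrs in neighbors.items():
        k = len(nbrs)
        if k < 2:
            coeffs[node] = 0.0
            continue
        ordered = sorted(nbrs)
        links = sum(
            1
            for i, u in enumerate(ordered)
            for v in ordered[i + 1:]
            if v in neighbors[u]
        )
        coeffs[node] = 2.0 * links / (k * (k - 1))
    return coeffs


if __name__ == "__main__":
    tokens = tokenize("the cat saw the dog and the dog saw the cat")
    print(rank_frequency(tokens))
    print(vocabulary_growth(tokens))
    print(clustering_coefficients(cooccurrence_network(tokens)))
```

Comparing the distribution of the resulting coefficients across texts is the kind of analysis the abstract describes; the real study may use a different tokenization or linking rule.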

research
03/06/2018

Co-occurrence of the Benford-like and Zipf Laws Arising from the Texts Representing Human and Artificial Languages

We demonstrate that large texts, representing human (English, Russian, U...
research
05/25/2016

SS4MCT: A Statistical Stemmer for Morphologically Complex Texts

There have been multiple attempts to resolve various inflection matching...
research
04/09/2020

Two halves of a meaningful text are statistically different

Which statistical features distinguish a meaningful text (possibly writt...
research
12/21/2022

Universal versus system-specific features of punctuation usage patterns in major Western languages

The celebrated proverb that "speech is silver, silence is golden" has a ...
research
07/24/2023

Corrections of Zipf's and Heaps' Laws Derived from Hapax Rate Models

The article introduces corrections to Zipf's and Heaps' laws based on sy...
research
11/18/2016

Statistical Properties of European Languages and Voynich Manuscript Analysis

The statistical properties of letters frequencies in European literature...
research
07/15/2015

Language discrimination and clustering via a neural network approach

We classify twenty-one Indo-European languages starting from written tex...
