Spectral Analysis of Word Statistics

12/01/2020
by Chaim Even-Zohar, et al.

Given a random text over a finite alphabet, we study the frequencies at which fixed-length words occur as subsequences. As the data size grows, the joint distribution of word counts exhibits a rich asymptotic structure. We investigate all linear combinations of subword statistics and fully characterize their different orders of magnitude using diverse algebraic tools. Moreover, we establish the spectral decomposition of the space of word statistics of each order, and we provide explicit formulas for the eigenvectors and eigenvalues of the covariance matrix of the multivariate distribution of these statistics. Our techniques include and elaborate on a set of algebraic word operators recently studied and employed by Dieker and Saliola (Adv Math, 2018). Subword counts find applications in combinatorics, statistics, and computer science. We revisit special cases from the combinatorial literature, such as intransitive dice, random core partitions, and questions on random walks. Our structural approach describes several classical statistical tests in a unified framework. We propose further potential applications to data analysis and machine learning.
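
To make the basic object concrete, the following is a minimal sketch of how one might count the occurrences of a fixed word as a (not necessarily contiguous) subsequence of a text, and collect these counts for all words of a given length. The function names `count_subsequence_occurrences` and `word_count_vector` are hypothetical and chosen for illustration only. This is a standard dynamic-programming count, not the algebraic machinery of the paper.

```python
from itertools import product


def count_subsequence_occurrences(text, word):
    """Count index tuples i_1 < ... < i_k with text[i_j] == word[j] for all j."""
    # ways[j] = number of ways to match the first j letters of `word`
    # using the portion of `text` scanned so far.
    ways = [1] + [0] * len(word)
    for ch in text:
        # Scan right to left so that a single text position is never
        # used for two letters of the same match.
        for j in range(len(word), 0, -1):
            if word[j - 1] == ch:
                ways[j] += ways[j - 1]
    return ways[len(word)]


def word_count_vector(text, alphabet, k):
    """Subsequence counts of every length-k word over `alphabet` (illustrative helper)."""
    return {
        "".join(w): count_subsequence_occurrences(text, "".join(w))
        for w in product(alphabet, repeat=k)
    }


if __name__ == "__main__":
    print(word_count_vector("abab", "ab", 2))
```

For a text of length n, the counts over all words of length k sum to the binomial coefficient C(n, k), since every choice of k positions spells exactly one word. The vector of these counts is the multivariate statistic whose linear combinations the abstract refers to.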

research
06/23/2019

Asymptotic joint distribution of extreme eigenvalues and trace of large sample covariance matrix in a generalized spiked population model

This paper studies the joint limiting behavior of extreme eigenvalues an...
research
04/07/2021

Spectral statistics of high dimensional sample covariance matrix with unbounded population spectral norm

In this paper, we establish some new central limit theorems for certain ...
research
07/13/2022

Spectral Statistics of Sample Block Correlation Matrices

A fundamental concept in multivariate statistics, sample correlation mat...
research
08/26/2017

Mahonian STAT on rearrangement class of words

In 2000, Babson and Steingrímsson generalized the notion of permutation ...
research
12/21/2019

Foundations of Structural Statistics: Topological Statistical Theory

Topological Statistical Theory, provides the foundation for a new unders...
research
01/14/2016

Linear Algebraic Structure of Word Senses, with Applications to Polysemy

Word embeddings are ubiquitous in NLP and information retrieval, but it'...
research
12/15/2020

Spectral Methods for Data Science: A Statistical Perspective

Spectral methods have emerged as a simple yet surprisingly effective app...
