Extreme-K categorical samples problem

07/29/2020
by Elizabeth Chou, et al.

With histograms as its foundation, we develop Categorical Exploratory Data Analysis (CEDA) for the extreme-K sample problem and illustrate its universal applicability through four 1D categorical datasets. Given a sizable K, CEDA's ultimate goal amounts to discovering the data's information content by carrying out two data-driven computational tasks: 1) establishing a tree geometry upon the K populations as a platform for discovering a wide spectrum of patterns among populations; 2) evaluating each geometric pattern's reliability. In CEDA, each population gives rise to a row vector of category proportions. Along the data matrix's row axis, we discuss the pros and cons of Euclidean distance versus its weighted version for building a binary clustering tree geometry. The choice between them rests on the degree of uniformity in the column blocks framed by this binary clustering tree. Each tree leaf (population) is then encoded with a binary code sequence, and so is each tree-based pattern. To evaluate reliability, we adopt row-wise multinomial randomness to generate an ensemble of matrix mimicries, and hence an ensemble of mimicked binary trees. The reliability of any observed pattern is its recurrence rate within the tree ensemble. A high reliability value means a deterministic pattern. Our four applications of CEDA illuminate four significant aspects of extreme-K sample problems.
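
The reliability computation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the average-linkage choice, the plain (unweighted) Euclidean distance, and the use of the tree's top-level split as the "pattern" are all assumptions made for illustration. Each row of counts is resampled from a multinomial with that row's observed proportions and total, a tree is rebuilt on each mimicked matrix, and the reliability is the fraction of mimicked trees that reproduce the observed split.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def proportions(counts):
    """Turn a K x C matrix of category counts into row-wise proportions."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)


def tree_clusters(props, n_clusters=2):
    """Build a binary clustering tree on the rows and cut it into clusters.

    Average linkage with Euclidean distance is an assumption of this sketch.
    """
    tree = linkage(props, method="average", metric="euclidean")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


def same_partition(a, b):
    """True if two label vectors group the rows identically (labels may differ)."""
    return bool(np.array_equal(a[:, None] == a[None, :], b[:, None] == b[None, :]))


def pattern_reliability(counts, n_clusters=2, n_mimics=200, seed=0):
    """Recurrence rate of the observed split across multinomial mimicries."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts)
    probs = proportions(counts)
    sizes = counts.sum(axis=1)
    observed = tree_clusters(probs, n_clusters)
    hits = 0
    for _ in range(n_mimics):
        # Row-wise multinomial randomness: one mimicked row per population.
        mimic = np.vstack([rng.multinomial(n, p) for n, p in zip(sizes, probs)])
        hits += same_partition(observed, tree_clusters(proportions(mimic), n_clusters))
    return hits / n_mimics


# Hypothetical data: four populations, three categories, two obvious groups.
example_counts = np.array([
    [900, 50, 50],
    [880, 60, 60],
    [50, 900, 50],
    [60, 880, 60],
])
print(pattern_reliability(example_counts, n_mimics=100))
```

With groups as well separated as in this made-up example, the observed split should recur in essentially every mimicked tree, which is the sense in which a high reliability value indicates a deterministic pattern.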

research
06/27/2012

Inferring Latent Structure From Mixed Real and Categorical Relational Data

We consider analysis of relational data (a matrix), in which the rows co...
research
07/31/2020

Denoising individual bias for a fairer binary submatrix detection

Low rank representation of binary matrix is powerful in disentangling sp...
research
01/31/2018

Coupling geometry on binary bipartite networks: hypotheses testing on pattern geometry and nestedness

Upon a matrix representation of a binary bipartite network, via the perm...
research
03/23/2021

Weak convergence of U-statistics on a row-column exchangeable matrix

U-statistics are used to estimate a population parameter by averaging a ...
research
01/26/2018

Information Content of a Phylogenetic Tree in a Data Matrix

Phylogenetic trees in genetics and biology in general are all binary. We...
research
04/16/2019

A Pattern-Hierarchy Classifier for Reduced Teaching

This paper uses a branching classifier mechanism in an unsupervised scen...