Jaccard/Tanimoto similarity test and estimation methods

by Neo Christopher Chung, et al.

Binary data are used across a broad range of the biological sciences. Using binary presence-absence data, we can evaluate species co-occurrences that help elucidate relationships among organisms and environments. To summarize the similarity between occurrences of species, we routinely use the Jaccard/Tanimoto coefficient, which is the ratio of their intersection to their union. It is natural, then, to identify statistically significant Jaccard/Tanimoto coefficients, which suggest non-random co-occurrences of species. However, statistical hypothesis testing using this similarity coefficient has seldom been used or studied. We introduce a hypothesis test of similarity for biological presence-absence data, using the Jaccard/Tanimoto coefficient. Several key improvements are presented, including unbiased estimation of the expectation and centered Jaccard/Tanimoto coefficients that account for occurrence probabilities. We derive the exact and asymptotic solutions and develop the bootstrap and measurement concentration algorithms to compute the statistical significance of binary similarity. Comprehensive simulation studies demonstrate that the proposed methods produce accurate p-values and false discovery rates. The proposed estimation methods are orders of magnitude faster than the exact solution. The proposed methods are implemented in an open source R package called jaccard (https://cran.r-project.org/package=jaccard). We introduce a suite of statistical methods for the Jaccard/Tanimoto similarity coefficient that enable straightforward incorporation of probabilistic measures in the analysis of species co-occurrences. Owing to their generality, the proposed methods and implementations are applicable to a wide range of binary data arising from genomics, biochemistry, and other areas of science.
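To make the quantities in the abstract concrete, the following is a minimal Python sketch, not the jaccard R package and not the authors' algorithms. It computes the Jaccard/Tanimoto coefficient as the ratio of intersection to union, subtracts a baseline that accounts for occurrence probabilities, and estimates a p-value. Two assumptions are ours: the baseline is taken as the ratio of the expected intersection to the expected union under independence, p_x p_y / (p_x + p_y - p_x p_y), and significance is estimated with a simple permutation test, swapped in for the exact, asymptotic, bootstrap, and measurement concentration methods named above. All function names are hypothetical.

```python
import random


def jaccard(x, y):
    """Ratio of intersection to union for two equal-length 0/1 vectors."""
    inter = sum(1 for a, b in zip(x, y) if a and b)
    union = sum(1 for a, b in zip(x, y) if a or b)
    return inter / union if union else 0.0


def expected_jaccard(x, y):
    """Assumed baseline: expected intersection over expected union
    when the two vectors are independent."""
    px = sum(x) / len(x)  # occurrence probability of x
    py = sum(y) / len(y)  # occurrence probability of y
    denom = px + py - px * py
    return px * py / denom if denom else 0.0


def centered_jaccard(x, y):
    """Observed coefficient minus the assumed independence baseline."""
    return jaccard(x, y) - expected_jaccard(x, y)


def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the centered coefficient.

    Shuffling y breaks any association with x while keeping both
    occurrence counts fixed. This is an illustrative stand-in, not
    one of the paper's methods.
    """
    rng = random.Random(seed)
    observed = abs(centered_jaccard(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if abs(centered_jaccard(x, y_perm)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Toy presence-absence vectors for two species across eight sites.
species_a = [1, 1, 1, 0, 0, 0, 0, 0]
species_b = [1, 1, 0, 1, 0, 0, 0, 0]

print(jaccard(species_a, species_b))           # 2 shared sites / 4 occupied sites
print(centered_jaccard(species_a, species_b))
print(permutation_pvalue(species_a, species_b))
```

With only eight sites the permutation p-value is coarse; the point of the sketch is the structure of the calculation, and the R package should be preferred for real analyses.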




