# The Likelihood Ratio Test in High-Dimensional Logistic Regression Is Asymptotically a Rescaled Chi-Square

Logistic regression is used thousands of times a day to fit data, predict future outcomes, and assess the statistical significance of explanatory variables. When used for the purpose of statistical inference, logistic models produce p-values for the regression coefficients by using an approximation to the distribution of the likelihood-ratio test. Indeed, Wilks' theorem asserts that whenever we have a fixed number p of variables, twice the log-likelihood ratio (LLR) 2Λ is distributed as a χ^2_k variable in the limit of large sample sizes n; here, k is the number of variables being tested. In this paper, we prove that when p is not negligible compared to n, Wilks' theorem does not hold and the chi-square approximation is grossly incorrect; in fact, this approximation produces p-values that are far too small (under the null hypothesis). Assume that n and p grow large in such a way that p/n → κ for some constant κ < 1/2. We prove that for a class of logistic models, the LLR converges to a rescaled chi-square, namely, 2Λ → α(κ)χ^2_k in distribution, where the scaling factor α(κ) is greater than one as soon as the dimensionality ratio κ is positive. Hence, the LLR is larger than classically assumed. For instance, when κ = 0.3, α(κ) ≈ 1.5. In general, we show how to compute the scaling factor by solving a nonlinear system of two equations with two unknowns. Our mathematical arguments are involved and use techniques from approximate message passing theory, non-asymptotic random matrix theory, and convex geometry. We also complement our mathematical study by showing that the new limiting distribution is accurate for finite sample sizes. Finally, all the results from this paper extend to other regression models such as the probit regression model.
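The inflation of the LLR is easy to observe numerically. The sketch below (an illustration, not the authors' code) fits logistic MLEs by Newton-Raphson under the global null with dimensionality ratio κ = p/n = 0.3 and averages the LLR for testing a single coefficient over many replications; the design and constants (Gaussian covariates scaled by 1/√n, n = 400, 200 replications) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def fit_logistic(X, y, n_iter=50, tol=1e-8):
    """Maximum-likelihood logistic fit via Newton-Raphson (no intercept, no penalty)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))        # fitted probabilities
        grad = X.T @ (y - mu)                  # score vector
        if np.max(np.abs(grad)) < tol:
            break
        W = mu * (1.0 - mu)                    # Newton weights
        H = X.T @ (X * W[:, None])             # observed information matrix
        beta = beta + np.linalg.solve(H, grad)
    eta = X @ beta
    ll = np.sum(y * eta - np.logaddexp(0.0, eta))  # log-likelihood, overflow-safe
    return beta, ll

rng = np.random.default_rng(0)
n, p, reps = 400, 120, 200                     # kappa = p/n = 0.3
llrs = []
for _ in range(reps):
    X = rng.standard_normal((n, p)) / np.sqrt(n)   # illustrative Gaussian design
    y = (rng.random(n) < 0.5).astype(float)        # global null: all coefficients zero
    _, ll_full = fit_logistic(X, y)
    _, ll_null = fit_logistic(X[:, 1:], y)         # refit with the tested variable removed
    llrs.append(2.0 * (ll_full - ll_null))
mean_llr = float(np.mean(llrs))
print(f"average 2*LLR over {reps} null replications: {mean_llr:.2f}"
      f" (classical chi-square_1 predicts 1.00)")
```

If Wilks' theorem applied, the average of 2Λ would be close to 1 (the mean of χ^2_1); at κ = 0.3 the abstract's scaling factor α(κ) ≈ 1.5 instead predicts an average around 1.5, which is what the simulation exhibits.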
