Optimal Statistical Rates for Decentralised Non-Parametric Regression with Linear Speed-Up

by Dominic Richards, et al.
University of Oxford

We analyse the learning performance of Distributed Gradient Descent in the context of multi-agent decentralised non-parametric regression with the square loss function when i.i.d. samples are assigned to agents. We show that if agents hold sufficiently many samples with respect to the network size, then Distributed Gradient Descent achieves optimal statistical rates with a number of iterations that scales, up to a threshold, with the inverse of the spectral gap of the gossip matrix divided by the number of samples owned by each agent raised to a problem-dependent power. The presence of the threshold comes from statistics. It encodes the existence of a "big data" regime where the number of required iterations does not depend on the network topology. In this regime, Distributed Gradient Descent achieves optimal statistical rates with the same order of iterations as gradient descent run with all the samples in the network. Provided the communication delay is sufficiently small, the distributed protocol yields a linear speed-up in runtime compared to the single-machine protocol. This is in contrast to decentralised optimisation algorithms that do not exploit statistics and only yield a linear speed-up in graphs where the spectral gap is bounded away from zero. Our results exploit the statistical concentration of quantities held by agents and shed new light on the interplay between statistics and communication in decentralised methods. Bounds are given in the standard non-parametric setting with source/capacity assumptions.
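To make the protocol concrete, the following is a minimal sketch of one common form of Distributed Gradient Descent for the square loss, in which each agent takes a local gradient step and then averages with its neighbours through the gossip matrix. It is an illustration under stated assumptions, not the paper's exact algorithm or setting: it uses a finite-dimensional linear model instead of the non-parametric setting, a ring network, and a constant step size. The names `ring_gossip_matrix`, `spectral_gap` and `distributed_gd` are hypothetical.

```python
import numpy as np


def ring_gossip_matrix(n_agents):
    """Symmetric, doubly stochastic gossip matrix for a ring network.

    Assumption for illustration: each agent puts weight 1/3 on itself
    and on each of its two ring neighbours.
    """
    P = np.zeros((n_agents, n_agents))
    for v in range(n_agents):
        for u in (v - 1, v, v + 1):
            P[v, u % n_agents] += 1.0 / 3.0
    return P


def spectral_gap(P):
    """One minus the second largest eigenvalue modulus of a symmetric P.

    Requires at least two agents.
    """
    moduli = np.sort(np.abs(np.linalg.eigvalsh(P)))
    return 1.0 - moduli[-2]


def distributed_gd(X, y, P, step, n_iters):
    """Run Distributed Gradient Descent on local least-squares objectives.

    X has shape (n_agents, m, d) and y has shape (n_agents, m): agent v
    holds m samples (X[v], y[v]). Returns one iterate per agent,
    shape (n_agents, d).
    """
    n_agents, m, d = X.shape
    W = np.zeros((n_agents, d))
    for _ in range(n_iters):
        # Local residuals X[v] @ W[v] - y[v] for every agent v.
        residual = np.einsum("vmd,vd->vm", X, W) - y
        # Local gradients of the empirical square loss (1/2m)||X w - y||^2.
        grads = np.einsum("vmd,vm->vd", X, residual) / m
        # Local gradient step followed by one round of gossip averaging.
        W = P @ (W - step * grads)
    return W


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_agents, m, d = 8, 50, 3
    w_true = np.array([1.0, -2.0, 0.5])
    X = rng.standard_normal((n_agents, m, d))
    y = X @ w_true + 0.01 * rng.standard_normal((n_agents, m))
    P = ring_gossip_matrix(n_agents)
    W = distributed_gd(X, y, P, step=0.1, n_iters=500)
    print("spectral gap:", spectral_gap(P))
    print("max deviation from w_true:", np.abs(W - w_true).max())
```

In this sketch the spectral gap of `P` controls how quickly information mixes across the ring, while the number of samples `m` per agent controls how close each local gradient already is to the network-wide one, which is the interplay between communication and statistics that the abstract describes.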


