One Step of Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention

by Arvind Mahankali et al.

Recent works have empirically analyzed in-context learning and shown that transformers trained on synthetic linear regression tasks can learn to implement ridge regression, the Bayes-optimal predictor, given sufficient capacity [Akyürek et al., 2023], while one-layer transformers with linear self-attention and no MLP layer learn to implement one step of gradient descent (GD) on a least-squares linear regression objective [von Oswald et al., 2022]. However, the theory behind these observations remains poorly understood. We theoretically study transformers with a single layer of linear self-attention, trained on synthetic noisy linear regression data. First, we prove that when the covariates are drawn from a standard Gaussian distribution, the one-layer transformer that minimizes the pre-training loss implements a single step of GD on the least-squares linear regression objective. Second, we find that changing the distribution of the covariates and the weight vector to a non-isotropic Gaussian distribution strongly affects the learned algorithm: the global minimizer of the pre-training loss now implements a single step of pre-conditioned GD. By contrast, changing only the distribution of the responses has little effect on the learned algorithm: even when the response comes from a more general family of nonlinear functions, the global minimizer of the pre-training loss still implements a single step of GD on a least-squares linear regression objective.
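A minimal numerical sketch of the equivalence the abstract describes, with illustrative variable names and an arbitrary step size: one GD step from a zero initialization on the least-squares objective produces the same prediction on a query point as a linear-attention-style read-out over the context examples.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 20
w_star = rng.normal(size=d)                 # ground-truth weight vector
X = rng.normal(size=(n, d))                 # context covariates x_1..x_n (isotropic Gaussian)
y = X @ w_star + 0.1 * rng.normal(size=n)   # noisy responses
x_q = rng.normal(size=d)                    # query covariate
eta = 0.5                                   # step size (illustrative)

# One step of GD from w0 = 0 on L(w) = 1/(2n) * sum_i (w @ x_i - y_i)^2:
# the gradient at zero is -(1/n) * X.T @ y, so w1 = (eta/n) * X.T @ y.
w1 = eta * (X.T @ y) / n
pred_gd = w1 @ x_q

# The same prediction written as a linear-attention read-out:
# values y_i * x_i, aggregated with unnormalized inner-product scores x_i @ x_q.
pred_attn = (eta / n) * y @ (X @ x_q)

print(pred_gd, pred_attn)
```

Both expressions reduce to (eta/n) * sum_i y_i * (x_i @ x_q), which is why a single layer of linear self-attention can represent exactly one GD step.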




Related papers:

- Transformers learn in-context by gradient descent [von Oswald et al., 2022]
- Transformers learn to implement preconditioned gradient descent for in-context learning
- Trained Transformers Learn Linear Models In-Context
- Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection
- In-Context Learning through the Bayesian Prism
- What learning algorithm is in-context learning? Investigations with linear models [Akyürek et al., 2023]
- A Closer Look at In-Context Learning under Distribution Shifts
