CausalLM is not optimal for in-context learning

08/14/2023
by Nan Ding, et al.

Recent empirical evidence indicates that transformer-based in-context learning performs better with a prefix language model (prefixLM), in which all in-context samples can attend to each other, than with a causal language model (causalLM), whose auto-regressive attention prevents in-context samples from attending to future samples. While this result is intuitive, it is not understood from a theoretical perspective. In this paper we take a theoretical approach and analyze the convergence behavior of prefixLM and causalLM under a certain parameter construction. Our analysis shows that both LM types converge to their stationary points at a linear rate, but that while prefixLM converges to the optimal solution of linear regression, the convergence dynamics of causalLM follow those of an online gradient descent algorithm, which is not guaranteed to be optimal even as the number of samples grows without bound. We supplement our theoretical claims with experiments on synthetic and real tasks using several types of transformers. Our experiments verify that causalLM consistently underperforms prefixLM in all settings.
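To make the abstract's two claims concrete, here is a minimal NumPy sketch, not the paper's parameter construction. The first part contrasts the two attention patterns as 0/1 masks; the second runs a single pass of online gradient descent over the in-context samples of a linear regression task (the dynamics the analysis attributes to causalLM) and compares it against the least-squares optimum that prefixLM is shown to converge to. The function names (causal_mask, prefix_mask, online_gd), the step size, and the task sizes are illustrative assumptions.

```python
import numpy as np

def causal_mask(t: int) -> np.ndarray:
    """1 where attention is allowed: each position sees itself and the past."""
    return np.tril(np.ones((t, t)))

def prefix_mask(t: int, p: int) -> np.ndarray:
    """Causal mask, except the first p (in-context) positions fully attend to each other."""
    m = causal_mask(t)
    m[:p, :p] = 1.0
    return m

rng = np.random.default_rng(0)

# Synthetic in-context linear regression task: y = X @ w_true + noise.
n, d = 32, 4
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n)

# Least-squares optimum: the solution prefixLM is shown to converge to.
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)

# Single-pass online gradient descent over the samples in order: the
# dynamics attributed to causalLM, where earlier positions cannot
# revisit later samples.
def online_gd(X, y, lr=0.1):
    w = np.zeros(X.shape[1])
    for x_i, y_i in zip(X, y):
        w -= lr * (x_i @ w - y_i) * x_i  # gradient of 0.5 * (x_i @ w - y_i)**2
    return w

w_ogd = online_gd(X, y)
print("least-squares (prefixLM-style) error:", np.linalg.norm(w_star - w_true))
print("online GD (causalLM-style) error:   ", np.linalg.norm(w_ogd - w_true))
```

With one pass and a fixed step size, the online iterate generally lands measurably farther from w_true than the least-squares solution, mirroring the suboptimality gap the analysis predicts; the size of the gap in this sketch depends on the sample order and step size.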


Related research

- The Closeness of In-Context Learning and Weight Shifting for Softmax Regression (04/26/2023)
- Transformers learn to implement preconditioned gradient descent for in-context learning (06/01/2023)
- Transformers learn in-context by gradient descent (12/15/2022)
- Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers (12/20/2022)
- Linear regression with partially mismatched data: local search with theoretical guarantees (06/03/2021)
- Transformers are Universal Predictors (07/15/2023)
- How does representation impact in-context learning: An exploration on a synthetic task (09/12/2023)
