Birth of a Transformer: A Memory Viewpoint

06/01/2023
by Alberto Bietti, et al.

Large language models based on transformers have achieved great empirical successes. However, as they are deployed more widely, there is a growing need to better understand their internal mechanisms in order to make them more reliable. These models appear to store vast amounts of knowledge from their training data, and to adapt quickly to new information provided in their context or prompt. We study how transformers balance these two types of knowledge by considering a synthetic setup where tokens are generated from either global or context-specific bigram distributions. By a careful empirical analysis of the training process on a simplified two-layer transformer, we illustrate the fast learning of global bigrams and the slower development of an "induction head" mechanism for the in-context bigrams. We highlight the role of weight matrices as associative memories, provide theoretical insights on how gradients enable their learning during training, and study the role of data-distributional properties.
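To make the synthetic setup concrete, here is a minimal sketch (Python/NumPy, with hypothetical vocabulary and sequence sizes, since the abstract does not give the paper's exact parameters). Most transitions follow a single global bigram model shared across all sequences, while a few "trigger" tokens are followed by a continuation resampled fresh for each sequence, so predicting them requires copying from the current context rather than recalling a global statistic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration; the abstract does not specify the paper's values.
vocab_size = 64     # number of distinct tokens
seq_len = 256       # tokens per sequence
num_triggers = 5    # tokens whose continuation is resampled for every sequence

# Global bigram transition matrix, shared across all sequences.
global_bigram = rng.dirichlet(np.ones(vocab_size), size=vocab_size)
trigger_tokens = rng.choice(vocab_size, size=num_triggers, replace=False)

def sample_sequence():
    """Sample one training sequence.

    Ordinary tokens follow the global bigram distribution; each trigger token
    is followed by a continuation drawn fresh for this sequence, so the model
    can only predict it by reading the earlier occurrence in its context.
    """
    context_map = {int(t): int(rng.integers(vocab_size)) for t in trigger_tokens}
    seq = [int(rng.integers(vocab_size))]
    for _ in range(seq_len - 1):
        prev = seq[-1]
        if prev in context_map:
            seq.append(context_map[prev])                                   # in-context bigram
        else:
            seq.append(int(rng.choice(vocab_size, p=global_bigram[prev])))  # global bigram
    return np.array(seq)

print(sample_sequence()[:20])
```

In the paper's memory viewpoint, learning these associations amounts to storing input-output pairs in the weight matrices, roughly as associative memories built from sums of outer products of embeddings; the abstract reports that the global bigrams are learned fast, while the induction-head circuit that handles the in-context bigrams (matching a trigger token to its earlier occurrence and copying what followed) develops more slowly during training.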
