Attention Enables Zero Approximation Error

02/24/2022
by Zhiying Fang, et al.

Deep learning models have been widely applied in many aspects of daily life, and many variant architectures built on deep learning structures have achieved even better performance. Attention-based architectures have become almost ubiquitous in deep learning; in particular, the transformer model has now surpassed convolutional neural networks on image classification tasks and become the most widely used tool. However, the theoretical properties of attention-based models are seldom studied. In this work, we show that, with suitable adaptations, a single-head self-attention transformer with a fixed number of transformer encoder blocks and free parameters can generate any desired polynomial of the input with no error, where the number of encoder blocks equals the degree of the target polynomial. More strikingly, the encoder blocks in this construction do not need to be trained. As a direct consequence, we show that the single-head self-attention transformer with an increasing number of free parameters is universal. These theoretical results help explain the outstanding performance of the transformer model and may shed light on future modifications in real applications. We also provide experiments to verify our theoretical results.
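To make the architecture in the abstract concrete, here is a minimal sketch of a single-head self-attention transformer encoder stack whose depth matches the degree of a target polynomial. The class names, dimensions, and the random (frozen) weights are illustrative assumptions; the explicit weight construction that yields zero approximation error, and the exact "suitable adaptations" to the block, are given in the paper and not reproduced here.

```python
# Structural sketch only: single-head self-attention encoder blocks,
# one block per degree of the target polynomial, with frozen weights
# (per the abstract, the encoder blocks need no training).
import torch
import torch.nn as nn


class SingleHeadEncoderBlock(nn.Module):
    """One transformer encoder block with single-head self-attention."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        # Single-head attention: num_heads=1.
        self.attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention sub-layer with a residual connection.
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = x + attn_out
        # Feed-forward sub-layer with a residual connection.
        return x + self.ff(x)


class PolynomialTransformer(nn.Module):
    """Encoder stack whose depth equals the target polynomial degree."""

    def __init__(self, d_model: int, d_ff: int, degree: int):
        super().__init__()
        self.blocks = nn.ModuleList(
            SingleHeadEncoderBlock(d_model, d_ff) for _ in range(degree)
        )
        # Freeze the encoder blocks; in the paper's construction their
        # weights are prescribed rather than learned.
        for p in self.blocks.parameters():
            p.requires_grad_(False)
        self.readout = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            x = block(x)
        return self.readout(x)


# Usage: a degree-3 polynomial target corresponds to exactly 3 encoder blocks.
model = PolynomialTransformer(d_model=8, d_ff=32, degree=3)
tokens = torch.randn(2, 5, 8)  # (batch, sequence length, model dimension)
print(model(tokens).shape)     # torch.Size([2, 5, 1])
```

Note that this sketch omits layer normalization and positional encoding; whether and how those components appear among the paper's "suitable adaptations" is specified in the full text.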


