Do We Really Need Explicit Position Encodings for Vision Transformers?

02/22/2021
by Xiangxiang Chu, et al.

Almost all vision transformers, such as ViT and DeiT, rely on predefined positional encodings to incorporate the order of each input token. These encodings are often implemented as learnable fixed-dimension vectors or sinusoidal functions of different frequencies, neither of which can accommodate variable-length input sequences. This inevitably limits a wider application of transformers in vision, where many tasks require changing the input size on the fly. In this paper, we propose a conditional positional encoding scheme that is conditioned on the local neighborhood of each input token. It is effortlessly implemented as what we call a Position Encoding Generator (PEG), which can be seamlessly incorporated into the current transformer framework. Our new model with PEG, named Conditional Position encoding Visual Transformer (CPVT), can naturally process input sequences of arbitrary length. We demonstrate that CPVT produces visually similar attention maps and even better performance than models with predefined positional encodings. We obtain state-of-the-art results on the ImageNet classification task compared with vision transformers to date. Our code will be made available at https://github.com/Meituan-AutoML/CPVT .
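The abstract does not spell out how the PEG conditions on a token's local neighborhood. One natural realization, consistent with the locality-based description above, is a depthwise 2D convolution applied to the patch tokens reshaped back to their spatial grid and added residually. The sketch below is an illustrative assumption in PyTorch, not the authors' reference code; the class name PEG, the 3x3 kernel, and the residual connection are choices made here for clarity.

```python
import torch
import torch.nn as nn

class PEG(nn.Module):
    """Position Encoding Generator (sketch): produces a positional signal
    conditioned on each token's local neighborhood by running a depthwise
    3x3 convolution over the tokens reshaped to their 2D layout.
    Illustrative assumption based on the abstract, not the released code."""
    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        # Depthwise conv (groups == channels); zero padding keeps H x W,
        # so the module works for any input resolution / sequence length.
        self.proj = nn.Conv2d(dim, dim, kernel_size,
                              padding=kernel_size // 2, groups=dim)

    def forward(self, tokens: torch.Tensor, height: int, width: int) -> torch.Tensor:
        # tokens: (batch, N, dim) patch tokens (class token excluded)
        b, n, c = tokens.shape
        assert n == height * width, "token count must match the 2D grid"
        feat = tokens.transpose(1, 2).reshape(b, c, height, width)
        # Residual: output = tokens + neighborhood-conditioned positional signal
        feat = self.proj(feat) + feat
        return feat.flatten(2).transpose(1, 2)

# Usage: insert after an encoder block; no fixed-length embedding table needed.
peg = PEG(dim=192)
x = torch.randn(2, 14 * 14, 192)    # e.g. 224x224 image with 16x16 patches
out = peg(x, height=14, width=14)   # same shape: (2, 196, 192)
```

Because the positional signal is generated from the tokens themselves rather than looked up in a fixed-size table, the same module handles sequences of arbitrary length, which is the property the abstract emphasizes.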


Related research

04/22/2021
Token Labeling: Training a 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet
This paper provides a strong baseline for vision transformers on the Ima...

05/31/2021
MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens
Transformers have offered a new methodology of designing neural networks...

11/22/2021
MetaFormer is Actually What You Need for Vision
Transformers have shown great potential in computer vision tasks. A comm...

12/20/2022
A Length-Extrapolatable Transformer
Position modeling plays a critical role in Transformers. In this paper, ...

07/13/2023
Transformer-based end-to-end classification of variable-length volumetric data
The automatic classification of 3D medical data is memory-intensive. Als...

04/18/2021
Demystifying the Better Performance of Position Encoding Variants for Transformer
Transformers are state of the art models in NLP that map a given input s...

09/15/2022
Can We Solve 3D Vision Tasks Starting from A 2D Vision Transformer?
Vision Transformers (ViTs) have proven to be effective in solving 2D im...
