MLP-Mixer: An all-MLP Architecture for Vision

05/04/2021
by Ilya Tolstikhin, et al.

Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them is necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e. "mixing" the per-location features), and one with MLPs applied across patches (i.e. "mixing" spatial information). When trained on large datasets, or with modern regularization schemes, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models. We hope that these results spark further research beyond the realms of well-established CNNs and Transformers.
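As a rough illustration of the two layer types the abstract describes, the sketch below implements a single Mixer block (token-mixing MLP across patches, then channel-mixing MLP per patch) and a small classifier in PyTorch. This is not the authors' official JAX/Flax release; the class and parameter names (MixerBlock, MLPMixer, tokens_mlp_dim, channels_mlp_dim) are invented for this example, and the default widths only loosely follow the Mixer-S/16 configuration reported in the paper.

```python
import torch
from torch import nn


class MixerBlock(nn.Module):
    """One Mixer layer: token-mixing MLP across patches, then channel-mixing MLP per patch."""

    def __init__(self, num_patches, hidden_dim, tokens_mlp_dim, channels_mlp_dim):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_dim)
        # Token-mixing MLP: acts on the patch (spatial) dimension, shared across channels.
        self.token_mlp = nn.Sequential(
            nn.Linear(num_patches, tokens_mlp_dim),
            nn.GELU(),
            nn.Linear(tokens_mlp_dim, num_patches),
        )
        self.norm2 = nn.LayerNorm(hidden_dim)
        # Channel-mixing MLP: acts on the feature (channel) dimension, shared across patches.
        self.channel_mlp = nn.Sequential(
            nn.Linear(hidden_dim, channels_mlp_dim),
            nn.GELU(),
            nn.Linear(channels_mlp_dim, hidden_dim),
        )

    def forward(self, x):  # x: (batch, num_patches, hidden_dim)
        # Mix information across patches ("mixing" spatial information).
        y = self.norm1(x).transpose(1, 2)           # (batch, hidden_dim, num_patches)
        x = x + self.token_mlp(y).transpose(1, 2)   # residual connection
        # Mix information across channels ("mixing" per-location features).
        x = x + self.channel_mlp(self.norm2(x))     # residual connection
        return x


class MLPMixer(nn.Module):
    """Patch embedding + stack of Mixer blocks + global average pooling + linear head."""

    def __init__(self, image_size=224, patch_size=16, hidden_dim=512,
                 num_blocks=8, tokens_mlp_dim=256, channels_mlp_dim=2048,
                 num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Per-patch linear embedding, implemented as a strided convolution.
        self.patch_embed = nn.Conv2d(3, hidden_dim, kernel_size=patch_size, stride=patch_size)
        self.blocks = nn.Sequential(*[
            MixerBlock(num_patches, hidden_dim, tokens_mlp_dim, channels_mlp_dim)
            for _ in range(num_blocks)
        ])
        self.norm = nn.LayerNorm(hidden_dim)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, images):  # images: (batch, 3, H, W)
        x = self.patch_embed(images)        # (batch, hidden_dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)    # (batch, num_patches, hidden_dim)
        x = self.blocks(x)
        x = self.norm(x).mean(dim=1)        # global average pooling over patches
        return self.head(x)
```

On a dummy batch, MLPMixer()(torch.randn(2, 3, 224, 224)) returns logits of shape (2, 1000). The point of the sketch is that, apart from the strided per-patch embedding, both mixing steps are plain MLPs with residual connections and layer normalization, with no attention involved.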

Related research

03/26/2021 · Understanding Robustness of Transformers for Image Classification
07/02/2023 · X-MLP: A Patch Embedding-Free MLP Architecture for Vision
10/22/2020 · An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
06/13/2022 · MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing
06/30/2023 · Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing
05/04/2022 · Sequencer: Deep LSTM for Image Classification
04/25/2023 · iMixer: hierarchical Hopfield network implies an invertible, implicit and iterative MLP-Mixer
