EVA-02: A Visual Representation for Neon Genesis

03/20/2023
by Yuxin Fang et al.

We launch EVA-02, a next-generation Transformer-based visual representation pre-trained to reconstruct strong and robust language-aligned vision features via masked image modeling. With an updated plain Transformer architecture and extensive pre-training from an openly accessible giant CLIP vision encoder, EVA-02 outperforms prior state-of-the-art approaches across representative vision tasks while using significantly fewer parameters and less compute. Notably, using exclusively publicly accessible training data, EVA-02 with only 304M parameters achieves 90.0% fine-tuning top-1 accuracy on the ImageNet-1K val set. Additionally, our EVA-02-CLIP reaches up to 80.4% zero-shot top-1 accuracy on ImageNet-1K, outperforming the previous largest and best open-sourced CLIP with only ~1/6 of the parameters and ~1/6 of the image-text training data. We offer four EVA-02 variants in various model sizes, ranging from 6M to 304M parameters, all with impressive performance. To facilitate open access and open research, we release the complete suite of EVA-02 to the community at https://github.com/baaivision/EVA/tree/master/EVA-02.
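The core idea above, masked image modeling with a CLIP teacher, can be sketched as follows. This is a minimal NumPy toy, not EVA-02's actual recipe: random linear projections stand in for the student encoder and the frozen CLIP vision encoder, and the patch count, feature dimension, 40% mask ratio, and cosine regression loss are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions): 196 patches (14x14 grid), 768-dim features.
num_patches, dim = 196, 768
mask_ratio = 0.4  # assumed mask ratio for illustration

# Stand-ins for the frozen CLIP teacher and the trainable student:
# fixed random linear projections of the patch embeddings.
patches = rng.standard_normal((num_patches, dim))
W_teacher = rng.standard_normal((dim, dim)) / np.sqrt(dim)
W_student = rng.standard_normal((dim, dim)) / np.sqrt(dim)

# 1. Sample which patch positions are masked out for the student.
num_masked = int(mask_ratio * num_patches)
masked_idx = rng.choice(num_patches, size=num_masked, replace=False)

# 2. The teacher produces language-aligned target features for every patch.
targets = patches @ W_teacher

# 3. The student predicts features at all positions (in the real model it
#    only sees visible patches and infers the masked ones from context).
preds = patches @ W_student

# 4. Regress student predictions onto teacher features at the masked
#    positions, here with a cosine-similarity loss (1 - mean cosine).
def cosine_loss(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return 1.0 - (a * b).sum(-1).mean()

loss = cosine_loss(preds[masked_idx], targets[masked_idx])
print(f"masked patches: {num_masked}, loss: {loss:.3f}")
```

Only the masked positions contribute to the loss; the student can solve the task only by inferring the teacher's language-aligned features from surrounding visible context, which is what makes the learned representation transferable.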


Related research

03/27/2023 · EVA-CLIP: Improved Training Techniques for CLIP at Scale
07/13/2021 · How Much Can CLIP Benefit Vision-and-Language Tasks?
07/22/2023 · Sparse then Prune: Toward Efficient Vision Transformers
12/12/2022 · CLIP Itself is a Strong Fine-tuner: Achieving 85.7 Accuracy with ViT-B and ViT-L on ImageNet
11/18/2021 · SimMIM: A Simple Framework for Masked Image Modeling
11/14/2022 · EVA: Exploring the Limits of Masked Visual Representation Learning at Scale
07/27/2022 · VICTOR: Visual Incompatibility Detection with Transformers and Fashion-specific contrastive pre-training
