ViT-FOD: A Vision Transformer based Fine-grained Object Discriminator

03/24/2022
by   Zi-Chao Zhang, et al.
0

Recently, several Vision Transformer (ViT) based methods have been proposed for Fine-Grained Visual Classification (FGVC).These methods significantly surpass existing CNN-based ones, demonstrating the effectiveness of ViT in FGVC tasks.However, there are some limitations when applying ViT directly to FGVC.First, ViT needs to split images into patches and calculate the attention of every pair, which may result in heavy redundant calculation and unsatisfying performance when handling fine-grained images with complex background and small objects.Second, a standard ViT only utilizes the class token in the final layer for classification, which is not enough to extract comprehensive fine-grained information. To address these issues, we propose a novel ViT based fine-grained object discriminator for FGVC tasks, ViT-FOD for short. Specifically, besides a ViT backbone, it further introduces three novel components, i.e, Attention Patch Combination (APC), Critical Regions Filter (CRF), and Complementary Tokens Integration (CTI). Thereinto, APC pieces informative patches from two images to generate a new image so that the redundant calculation can be reduced. CRF emphasizes tokens corresponding to discriminative regions to generate a new class token for subtle feature learning. To extract comprehensive information, CTI integrates complementary information captured by class tokens in different ViT layers. We conduct comprehensive experiments on widely used datasets and the results demonstrate that ViT-FOD is able to achieve state-of-the-art performance.

READ FULL TEXT

page 1

page 3

page 7

page 8

research
03/14/2021

TransFG: A Transformer Architecture for Fine-grained Recognition

Fine-grained visual classification (FGVC) which aims at recognizing obje...
research
04/21/2022

R2-Trans:Fine-Grained Visual Categorization with Redundancy Reduction

Fine-grained visual categorization (FGVC) aims to discriminate similar s...
research
09/07/2021

FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting

Transformer, as a strong and flexible architecture for modelling long-ra...
research
07/14/2021

Transformer with Peak Suppression and Knowledge Guidance for Fine-grained Image Recognition

Fine-grained image recognition is challenging because discriminative clu...
research
03/08/2022

Coarse-to-Fine Vision Transformer

Vision Transformers (ViT) have made many breakthroughs in computer visio...
research
08/10/2023

Learning Gabor Texture Features for Fine-Grained Recognition

Extracting and using class-discriminative features is critical for fine-...
research
10/07/2022

Breaking BERT: Evaluating and Optimizing Sparsified Attention

Transformers allow attention between all pairs of tokens, but there is r...

Please sign up or login with your details

Forgot password? Click here to reset