Dual Cross-Attention Learning for Fine-Grained Visual Categorization and Object Re-Identification

by Haowei Zhu, et al.

Self-attention mechanisms have recently shown impressive performance in various NLP and CV tasks, where they help capture sequential characteristics and derive global information. In this work, we explore how to extend self-attention modules to better learn subtle feature embeddings for recognizing fine-grained objects, e.g., different bird species or person identities. To this end, we propose a dual cross-attention learning (DCAL) algorithm that works alongside self-attention learning. First, we propose global-local cross-attention (GLCA) to enhance the interactions between global images and local high-response regions, which helps reinforce the spatial-wise discriminative clues for recognition. Second, we propose pair-wise cross-attention (PWCA) to establish interactions between image pairs. PWCA regularizes the attention learning of an image by treating another image as a distractor, and it is removed during inference. We observe that DCAL can reduce misleading attentions and diffuse the attention response to discover more complementary parts for recognition. We conduct extensive evaluations on fine-grained visual categorization and object re-identification. Experiments demonstrate that DCAL performs on par with state-of-the-art methods and consistently improves multiple self-attention baselines, e.g., surpassing DeiT-Tiny and ViT-Base by 2.8
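The two cross-attention ideas described above can be illustrated with a small NumPy sketch. This is only a rough illustration under stated assumptions, not the authors' implementation: the abstract does not say how high-response regions are chosen or how the image pair is combined, so the sketch assumes a precomputed per-token `response` score with top-`top_r` selection for GLCA, and assumes that PWCA lets the queries of one image attend over the concatenated keys and values of both images. All function and variable names here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Standard scaled dot-product attention; returns (output, weights).
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights

def global_local_cross_attention(q, k, v, response, top_r):
    # ASSUMPTION: "local high-response regions" are modeled as the top_r
    # query tokens ranked by a given per-token response score.
    idx = np.argsort(response)[::-1][:top_r]
    out, _ = attention(q[idx], k, v)  # local queries attend to all tokens
    return out, idx

def pair_wise_cross_attention(q1, k1, v1, k2, v2):
    # ASSUMPTION: the second image acts as a distractor by adding its
    # keys/values to the attention pool of the first image's queries.
    k = np.concatenate([k1, k2], axis=0)
    v = np.concatenate([v1, v2], axis=0)
    return attention(q1, k, v)

# Toy data: two "images" with n tokens of dimension d each.
rng = np.random.default_rng(0)
n, d = 6, 8
q1, k1, v1 = (rng.standard_normal((n, d)) for _ in range(3))
k2, v2 = (rng.standard_normal((n, d)) for _ in range(2))
response = rng.random(n)  # hypothetical per-token response scores

glca_out, selected = global_local_cross_attention(q1, k1, v1, response, top_r=2)
pwca_out, pwca_w = pair_wise_cross_attention(q1, k1, v1, k2, v2)
```

In this sketch, each row of `pwca_w` is a distribution over the tokens of both images, so attention mass spent on the distractor is mass taken away from the target image. Dropping the PWCA branch at inference, as the abstract states, would correspond to simply not calling `pair_wise_cross_attention`.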




SIM-Trans: Structure Information Modeling Transformer for Fine-grained Visual Categorization

Fine-grained visual categorization (FGVC) aims at recognizing objects fr...

Drawing Attention to Detail: Pose Alignment through Self-Attention for Fine-Grained Object Classification

Intra-class variations in the open world lead to various challenges in c...

SR-GNN: Spatial Relation-aware Graph Neural Network for Fine-Grained Image Categorization

Over the past few years, a significant progress has been made in deep co...

GlobalMind: Global Multi-head Interactive Self-attention Network for Hyperspectral Change Detection

High spectral resolution imagery of the Earth's surface enables users to...

Channel Interaction Networks for Fine-Grained Image Categorization

Fine-grained image categorization is challenging due to the subtle inter...

Cross-view Geo-localization with Evolving Transformer

In this work, we address the problem of cross-view geo-localization, whi...

Discover Your Social Identity from What You Tweet: a Content Based Approach

An identity denotes the role an individual or a group plays in highly di...
