Improving Description-based Person Re-identification by Multi-granularity Image-text Alignments

by Kai Niu, et al.

Description-based person re-identification (Re-id) is an important task in video surveillance that requires discriminative cross-modal representations to distinguish different people. Modality heterogeneity makes it difficult to measure the similarity between images and descriptions directly (the cross-modal problem). In addition, all samples belong to a single category (the fine-grained problem), which makes this task even harder than conventional image-description matching. In this paper, we propose a Multi-granularity Image-text Alignments (MIA) model to alleviate the cross-modal fine-grained problem for better similarity evaluation in description-based person Re-id. Specifically, alignments at three granularities, i.e., global-global, global-local, and local-local, are carried out hierarchically. First, the global-global alignment in the Global Contrast (GC) module matches the global contexts of images and descriptions. Second, the global-local alignment in the Relation-guided Global-local Alignment (RGA) module exploits the potential relations between local components and global contexts to adaptively highlight the distinguishable components while suppressing the irrelevant ones. Third, the local-local alignment in the Bi-directional Fine-grained Matching (BFM) module matches visual human parts with noun phrases. The whole network, which combines the three granularities, can be trained end-to-end without complex pre-processing. Because training all granularities jointly is difficult, we propose a step training strategy that trains them step by step. Extensive experiments and analysis show that our method achieves state-of-the-art performance on the CUHK-PEDES dataset and outperforms previous methods by a significant margin.
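To make the three granularities concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify how the GC, RGA, and BFM modules compute their scores, so everything here is an assumption chosen for illustration: cosine similarity as the base measure, softmax attention (with a hypothetical `temperature` parameter) to weight local components, and an unweighted sum of the three scores. All function names are hypothetical, and the feature vectors are assumed to already live in a shared embedding space of equal dimension.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def softmax(scores, temperature):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, components, temperature=0.1):
    """Weighted sum of components, weighted by their relevance to the query.

    Assumption: relevance is cosine similarity passed through a softmax.
    Relevant components are emphasized and unrelated ones are down-weighted.
    """
    weights = softmax([cosine(query, c) for c in components], temperature)
    return [
        sum(w * c[i] for w, c in zip(weights, components))
        for i in range(len(query))
    ]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def global_global(img_global, txt_global):
    """Global-global: compare the two global context vectors directly."""
    return cosine(img_global, txt_global)


def global_local(img_global, img_parts, txt_global, txt_phrases):
    """Global-local: each global context attends over the other modality's parts."""
    txt_to_parts = cosine(txt_global, attend(txt_global, img_parts))
    img_to_phrases = cosine(img_global, attend(img_global, txt_phrases))
    return 0.5 * (txt_to_parts + img_to_phrases)


def local_local(img_parts, txt_phrases):
    """Local-local: match parts to phrases and phrases to parts (both directions)."""
    part_to_phrase = mean(cosine(p, attend(p, txt_phrases)) for p in img_parts)
    phrase_to_part = mean(cosine(t, attend(t, img_parts)) for t in txt_phrases)
    return 0.5 * (part_to_phrase + phrase_to_part)


def total_similarity(img_global, img_parts, txt_global, txt_phrases):
    """Unweighted sum of the three granularities (an illustrative assumption)."""
    return (
        global_global(img_global, txt_global)
        + global_local(img_global, img_parts, txt_global, txt_phrases)
        + local_local(img_parts, txt_phrases)
    )
```

With toy two-dimensional features, a description whose global vector and noun phrases line up with the image's global vector and parts scores higher under `total_similarity` than one that matches only a single part, which is the behavior the hierarchical alignment is meant to capture.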



Image-Specific Information Suppression and Implicit Local Alignment for Text-based Person Search

Text-based person search is a challenging task that aims to search pedes...

Step-Wise Hierarchical Alignment Network for Image-Text Matching

Image-text matching plays a central role in bridging the semantic gap be...

Cross-modal Local Shortest Path and Global Enhancement for Visible-Thermal Person Re-Identification

In addition to considering the recognition difficulty caused by human po...

AXM-Net: Cross-Modal Context Sharing Attention Network for Person Re-ID

Cross-modal person re-identification (Re-ID) is critical for modern vide...

Cascade Attention Network for Person Search: Both Image and Text-Image Similarity Selection

Person search with natural language aims to retrieve the corresponding p...

Dual-path CNN with Max Gated block for Text-Based Person Re-identification

Text-based person re-identification (Re-id) is an important task in video...

Re-mine, Learn and Reason: Exploring the Cross-modal Semantic Correlations for Language-guided HOI detection

Human-Object Interaction (HOI) detection is a challenging computer visio...
