HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval

by   Jie Guo, et al.

Image-text retrieval (ITR) is a challenging task in the field of multimodal information processing due to the semantic gap between different modalities. In recent years, researchers have made great progress in exploring the accurate alignment between image and text. However, existing works mainly focus on the fine-grained alignment between image regions and sentence fragments, which ignores the guiding significance of context background information. Actually, integrating the local fine-grained information and global context background information can provide more semantic clues for retrieval. In this paper, we propose a novel Hierarchical Graph Alignment Network (HGAN) for image-text retrieval. First, to capture the comprehensive multimodal features, we construct the feature graphs for the image and text modality respectively. Then, a multi-granularity shared space is established with a designed Multi-granularity Feature Aggregation and Rearrangement (MFAR) module, which enhances the semantic corresponding relations between the local and global information, and obtains more accurate feature representations for the image and text modalities. Finally, the ultimate image and text features are further refined through three-level similarity functions to achieve the hierarchical alignment. To justify the proposed model, we perform extensive experiments on MS-COCO and Flickr30K datasets. Experimental results show that the proposed HGAN outperforms the state-of-the-art methods on both datasets, which demonstrates the effectiveness and superiority of our model.


page 1

page 4

page 10

page 11


Step-Wise Hierarchical Alignment Network for Image-Text Matching

Image-text matching plays a central role in bridging the semantic gap be...

Hierarchical Matching and Reasoning for Multi-Query Image Retrieval

As a promising field, Multi-Query Image Retrieval (MQIR) aims at searchi...

Video-Text Retrieval by Supervised Multi-Space Multi-Grained Alignment

While recent progress in video-text retrieval has been advanced by the e...

Learning Dual Semantic Relations with Graph Attention for Image-Text Matching

Image-Text Matching is one major task in cross-modal information process...

Boosting Text-to-Image Diffusion Models with Fine-Grained Semantic Rewards

Recent advances in text-to-image diffusion models have achieved remarkab...

Reusing Keywords for Fine-grained Representations and Matchings

Question retrieval aims to find the semantically equivalent questions fo...

Focus Your Attention: A Bidirectional Focal Attention Network for Image-Text Matching

Learning semantic correspondence between image and text is significant a...

Please sign up or login with your details

Forgot password? Click here to reset