Correspondence Matters for Video Referring Expression Comprehension

by Meng Cao et al.

We investigate the problem of video Referring Expression Comprehension (REC), which aims to localize the referent object described by a sentence to visual regions in the video frames. Despite recent progress, existing methods suffer from two problems: 1) inconsistent localization results across video frames; 2) confusion between the referent and contextual objects. To this end, we propose a novel Dual Correspondence Network (dubbed DCNet) which explicitly enhances dense associations in both inter-frame and cross-modal manners. First, we build inter-frame correlations for all instances present in the frames. Specifically, we compute inter-frame patch-wise cosine similarity to estimate the dense alignment and then perform inter-frame contrastive learning to map corresponding patches close in feature space. Second, we build a fine-grained patch-word alignment to associate each patch with particular words. Since detailed annotations of this kind are unavailable, we likewise predict the patch-word correspondence through cosine similarity. Extensive experiments demonstrate that DCNet achieves state-of-the-art performance on both video and image REC benchmarks. Furthermore, we conduct comprehensive ablation studies and thorough analyses to explore the optimal model designs. Notably, our inter-frame and cross-modal contrastive losses are plug-and-play and applicable to any video REC architecture; for example, built on top of Co-grounding, they boost performance by 1.48 points.
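The abstract describes the core mechanism only at a high level: dense patch-wise cosine similarity to estimate correspondence, followed by a contrastive loss that pulls matched features together. As a rough illustration of that shared idea (not DCNet's actual implementation; all function names and the InfoNCE-style loss form here are assumptions for the sketch), consider the following NumPy snippet. The same similarity computation would apply to the patch-word branch by substituting word embeddings for the second frame's patch features.

```python
import numpy as np

def cosine_similarity_matrix(a, b, eps=1e-8):
    """Pairwise cosine similarity between rows of a (N, D) and b (M, D)."""
    a_n = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return a_n @ b_n.T

def inter_frame_alignment(patches_t, patches_t1):
    """Dense correspondence estimate: for each patch in frame t,
    the index of its most similar patch in frame t+1."""
    sim = cosine_similarity_matrix(patches_t, patches_t1)
    return sim.argmax(axis=1)

def contrastive_loss(patches_t, patches_t1, match, temperature=0.07):
    """InfoNCE-style loss: matched patches act as positives, all other
    patches in frame t+1 act as negatives."""
    sim = cosine_similarity_matrix(patches_t, patches_t1) / temperature
    logits = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(match)), match].mean()
```

In this simplified setting, features that persist across frames (e.g. a permuted copy of the patch set with small noise) are matched back to their originals by `inter_frame_alignment`, and minimizing `contrastive_loss` would sharpen exactly those matches.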


Video Referring Expression Comprehension via Transformer with Content-aware Query

Video Referring Expression Comprehension (REC) aims to localize a target...

HunYuan_tvr for Text-Video Retrieval

Text-Video Retrieval plays an important role in multi-modal understandin...

Contrastive Video-Language Learning with Fine-grained Frame Sampling

Despite recent progress in video and language representation learning, t...

LocVTP: Video-Text Pre-training for Temporal Localization

Video-Text Pre-training (VTP) aims to learn transferable representations...

Learning Trajectory-Word Alignments for Video-Language Tasks

Aligning objects with words plays a critical role in Image-Language BERT...

G2L: Semantically Aligned and Uniform Video Grounding via Geodesic and Game Theory

The recent video grounding works attempt to introduce vanilla contrastiv...

Learning Shadow Correspondence for Video Shadow Detection

Video shadow detection aims to generate consistent shadow predictions am...
