Memory Enhanced Embedding Learning for Cross-Modal Video-Text Retrieval

by Rui Zhao, et al.

Cross-modal video-text retrieval, a challenging task in the field of vision and language, aims at retrieving the corresponding instance given a sample from either modality. Existing approaches for this task focus on designing encoding models trained with a hard negative ranking loss, leaving two key problems unaddressed. First, in the training stage, only a mini-batch of instance pairs is available in each iteration. Hard negatives are therefore mined locally inside a mini-batch, ignoring the global negative samples across the dataset. Second, a video usually has many text descriptions, and each text describes only certain local features of the video. Previous works did not consider fusing the multiple texts corresponding to a video during training. In this paper, to solve these two problems, we propose a novel memory enhanced embedding learning (MEEL) method for video-text retrieval. Specifically, we construct two kinds of memory banks: a cross-modal memory module and a text center memory module. The cross-modal memory module records the instance embeddings of the entire dataset for global negative mining. To keep the embeddings in the memory bank from evolving too quickly during training, we use a momentum encoder that updates the features with a moving-average strategy. The text center memory module records the center information of the multiple textual instances corresponding to a video, and aims at bridging these textual instances together. Extensive experimental results on two challenging benchmarks, MSR-VTT and VATEX, demonstrate the effectiveness of the proposed method.
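The two memory modules described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no formulas, so the class names `MemoryBank` and `TextCenterMemory`, the helper `ema_update`, the momentum value of 0.9, the use of dot-product similarity on unit-length vectors, and the choice to initialise a text center from the first caption seen are all assumptions made for illustration. The abstract also does not say whether the moving average is applied to encoder parameters or directly to stored features; `ema_update` is written element-wise so it could stand in for either.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def ema_update(old, new, momentum):
    """Moving average: momentum * old + (1 - momentum) * new, element-wise."""
    return [momentum * o + (1.0 - momentum) * n for o, n in zip(old, new)]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


class MemoryBank:
    """Hypothetical bank holding one unit-length embedding per dataset index."""

    def __init__(self, size, dim):
        self.slots = [[0.0] * dim for _ in range(size)]

    def write(self, index, embedding):
        # Store the (assumed momentum-encoder) embedding for this instance.
        self.slots[index] = l2_normalize(embedding)

    def hardest_negative(self, query, positive_index):
        """Return the index of the most similar slot that is not the positive."""
        query = l2_normalize(query)
        candidates = [i for i in range(len(self.slots)) if i != positive_index]
        return max(candidates, key=lambda i: dot(query, self.slots[i]))


class TextCenterMemory:
    """Hypothetical per-video center of caption embeddings, updated by EMA."""

    def __init__(self, num_videos, dim, momentum=0.9):
        self.centers = [[0.0] * dim for _ in range(num_videos)]
        self.seen = [False] * num_videos
        self.momentum = momentum

    def update(self, video_index, text_embedding):
        if not self.seen[video_index]:
            # Assumption: the first caption seen initialises the center.
            self.centers[video_index] = list(text_embedding)
            self.seen[video_index] = True
        else:
            self.centers[video_index] = ema_update(
                self.centers[video_index], text_embedding, self.momentum
            )
        return self.centers[video_index]
```

In this sketch, the bank is searched over every stored instance rather than over the current mini-batch, which is the sense in which the negative is "global". The text center changes slowly as captions of the same video arrive, so it summarises several descriptions rather than any single one.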




