Improving Video Retrieval by Adaptive Margin

by Feng He, et al.

Video retrieval is becoming increasingly important owing to the rapid growth of video content on the Internet. The dominant paradigm learns video-text representations by enforcing that the similarity of positive pairs exceeds that of negative pairs by a fixed margin. However, negative pairs are sampled randomly during training, so their semantics may be related or even equivalent, yet most methods still push their representations apart to decrease their similarity. This leads to inaccurate supervision and degrades the learned video-text representations. Whereas most video retrieval methods overlook this phenomenon, we propose an adaptive margin that changes with the distance between positive and negative pairs to address the issue. First, we design a framework for computing the adaptive margin, including the distance measure and the function mapping distance to margin. Then, we explore a novel implementation called "Cross-Modal Generalized Self-Distillation" (CMGSD), which can be built on top of most video retrieval models with few modifications. Notably, CMGSD adds little computational overhead at training time and none at test time. Experimental results on three widely used datasets demonstrate that the proposed method yields significantly better performance than the corresponding backbone model and outperforms state-of-the-art methods by a large margin.
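To make the core idea concrete, below is a minimal sketch of a ranking loss with an adaptive margin. This is not the paper's exact formulation: the linear rule mapping the negative pair's similarity to a margin (`base_margin`, `scale`) is a hypothetical choice for illustration, standing in for the distance-to-margin function the paper designs.

```python
import numpy as np

def adaptive_margin_loss(sim_pos, sim_neg, base_margin=0.2, scale=0.1):
    """Triplet-style ranking loss with an adaptive margin.

    Instead of a fixed margin, the margin shrinks when a randomly sampled
    negative pair is likely to be semantically related (approximated here
    by the negative pair's own similarity score).

    sim_pos: similarities of matched video-text pairs, shape (B,)
    sim_neg: similarities of mismatched pairs, shape (B,)
    """
    # Hypothetical adaptive rule: more similar negatives get a smaller
    # margin, so near-duplicate negatives are penalized less harshly.
    margin = np.clip(base_margin - scale * sim_neg, 0.0, base_margin)
    # Hinge loss: each positive pair should beat its negative by the margin.
    losses = np.maximum(0.0, margin - (sim_pos - sim_neg))
    return losses.mean()
```

With a fixed margin, the second term `scale * sim_neg` would be zero; the adaptive version instead relaxes the constraint exactly where random negative sampling is most likely to have produced a false negative.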


