Disentangled Representation Learning for Text-Video Retrieval

03/14/2022
by Qiang Wang, et al.

Cross-modality interaction is a critical component in Text-Video Retrieval (TVR), yet there has been little examination of how the different factors involved in computing the interaction affect performance. This paper first studies the interaction paradigm in depth and finds that its computation can be split into two terms: the interaction content at different granularities, and the matching function that distinguishes pairs with the same semantics. We also observe that the single-vector representation and the implicit intensive function substantially hinder optimization. Based on these findings, we propose a disentangled framework to capture a sequential and hierarchical representation. First, considering the natural sequential structure of both text and video inputs, a Weighted Token-wise Interaction (WTI) module is used to decouple the content and adaptively exploit the pair-wise correlations. This interaction can form a better disentangled manifold for sequential inputs. Second, we introduce a Channel DeCorrelation Regularization (CDCR) to minimize the redundancy between the components of the compared vectors, which facilitates learning a hierarchical representation. We demonstrate the effectiveness of the disentangled representation on various benchmarks, surpassing CLIP4Clip by a large margin (e.g., +2.9) on benchmarks including MSVD, VATEX, LSMDC, ActivityNet, and DiDeMo.
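
The abstract names the two components but gives no formulas, so the following is only a minimal NumPy sketch of how they might look, not the authors' implementation. It assumes that WTI scores a text-video pair by taking, for each token, its best cosine match on the other side and then combining those matches with softmax weights, and that CDCR penalizes off-diagonal entries of the cross-correlation matrix between paired text and video channels. The function names, the `text_logits` and `video_logits` inputs (standing in for learned per-token weights), and the `off_diag_weight` value are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def weighted_token_wise_interaction(text_tokens, video_frames,
                                    text_logits, video_logits):
    """Assumed WTI-style score for one text-video pair.

    text_tokens:  (Nt, D) token embeddings
    video_frames: (Nv, D) frame embeddings
    text_logits:  (Nt,) unnormalized token weights (hypothetical input)
    video_logits: (Nv,) unnormalized frame weights (hypothetical input)
    """
    t = l2_normalize(text_tokens)
    v = l2_normalize(video_frames)
    sim = t @ v.T                                # (Nt, Nv) cosine similarities
    w_t = softmax(text_logits)                   # weights over text tokens
    w_v = softmax(video_logits)                  # weights over video frames
    t2v = np.sum(w_t * sim.max(axis=1))          # best frame for each token
    v2t = np.sum(w_v * sim.max(axis=0))          # best token for each frame
    return 0.5 * (t2v + v2t)


def channel_decorrelation_penalty(text_vecs, video_vecs, off_diag_weight=0.06):
    """Assumed CDCR-style penalty over a batch of paired (B, D) vectors.

    Standardizes each channel over the batch, builds the (D, D)
    cross-correlation matrix, and penalizes diagonal entries that differ
    from 1 plus (down-weighted) squared off-diagonal entries.
    """
    batch = text_vecs.shape[0]
    zt = (text_vecs - text_vecs.mean(axis=0)) / (text_vecs.std(axis=0) + 1e-8)
    zv = (video_vecs - video_vecs.mean(axis=0)) / (video_vecs.std(axis=0) + 1e-8)
    c = zt.T @ zv / batch                        # (D, D) cross-correlation
    diag = np.diag(c)
    on_diag = np.sum((diag - 1.0) ** 2)
    off_diag = np.sum(c ** 2) - np.sum(diag ** 2)
    return on_diag + off_diag_weight * off_diag
```

Under these assumptions, scoring a sequence against itself gives a similarity close to 1 regardless of the weights, and the penalty grows as different channels of the paired vectors become more correlated.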


