Contrastive Video Question Answering via Video Graph Transformer

02/27/2023
by Junbin Xiao, et al.

We propose to perform video question answering (VideoQA) in a contrastive manner via a Video Graph Transformer model (CoVGT). CoVGT's uniqueness and superiority are three-fold: 1) It proposes a dynamic graph transformer module that encodes video by explicitly capturing visual objects, their relations, and their dynamics for complex spatio-temporal reasoning. 2) It designs separate video and text transformers for contrastive learning between video and text to perform QA, instead of a multi-modal transformer for answer classification; fine-grained video-text communication is handled by additional cross-modal interaction modules. 3) It is optimized with joint fully- and self-supervised contrastive objectives between correct and incorrect answers, and between relevant and irrelevant questions, respectively. With its superior video encoding and QA formulation, CoVGT achieves much better performance than previous methods on video reasoning tasks, even surpassing models pretrained on millions of external examples. We further show that CoVGT also benefits from cross-modal pretraining, yet with orders of magnitude less data. These results demonstrate the effectiveness and superiority of CoVGT, and reveal its potential for more data-efficient pretraining. We hope our success can advance VideoQA beyond coarse recognition/description towards fine-grained relation reasoning of video content. Our code will be available at https://github.com/doc-doc/CoVGT.
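
As a minimal illustration of the contrastive objectives described in point 3, the sketch below scores a pooled video representation against candidate answer embeddings (fully-supervised term) and against in-batch questions (self-supervised term), applying cross-entropy over the similarities. The function names, tensor shapes, and temperature value are illustrative assumptions in PyTorch style, not the paper's actual implementation (see the repository linked above for that).

```python
# Hedged sketch of contrastive VideoQA objectives: the shapes, names, and
# temperature below are assumptions, not CoVGT's actual code.
import torch
import torch.nn.functional as F

def answer_contrastive_loss(video_emb: torch.Tensor,
                            answer_embs: torch.Tensor,
                            correct_idx: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """Fully-supervised term: cross-entropy over video-answer similarities.

    video_emb:   (B, D)    pooled video (+question) representation
    answer_embs: (B, K, D) embeddings of K candidate answers per sample
    correct_idx: (B,)      index of the correct answer in each candidate set
    """
    v = F.normalize(video_emb, dim=-1)                        # (B, D)
    a = F.normalize(answer_embs, dim=-1)                      # (B, K, D)
    logits = torch.einsum('bd,bkd->bk', v, a) / temperature   # (B, K)
    return F.cross_entropy(logits, correct_idx)

def question_contrastive_loss(video_emb: torch.Tensor,
                              question_embs: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """Self-supervised term: each video should match its own (relevant)
    question on the diagonal rather than the other (irrelevant) in-batch
    questions."""
    v = F.normalize(video_emb, dim=-1)                        # (B, D)
    q = F.normalize(question_embs, dim=-1)                    # (B, D)
    logits = v @ q.t() / temperature                          # (B, B)
    targets = torch.arange(v.size(0), device=v.device)
    return F.cross_entropy(logits, targets)

# A joint objective would simply sum the two terms:
# loss = answer_contrastive_loss(v, a, y) + question_contrastive_loss(v, q)
```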

