Video as Conditional Graph Hierarchy for Multi-Granular Question Answering

12/12/2021
by Junbin Xiao, et al.

Video question answering requires models to understand and reason about both complex video and language data to correctly derive answers. Existing efforts have focused on designing sophisticated cross-modal interactions to fuse information from the two modalities, while encoding the video and question holistically as frame and word sequences. Despite their success, these methods essentially revolve around the sequential nature of video and question content, providing little insight into the problem of question answering and lacking interpretability. In this work, we argue that while video is presented as a frame sequence, its visual elements (e.g., objects, actions, activities and events) are not sequential but rather hierarchical in semantic space. To align with the multi-granular essence of linguistic concepts in language queries, we propose to model video as a conditional graph hierarchy which weaves together visual facts of different granularity in a level-wise manner, with the guidance of corresponding textual cues. Despite its simplicity, our extensive experiments demonstrate the superiority of such a conditional hierarchical graph architecture, with clear performance improvements over prior methods and better generalization across different types of questions. Further analyses also demonstrate the model's reliability, as it shows meaningful visual-textual evidence for the predicted answers.
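
As a rough illustration of the level-wise idea in the abstract, the sketch below pools visual nodes from fine to coarse granularity (objects to frames, frames to clips, clips to video), weighting each node by its similarity to a text query vector. It is not the authors' implementation: the abstract does not give the graph operations, so plain query-weighted pooling is swapped in here, and every name (`conditional_pool`, `build_hierarchy`, `frames_per_clip`) and the toy feature vectors are assumptions made for this example.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def conditional_pool(nodes, query):
    """Merge child nodes into one parent vector.

    Each child is weighted by the softmax of its dot product with the
    text query, so children that match the textual cue dominate.
    (Assumption: a stand-in for the paper's conditional graph operation.)
    """
    weights = softmax([dot(n, query) for n in nodes])
    dim = len(nodes[0])
    return [sum(w * n[i] for w, n in zip(weights, nodes)) for i in range(dim)]

def build_hierarchy(objects_per_frame, frames_per_clip, query):
    """Build three levels of video representation, bottom-up."""
    # Level 1: objects within each frame -> one frame node.
    frames = [conditional_pool(objs, query) for objs in objects_per_frame]
    # Level 2: consecutive frames -> one clip node.
    clips = [
        conditional_pool(frames[i:i + frames_per_clip], query)
        for i in range(0, len(frames), frames_per_clip)
    ]
    # Level 3: all clips -> a single video node.
    video = conditional_pool(clips, query)
    return frames, clips, video

# Toy input: 4 frames, 2 objects per frame, 2-dimensional features.
toy_objects = [[[1.0, 0.0], [0.0, 1.0]] for _ in range(4)]
toy_query = [1.0, 0.0]  # hypothetical embedding of a textual cue
frames, clips, video = build_hierarchy(toy_objects, 2, toy_query)
```

In this sketch the same query conditions every level; a fuller model could condition each level on a different part of the question (for instance, object words at the lowest level and event words at the highest), which is closer to the "corresponding textual cues" the abstract mentions.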

