Object-Centric Representation Learning for Video Question Answering

by Long Hoang Dang, et al.

Video question answering (Video QA) presents a powerful testbed for human-like intelligent behaviors. The task demands new capabilities to integrate video processing, language understanding, binding abstract linguistic concepts to concrete visual artifacts, and deliberative reasoning over spacetime. Neural networks offer a promising approach to reach this potential through learning from examples rather than handcrafting features and rules. However, neural networks are predominantly feature-based: they map data to unstructured vector representations and thus can fall into the trap of exploiting shortcuts through surface statistics instead of the systematic reasoning seen in symbolic systems. To tackle this issue, we advocate for object-centric representation as a basis for constructing spatio-temporal structures from videos, essentially bridging the semantic gap between low-level pattern recognition and high-level symbolic algebra. To this end, we propose a new query-guided representation framework that turns a video into an evolving relational graph of objects, whose features and interactions are dynamically and conditionally inferred. The object lives are then summarized into resumes, which lend themselves naturally to deliberative relational reasoning that produces an answer to the query. The framework is evaluated on major Video QA datasets, demonstrating clear benefits of the object-centric approach to video reasoning.
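
To make the pipeline in the abstract more concrete, the following is a minimal sketch of the general idea only, not the authors' model. It assumes that objects are already detected and tracked (the same index refers to the same object in every frame), that the query is a single vector of the same dimension as the object features, and that simple dot-product attention stands in for the learned, query-conditioned inference the abstract mentions. All function names (`query_guided_graph`, `refine`, `build_resumes`) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    """Convex combination of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def query_guided_graph(objects, query):
    """Assumed edge rule: for one frame, row i holds attention weights from
    object i to every object, computed on query-gated features."""
    adjacency = []
    for obj in objects:
        gated = [x * q for x, q in zip(obj, query)]
        adjacency.append(softmax([dot(gated, other) for other in objects]))
    return adjacency


def refine(objects, adjacency):
    """One round of message passing: mix each object with its neighbours."""
    return [
        [0.5 * o + 0.5 * m for o, m in zip(obj, weighted_sum(row, objects))]
        for obj, row in zip(objects, adjacency)
    ]


def build_resumes(video, query):
    """video[t][n] is the feature vector of tracked object n at frame t.
    Returns one query-weighted temporal summary ("resume") per object."""
    refined = [refine(frame, query_guided_graph(frame, query)) for frame in video]
    resumes = []
    for n in range(len(video[0])):
        life = [frame[n] for frame in refined]  # the object's states over time
        weights = softmax([dot(state, query) for state in life])
        resumes.append(weighted_sum(weights, life))
    return resumes


if __name__ == "__main__":
    toy_video = [
        [[1.0, 0.0], [0.0, 1.0]],  # frame 0: two objects, 2-D features
        [[0.8, 0.2], [0.1, 0.9]],  # frame 1
    ]
    toy_query = [1.0, 0.5]
    for resume in build_resumes(toy_video, toy_query):
        print(resume)
```

In this sketch the per-object resumes would be the input to a downstream reasoning and answer module, which is omitted here because the abstract does not specify it.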




