EgoTaskQA: Understanding Human Tasks in Egocentric Videos

10/08/2022
by Baoxiong Jia, et al.

Understanding human tasks through video observations is an essential capability of intelligent agents. The challenges of such capability lie in the difficulty of generating a detailed understanding of situated actions, their effects on object states (i.e., state changes), and their causal dependencies. These challenges are further aggravated by the natural parallelism from multi-tasking and partial observations in multi-agent collaboration. Most prior works leverage action localization or future prediction as an indirect metric for evaluating such task understanding from videos. To make a direct evaluation, we introduce the EgoTaskQA benchmark that provides a single home for the crucial dimensions of task understanding through question-answering on real-world egocentric videos. We meticulously design questions that target the understanding of (1) action dependencies and effects, (2) intents and goals, and (3) agents' beliefs about others. These questions are divided into four types, including descriptive (what status?), predictive (what will?), explanatory (what caused?), and counterfactual (what if?), to provide diagnostic analyses of spatial, temporal, and causal understanding of goal-oriented tasks. We evaluate state-of-the-art video reasoning models on our benchmark and show the significant gap between them and humans in understanding complex goal-oriented egocentric videos. We hope this effort will drive the vision community to move onward with goal-oriented video understanding and reasoning.
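To make the four question types concrete, below is a minimal Python sketch of how a per-type accuracy breakdown could be computed over such a benchmark. The field names (video_id, question, answer, q_type) and the exact-match metric are illustrative assumptions, not the actual EgoTaskQA annotation schema or official evaluation protocol.

    # Illustrative sketch only: hypothetical sample schema and a per-type
    # accuracy breakdown; not the official EgoTaskQA format or metric.
    from collections import defaultdict
    from dataclasses import dataclass

    @dataclass
    class QASample:
        video_id: str   # egocentric clip the question is grounded in
        question: str   # e.g., "What will the agent do after opening the fridge?"
        answer: str     # ground-truth answer string
        q_type: str     # "descriptive", "predictive", "explanatory", or "counterfactual"

    def accuracy_by_type(samples, predictions):
        """Compute accuracy separately for each question type, the kind of
        diagnostic breakdown the four question types are designed to support."""
        correct, total = defaultdict(int), defaultdict(int)
        for sample, pred in zip(samples, predictions):
            total[sample.q_type] += 1
            correct[sample.q_type] += int(pred.strip().lower() == sample.answer.strip().lower())
        return {t: correct[t] / total[t] for t in total}

    # Toy usage with made-up data:
    samples = [
        QASample("clip_001", "What is the agent holding?", "a kettle", "descriptive"),
        QASample("clip_001", "What will the agent do next?", "pour water", "predictive"),
    ]
    predictions = ["a kettle", "open the cabinet"]
    print(accuracy_by_type(samples, predictions))  # {'descriptive': 1.0, 'predictive': 0.0}

Reporting accuracy per question type rather than a single aggregate number makes it possible to diagnose whether a model's errors are concentrated in descriptive, predictive, explanatory, or counterfactual questions.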


