FunQA: Towards Surprising Video Comprehension

by Binzhu Xie, et al.
Nanyang Technological University

Surprising videos, e.g., funny clips, creative performances, or visual illusions, attract significant attention. Enjoyment of these videos is not simply a response to visual stimuli; rather, it hinges on the human capacity to understand (and appreciate) commonsense violations depicted in these videos. We introduce FunQA, a challenging video question answering (QA) dataset specifically designed to evaluate and enhance the depth of video reasoning based on counter-intuitive and fun videos. Unlike most video QA benchmarks, which focus on less surprising contexts, e.g., cooking or instructional videos, FunQA covers three previously unexplored types of surprising videos: 1) HumorQA, 2) CreativeQA, and 3) MagicQA. For each subset, we establish rigorous QA tasks designed to assess the model's capability in counter-intuitive timestamp localization, detailed video description, and reasoning around counter-intuitiveness. We also pose higher-level tasks, such as attributing a fitting and vivid title to the video and scoring the video's creativity. In total, the FunQA benchmark consists of 312K free-text QA pairs derived from 4.3K video clips, spanning a total of 24 video hours. Extensive experiments with existing VideoQA models reveal significant performance gaps on the FunQA videos across spatial-temporal reasoning, visual-centered reasoning, and free-text generation.



How to Make a BLT Sandwich? Learning to Reason towards Understanding Web Instructional Videos

Understanding web instructional videos is an essential branch of video u...

NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions

We introduce NExT-QA, a rigorously designed video question answering (Vi...

Watching the News: Towards VideoQA Models that can Read

Video Question Answering methods focus on commonsense reasoning and visu...

TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning over Traffic Events

Traffic event cognition and reasoning in videos is an important task tha...

DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering

Video question answering is a challenging task, which requires agents to...

BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded Dialogues

Video-grounded dialogues are very challenging due to (i) the complexity ...

EgoTaskQA: Understanding Human Tasks in Egocentric Videos

Understanding human tasks through video observations is an essential cap...
