Visual Abductive Reasoning

by Chen Liang, et al.

Abductive reasoning seeks the likeliest explanation for partial observations. Although abduction is frequently employed in everyday human reasoning, it is rarely explored in the computer vision literature. In this paper, we propose a new task and dataset, Visual Abductive Reasoning (VAR), for examining the abductive reasoning ability of machine intelligence in everyday visual situations. Given an incomplete set of visual events, AI systems are required not only to describe what is observed, but also to infer the hypothesis that best explains the visual premise. Based on our large-scale VAR dataset, we devise a strong baseline model, Reasoner (causal-and-cascaded reasoning Transformer). First, to capture the causal structure of the observations, a contextualized directional position embedding strategy is adopted in the encoder, which yields discriminative representations for the premise and hypothesis. Then, multiple decoders are cascaded to generate and progressively refine the premise and hypothesis sentences. The prediction scores of the sentences are used to guide cross-sentence information flow in the cascaded reasoning procedure. Our VAR benchmarking results show that Reasoner surpasses many well-known video-language models, while still falling far behind human performance. This work is expected to foster future efforts in the reasoning-beyond-observation paradigm.



CLEVRER: CoLlision Events for Video REpresentation and Reasoning

The ability to reason about temporal and causal events from videos lies ...

Can Language Models perform Abductive Commonsense Reasoning?

Abductive Reasoning is a task of inferring the most plausible hypothesis...

VIOLIN: A Large-Scale Dataset for Video-and-Language Inference

We introduce a new task, Video-and-Language Inference, for joint multimo...

Can Vision-Language Models be a Good Guesser? Exploring VLMs for Times and Location Reasoning

Vision-Language Models (VLMs) are expected to be capable of reasoning wi...

Cascaded Mutual Modulation for Visual Reasoning

Visual reasoning is a special visual question answering problem that is ...

Are Elephants Bigger than Butterflies? Reasoning about Sizes of Objects

Human vision greatly benefits from the information about sizes of object...

Computer-Simulation Model Theory (P= NP is not provable)

The simulation hypothesis says that all the materials and events in the ...
