3D Concept Learning and Reasoning from Multi-View Images

by Yining Hong, et al.

Humans are able to reason accurately in 3D by gathering multi-view observations of the surrounding world. Inspired by this insight, we introduce a new large-scale benchmark for 3D multi-view visual question answering (3DMV-VQA). The dataset is collected by an embodied agent that actively moves through environments in the Habitat simulator and captures RGB images. In total, it consists of approximately 5k scenes and 600k images, paired with 50k questions. We evaluate various state-of-the-art visual reasoning models on our benchmark and find that they all perform poorly. We suggest that a principled approach to 3D reasoning from multi-view images should first infer a compact 3D representation of the world from the images, ground that representation in open-vocabulary semantic concepts, and then execute reasoning over it. As a first step towards this approach, we propose a novel 3D concept learning and reasoning (3D-CLR) framework that seamlessly combines these components via neural fields, 2D pre-trained vision-language models, and neural reasoning operators. Experimental results suggest that our framework outperforms baseline models by a large margin, but the challenge remains largely unsolved. We further perform an in-depth analysis of the challenges and highlight potential future directions.
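The abstract's three-stage recipe can be illustrated with a toy sketch. Everything below is a simplification, not the authors' implementation: the voxel features stand in for a neural field fit to multi-view images, the random concept vectors stand in for embeddings from a 2D vision-language model, and `op_filter`/`op_count` are hypothetical names for two simple reasoning operators.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1 (stand-in): a "compact 3D representation" -- N voxels,
# each with a D-dimensional feature vector. In 3D-CLR this would
# come from a neural field inferred from the multi-view images.
N, D = 100, 8
voxel_feats = rng.normal(size=(N, D))

# Stage 2 (stand-in): hypothetical open-vocabulary concept embeddings,
# standing in for text embeddings from a pre-trained vision-language model.
concepts = {"chair": rng.normal(size=D), "table": rng.normal(size=D)}

def ground(feats, concept_emb):
    """Ground a concept in 3D: per-voxel cosine similarity to its embedding."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    c = concept_emb / np.linalg.norm(concept_emb)
    return f @ c

# Stage 3 (stand-in): reasoning operators executed on the grounded 3D scene.
def op_filter(scores, threshold=0.5):
    """Select the voxels whose concept score exceeds a threshold."""
    return np.nonzero(scores > threshold)[0]

def op_count(voxel_indices):
    """Count the selected voxels (a crude proxy for a 'count' answer)."""
    return len(voxel_indices)

chair_scores = ground(voxel_feats, concepts["chair"])
chair_voxels = op_filter(chair_scores, threshold=0.5)
print("voxels grounded to 'chair':", op_count(chair_voxels))
```

A real system would chain many such operators (filter, relate, count, exist) over learned, spatially structured features; the point here is only the data flow: multi-view images → 3D representation → concept grounding → symbolic-style reasoning.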


