3D Question Answering

by   Shuquan Ye, et al.

Visual Question Answering (VQA) has witnessed tremendous progress in recent years. However, most efforts only focus on the 2D image question answering tasks. In this paper, we present the first attempt at extending VQA to the 3D domain, which can facilitate artificial intelligence's perception of 3D real-world scenarios. Different from image based VQA, 3D Question Answering (3DQA) takes the color point cloud as input and requires both appearance and 3D geometry comprehension ability to answer the 3D-related questions. To this end, we propose a novel transformer-based 3DQA framework “3DQA-TR", which consists of two encoders for exploiting the appearance and geometry information, respectively. The multi-modal information of appearance, geometry, and the linguistic question can finally attend to each other via a 3D-Linguistic Bert to predict the target answers. To verify the effectiveness of our proposed 3DQA framework, we further develop the first 3DQA dataset “ScanQA", which builds on the ScanNet dataset and contains ∼6K questions, ∼30K answers for 806 scenes. Extensive experiments on this dataset demonstrate the obvious superiority of our proposed 3DQA framework over existing VQA frameworks, and the effectiveness of our major designs. Our code and dataset will be made publicly available to facilitate the research in this direction.


page 2

page 5

page 8

page 13

page 14

page 15

page 16

page 17


IQ-VQA: Intelligent Visual Question Answering

Even though there has been tremendous progress in the field of Visual Qu...

OpenViVQA: Task, Dataset, and Multimodal Fusion Models for Visual Question Answering in Vietnamese

In recent years, visual question answering (VQA) has attracted attention...

Towards Multi-Lingual Visual Question Answering

Visual Question Answering (VQA) has been primarily studied through the l...

DVQA: Understanding Data Visualizations via Question Answering

Bar charts are an effective way for humans to convey information to each...

LiVLR: A Lightweight Visual-Linguistic Reasoning Framework for Video Question Answering

Video Question Answering (VideoQA), aiming to correctly answer the given...

On the General Value of Evidence, and Bilingual Scene-Text Visual Question Answering

Visual Question Answering (VQA) methods have made incredible progress, b...

Neural Self Talk: Image Understanding via Continuous Questioning and Answering

In this paper we consider the problem of continuously discovering image ...

Please sign up or login with your details

Forgot password? Click here to reset