OCTScenes: A Versatile Real-World Dataset of Tabletop Scenes for Object-Centric Learning

by Yinxuan Huang, et al.

Humans possess the cognitive ability to comprehend scenes in a compositional manner. To empower AI systems with similar abilities, object-centric representation learning aims to acquire representations of individual objects from visual scenes without any supervision. Although object-centric representation learning has made remarkable progress on complex synthetic datasets, applying these methods to complex real-world scenes remains a major challenge. One key reason is the scarcity of real-world datasets specifically tailored to object-centric representation learning methods. To address this problem, we propose OCTScenes, a versatile real-world dataset of tabletop scenes for object-centric learning, which is meticulously designed to serve as a benchmark for comparing, evaluating, and analyzing object-centric representation learning methods. OCTScenes contains 5000 tabletop scenes with a total of 15 everyday objects. Each scene is captured in 60 frames covering a 360-degree perspective. Consequently, OCTScenes is a versatile benchmark that supports the evaluation of object-centric representation learning methods on static-scene, dynamic-scene, and multi-view-scene tasks alike. We conduct extensive experiments on OCTScenes with object-centric representation learning methods for static, dynamic, and multi-view scenes. The results demonstrate the shortcomings of state-of-the-art methods in learning meaningful representations from real-world data, despite their impressive performance on complex synthetic datasets. Furthermore, OCTScenes can serve as a catalyst for advancing existing state-of-the-art methods, inspiring them to adapt to real-world scenes. Dataset and code are available at https://huggingface.co/datasets/Yinxuan/OCTScenes.
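
To make the three evaluation settings concrete, the following is a minimal sketch of how the 60 frames of one scene could be grouped for static, dynamic, and multi-view tasks. Only the counts (5000 scenes, 60 frames per scene) come from the abstract. The function names, the clip length, the number of views, and the use of integer frame indices in place of images are illustrative assumptions, not the dataset's actual layout or loader.

```python
import random

SCENES = 5000          # number of scenes, from the abstract
FRAMES_PER_SCENE = 60  # frames per scene, from the abstract


def static_samples(frames):
    """Static-scene setting: every frame is treated as an independent image."""
    return [[f] for f in frames]


def dynamic_clips(frames, clip_len):
    """Dynamic-scene setting: non-overlapping clips of consecutive frames,
    kept in capture order. clip_len is a hypothetical parameter."""
    return [frames[i:i + clip_len]
            for i in range(0, len(frames) - clip_len + 1, clip_len)]


def multiview_sample(frames, num_views, rng):
    """Multi-view setting: distinct viewpoints of the same scene, drawn
    without regard to capture order. num_views is a hypothetical parameter."""
    return rng.sample(frames, num_views)


# Integer frame indices stand in for the actual images of one scene.
frames = list(range(FRAMES_PER_SCENE))

total_frames = SCENES * FRAMES_PER_SCENE
clips = dynamic_clips(frames, clip_len=6)
views = multiview_sample(frames, num_views=4, rng=random.Random(0))

print(total_frames)                 # total frames across the dataset
print(len(static_samples(frames)))  # static samples from one scene
print(len(clips))                   # dynamic clips from one scene
print(sorted(views))                # one multi-view sample (frame indices)
```

The same frames feed all three settings; only the grouping changes, which is the sense in which a single capture protocol can serve several tasks.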



Object-Centric Representation Learning with Generative Spatial-Temporal Factorization

Learning object-centric scene representations is essential for attaining...

wildNeRF: Complete view synthesis of in-the-wild dynamic scenes captured using sparse monocular data

We present a novel neural radiance model that is trainable in a self-sup...

DiVA-360: The Dynamic Visuo-Audio Dataset for Immersive Neural Fields

Advances in neural fields are enabling high-fidelity capture of the shap...

Causal Triplet: An Open Challenge for Intervention-centric Causal Representation Learning

Recent years have seen a surge of interest in learning high-level causal...

RobustCLEVR: A Benchmark and Framework for Evaluating Robustness in Object-centric Learning

Object-centric representation learning offers the potential to overcome ...

DySR: A Dynamic Representation Learning and Aligning based Model for Service Bundle Recommendation

An increasing number and diversity of services are available, which resu...

ClevrTex: A Texture-Rich Benchmark for Unsupervised Multi-Object Segmentation

There has been a recent surge in methods that aim to decompose and segme...
