Semantically Grounded Object Matching for Robust Robotic Scene Rearrangement

by Walter Goodwin et al.

Object rearrangement has recently emerged as a key competency in robot manipulation, with practical solutions generally involving object detection, recognition, grasping, and high-level planning. Goal images describing a desired scene configuration are a promising and increasingly used mode of instruction. A key outstanding challenge is accurately inferring matches between the objects in front of a robot and those seen in a provided goal image, a task on which recent works have struggled in the absence of object-specific training data. In this work, we explore how existing methods' ability to infer matches between objects deteriorates as the visual shift between observed and goal scenes increases. We find that a fundamental limitation of the current setting is that source and target images must contain the same instance of every object, which restricts practical deployment. We present a novel approach to object matching that uses a large pre-trained vision-language model to match objects in a cross-instance setting, leveraging semantics together with visual features as a more robust, and much more general, measure of similarity. We demonstrate that this considerably improves matching performance in cross-instance settings, and can be used to guide multi-object rearrangement with a robot manipulator from an image that shares no object instances with the robot's scene.
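The matching step described above can be illustrated with a minimal sketch. Assuming per-object feature embeddings are already available (e.g. from a pre-trained vision-language model encoding cropped object images), cross-instance matching reduces to computing pairwise cosine similarities between scene and goal embeddings and solving an optimal assignment. The function name and the use of Hungarian matching here are illustrative assumptions, not the paper's exact pipeline:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_objects(scene_feats: np.ndarray, goal_feats: np.ndarray):
    """Match N scene objects to N goal objects via cosine similarity.

    scene_feats, goal_feats: (N, D) arrays of per-object embeddings,
    e.g. produced by a vision-language model (an assumption here).
    Returns (list of (scene_idx, goal_idx) pairs, per-pair similarities).
    """
    # L2-normalise so that dot products equal cosine similarities
    s = scene_feats / np.linalg.norm(scene_feats, axis=1, keepdims=True)
    g = goal_feats / np.linalg.norm(goal_feats, axis=1, keepdims=True)
    sim = s @ g.T  # (N, N) similarity matrix

    # Hungarian algorithm maximises total similarity (minimise negative)
    rows, cols = linear_sum_assignment(-sim)
    pairs = [(int(r), int(c)) for r, c in zip(rows, cols)]
    return pairs, sim[rows, cols]
```

Because the embeddings carry semantics rather than instance-specific appearance, two different mugs can still receive a high similarity score, which is what enables the cross-instance setting discussed in the abstract.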




