BatVision: Learning to See 3D Spatial Layout with Two Ears

Virtual camera images showing the correct layout of the space ahead can be generated purely by listening to the reflections of chirped sounds. Many species have evolved sophisticated non-visual perception, while artificial systems lag behind. Radar and ultrasound are used where cameras fail, but they provide very limited information or require large, complex and expensive sensors. Yet sound is used effortlessly by dolphins, bats, whales and humans as a sensor modality with many advantages over vision. However, harnessing it for useful, detailed machine perception remains challenging. We train a network to generate 2D and 3D representations of the world from sound alone, emitted by one speaker and captured by two microphones. Inspired by examples from nature, we emit short frequency-modulated sound chirps and record the returning echoes through a pair of artificial human pinnae. We then learn to generate disparity-like depth maps and grayscale images from the echoes in an end-to-end fashion. Using only low-cost equipment, our models show good reconstruction performance while being robust to errors and even overcoming limitations of our vision-based ground truth. Finally, we introduce a large dataset consisting of binaural sound signals synchronised in time with both RGB images and depth maps.
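To make the sensing pipeline described above concrete, here is a minimal sketch of the audio front end: emitting a short frequency-modulated chirp and converting a binaural echo recording into per-ear log-magnitude spectrograms suitable as network input. The concrete parameters (44.1 kHz sample rate, 3 ms linear sweep from 20 Hz to 20 kHz, 256-sample spectrogram windows) are illustrative assumptions, not the authors' published settings, and `make_chirp` / `echo_to_features` are hypothetical helper names.

```python
# Sketch of the abstract's sensing front end: emit an FM chirp, record
# the stereo echo through two microphones, and turn each channel into a
# log-magnitude spectrogram. Parameter values are assumptions.
import numpy as np
from scipy.signal import chirp, spectrogram

FS = 44_100               # assumed audio sample rate (Hz)
CHIRP_S = 0.003           # assumed chirp duration (3 ms)
F0, F1 = 20.0, 20_000.0   # assumed linear frequency sweep range (Hz)

def make_chirp() -> np.ndarray:
    """Short linear FM chirp, as emitted by the speaker."""
    t = np.linspace(0.0, CHIRP_S, int(FS * CHIRP_S), endpoint=False)
    return chirp(t, f0=F0, t1=t[-1], f1=F1, method="linear").astype(np.float32)

def echo_to_features(stereo_echo: np.ndarray) -> np.ndarray:
    """Convert a (2, n_samples) binaural recording into a
    (2, freq, time) stack of log-magnitude spectrograms, one per ear."""
    feats = []
    for channel in stereo_echo:
        _, _, sxx = spectrogram(channel, fs=FS, nperseg=256, noverlap=128)
        feats.append(np.log1p(sxx))
    return np.stack(feats)

if __name__ == "__main__":
    excitation = make_chirp()
    # Placeholder for a real capture: in practice the stereo echo is
    # recorded by the two microphones right after the chirp is played.
    fake_echo = np.random.randn(2, FS // 10).astype(np.float32)
    features = echo_to_features(fake_echo)
    print(excitation.shape, features.shape)
```

In the setup the abstract describes, such stereo time-frequency features would be fed to an encoder-decoder trained end-to-end against camera-derived ground truth to regress the depth map and grayscale image; a faithful replica would replace the random placeholder with a real, time-synchronised echo recording.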


