There is More than Meets the Eye: Self-Supervised Multi-Object Detection and Tracking with Sound by Distilling Multimodal Knowledge

Sound attributes inherent to objects can provide valuable cues for learning rich representations for object detection and tracking. Furthermore, the co-occurrence of audiovisual events in videos can be exploited to localize objects in the image solely by monitoring the sound in the environment. Thus far, this has only been feasible for single-object detection in scenarios where the camera is static. Moreover, the robustness of these methods has been limited because they rely primarily on RGB images, which are highly susceptible to illumination and weather changes. In this work, we present MM-DistillNet, a novel self-supervised framework consisting of multiple teachers that leverage diverse modalities, including RGB, depth, and thermal images, to simultaneously exploit complementary cues and distill knowledge into a single audio student network. We propose the new MTA loss function, which facilitates the distillation of information from multimodal teachers in a self-supervised manner. Additionally, we propose a novel self-supervised pretext task for the audio student that removes the need for labor-intensive manual annotations. We introduce a large-scale multimodal dataset with over 113,000 time-synchronized frames of RGB, depth, thermal, and audio modalities. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods while detecting multiple objects using only sound during inference, even while moving.
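
To make the multi-teacher idea concrete, the following is a minimal sketch of one generic way several teacher feature vectors could be combined into a single distillation target for a student. It is not the MTA loss, whose definition is not given in this abstract. The function names, the averaging of normalized teacher features, and the toy feature values are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit L2 norm (returned unchanged if all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def multi_teacher_distill_loss(student_feat, teacher_feats):
    """Hypothetical distillation loss, NOT the paper's MTA loss.

    Each teacher feature is L2-normalized, the teachers are averaged
    element-wise, and the mean squared error between the normalized
    student feature and the normalized average is returned.
    """
    if not teacher_feats:
        raise ValueError("need at least one teacher feature vector")
    dim = len(student_feat)
    if any(len(t) != dim for t in teacher_feats):
        raise ValueError("all feature vectors must have the same length")

    normed_teachers = [l2_normalize(t) for t in teacher_feats]
    averaged = [
        sum(t[i] for t in normed_teachers) / len(normed_teachers)
        for i in range(dim)
    ]
    target = l2_normalize(averaged)
    student = l2_normalize(student_feat)
    return sum((s - t) ** 2 for s, t in zip(student, target)) / dim


if __name__ == "__main__":
    # Toy stand-ins for RGB, depth, and thermal teacher features (assumed values).
    rgb, depth, thermal = [0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [1.0, 0.0, 0.1]
    audio_student = [0.5, 0.5, 0.0]
    print(multi_teacher_distill_loss(audio_student, [rgb, depth, thermal]))
```

Normalizing before averaging keeps any one modality from dominating the target purely because of its feature scale. That is a design choice of this sketch and not a claim about the framework itself.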

