Multimodal Transfer Deep Learning with Applications in Audio-Visual Recognition

12/09/2014
by Seungwhan Moon, et al.

We propose a transfer deep learning (TDL) framework that can transfer the knowledge obtained from a single-modal neural network to a network with a different modality. Specifically, we show that we can leverage speech data to fine-tune a network trained for video recognition, given an initial audio-video parallel dataset with shared semantics. Our approach first learns analogy-preserving embeddings between the abstract representations learned at intermediate layers of each network, allowing for semantics-level transfer between the source and target modalities. We then apply a neural network operation that fine-tunes the target network with the additional knowledge transferred from the source network, while keeping the topology of the target network unchanged. While we present an audio-visual recognition task as an application of our approach, our framework is flexible and can thus work with any multimodal dataset, or with any pair of existing deep networks that share common underlying semantics. In this work-in-progress report, we aim to provide comprehensive results for different configurations of the proposed approach on two widely used audio-visual datasets, and we discuss potential applications of the proposed approach.
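The core transfer step can be pictured as learning a cross-modal map between intermediate-layer activations of the two networks. The following is a minimal sketch, not the paper's actual method: it assumes a simple linear analogy-preserving map `W` fitted by least squares on a small parallel set of (audio, video) activations, which can then project new source-modality data into the target network's embedding space for fine-tuning. All array names and dimensions here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: 200 parallel examples, 64-d audio-layer
# activations, 32-d video-layer activations (hypothetical values).
n_pairs, d_audio, d_video = 200, 64, 32

# Simulated intermediate-layer activations from the two pretrained networks.
A = rng.normal(size=(n_pairs, d_audio))                      # audio network
W_true = rng.normal(size=(d_audio, d_video))                 # unknown relation
V = A @ W_true + 0.01 * rng.normal(size=(n_pairs, d_video))  # video network

# Learn the cross-modal map W: audio embedding space -> video embedding
# space, via ordinary least squares on the parallel data.
W, *_ = np.linalg.lstsq(A, V, rcond=None)

# Transfer: project new audio activations into the video network's
# embedding space, where they can serve as extra supervision when
# fine-tuning the (topologically unchanged) target network.
A_new = rng.normal(size=(10, d_audio))
V_transferred = A_new @ W

# Relative fitting error of the learned map on the parallel set.
residual = np.linalg.norm(A @ W - V) / np.linalg.norm(V)
print(V_transferred.shape, residual)
```

In the paper the embedding between modalities is learned rather than assumed linear, but the shape of the computation is the same: fit a map on a small parallel set, then reuse it to carry source-modality knowledge into the target network's representation space.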


