ModEFormer: Modality-Preserving Embedding for Audio-Video Synchronization using Transformers

03/21/2023
by   Akash Gupta, et al.

Lack of audio-video synchronization is a common problem during television broadcasts and video conferencing, leading to an unsatisfactory viewing experience. A widely accepted paradigm is to create an error detection mechanism that identifies when audio is leading or lagging. We propose ModEFormer, which independently extracts audio and video embeddings using modality-specific transformers. Unlike other transformer-based approaches, ModEFormer preserves the modality of the input streams, which allows us to use a larger batch size with more negative audio samples for contrastive learning. Further, we propose a trade-off between the number of negative samples and the number of unique samples in a batch to significantly exceed the performance of previous methods. Experimental results show that ModEFormer achieves state-of-the-art performance, with 94.5% and 90.9% correct sync detection for test clips.
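The contrastive objective described above can be sketched as an InfoNCE-style loss: a video embedding is scored against one positive audio embedding and many in-batch negatives, so larger batches supply more negatives. The function name, temperature value, and NumPy formulation below are illustrative assumptions, not taken from the paper.

```python
# Sketch of an InfoNCE-style contrastive loss over audio negatives,
# assuming L2-normalized embeddings. Names and the temperature are
# illustrative, not from ModEFormer itself.
import numpy as np

def contrastive_loss(video_emb, audio_embs, pos_idx=0, temperature=0.07):
    """video_emb: (d,) video embedding; audio_embs: (n, d) stack of one
    positive plus n-1 negative audio embeddings; returns scalar loss."""
    # Cosine similarity via dot products of unit vectors, temperature-scaled
    logits = audio_embs @ video_emb / temperature
    logits = logits - logits.max()            # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    # Cross-entropy against the index of the positive audio sample
    return -np.log(probs[pos_idx])

# Usage: a perfectly matching positive among random negatives yields a
# near-zero loss; more negatives (larger n) make the task harder.
rng = np.random.default_rng(0)
d, n = 128, 32
v = rng.standard_normal(d)
v /= np.linalg.norm(v)
a = rng.standard_normal((n, d))
a /= np.linalg.norm(a, axis=1, keepdims=True)
a[0] = v                                      # positive pair: identical embedding
loss = contrastive_loss(v, a)
```

Preserving each modality's own encoder (rather than fusing streams early) is what lets every audio clip in the batch serve as a negative for every video clip, which this loss directly exploits.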


