SVGraph: Learning Semantic Graphs from Instructional Videos

07/16/2022
by Madeline C. Schiappa, et al.

In this work, we focus on generating graphical representations of noisy instructional videos for video understanding. We propose a self-supervised, interpretable approach that requires no annotations for the graphical representations, which would be expensive and time-consuming to collect. We attempt to overcome "black box" learning limitations by presenting Semantic Video Graph (SVGraph), a multi-modal approach that uses narrations to make the learned graphs semantically interpretable. SVGraph 1) relies on agreement between multiple modalities to learn a unified graphical structure through cross-modal attention and 2) assigns semantic interpretation through Semantic-Assignment, which captures the semantics of the video narration. We perform experiments on multiple datasets and demonstrate the interpretability of SVGraph in semantic graph learning.
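
The two ideas named in the abstract, cross-modal attention and assigning narration semantics to learned nodes, can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain scaled dot-product attention with no learned projections, and the function names (`cross_modal_attention`, `assign_labels`), the toy features, and the "pick the most-attended word" rule are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(video_feats, text_feats):
    """Attend from each video feature (query) over text features (keys/values).

    Returns (attended, weights): one attended vector and one attention-weight
    row per video feature. Hypothetical helper, not part of SVGraph.
    """
    dim = len(video_feats[0])
    scale = math.sqrt(dim)
    attended, weights = [], []
    for q in video_feats:
        # Scaled dot product between the video query and every text key.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in text_feats]
        w = softmax(scores)
        # Weighted sum of the text features (used here as values).
        out = [
            sum(w[j] * text_feats[j][d] for j in range(len(text_feats)))
            for d in range(dim)
        ]
        weights.append(w)
        attended.append(out)
    return attended, weights


def assign_labels(weights, words):
    """Assumed stand-in for semantic assignment: most-attended narration word."""
    return [words[max(range(len(w)), key=w.__getitem__)] for w in weights]


# Toy data: two video features and two narration-word features (made up).
video_feats = [[1.0, 0.0], [0.0, 1.0]]
text_feats = [[2.0, 0.0], [0.0, 2.0]]
words = ["crack", "whisk"]

attended, weights = cross_modal_attention(video_feats, text_feats)
labels = assign_labels(weights, words)
```

In this toy setup each video feature aligns with exactly one word feature, so each node receives the word it attends to most strongly. A real model would learn the query, key, and value projections and would operate on features extracted from video clips and narration text.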
