Induce, Edit, Retrieve: Language Grounded Multimodal Schema for Instructional Video Retrieval

11/17/2021
by   Yue Yang, et al.
0

Schemata are structured representations of complex tasks that can aid artificial intelligence by allowing models to break down complex tasks into intermediate steps. We propose a novel system that induces schemata from web videos and generalizes them to capture unseen tasks with the goal of improving video retrieval performance. Our system proceeds in three major phases: (1) Given a task with related videos, we construct an initial schema for a task using a joint video-text model to match video segments with text representing steps from wikiHow; (2) We generalize schemata to unseen tasks by leveraging language models to edit the text within existing schemata. Through generalization, we can allow our schemata to cover a more extensive range of tasks with a small amount of learning data; (3) We conduct zero-shot instructional video retrieval with the unseen task names as the queries. Our schema-guided approach outperforms existing methods for video retrieval, and we demonstrate that the schemata induced by our system are better than those generated by other models.

READ FULL TEXT

page 1

page 3

page 8

research
09/16/2023

In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval

Large-scale noisy web image-text datasets have been proven to be efficie...
research
10/13/2021

SGD-X: A Benchmark for Robust Generalization in Schema-Guided Dialogue Systems

Zero/few-shot transfer to unseen services is a critical challenge in tas...
research
06/06/2021

Learning Video Models from Text: Zero-Shot Anticipation for Procedural Actions

Can we teach a robot to recognize and make predictions for activities th...
research
01/31/2023

Learning Universal Policies via Text-Guided Video Generation

A goal of artificial intelligence is to construct an agent that can solv...
research
08/23/2022

VILT: Video Instructions Linking for Complex Tasks

This work addresses challenges in developing conversational assistants t...
research
04/10/2018

Imagine This! Scripts to Compositions to Videos

Imagining a scene described in natural language with realistic layout an...
research
03/24/2023

Aligning Step-by-Step Instructional Diagrams to Video Demonstrations

Multimodal alignment facilitates the retrieval of instances from one mod...

Please sign up or login with your details

Forgot password? Click here to reset