Masked Diffusion with Task-awareness for Procedure Planning in Instructional Videos

09/14/2023
by   Fen Fang, et al.
0

A key challenge with procedure planning in instructional videos lies in how to handle a large decision space consisting of a multitude of action types that belong to various tasks. To understand real-world video content, an AI agent must proficiently discern these action types (e.g., pour milk, pour water, open lid, close lid, etc.) based on brief visual observation. Moreover, it must adeptly capture the intricate semantic relation of the action types and task goals, along with the variable action sequences. Recently, notable progress has been made via the integration of diffusion models and visual representation learning to address the challenge. However, existing models employ rudimentary mechanisms to utilize task information to manage the decision space. To overcome this limitation, we introduce a simple yet effective enhancement - a masked diffusion model. The introduced mask acts akin to a task-oriented attention filter, enabling the diffusion/denoising process to concentrate on a subset of action types. Furthermore, to bolster the accuracy of task classification, we harness more potent visual representation learning techniques. In particular, we learn a joint visual-text embedding, where a text embedding is generated by prompting a pre-trained vision-language model to focus on human actions. We evaluate the method on three public datasets and achieve state-of-the-art performance on multiple metrics. Code is available at https://github.com/ffzzy840304/Masked-PDPP.

READ FULL TEXT

page 4

page 13

research
03/26/2023

PDPP:Projected Diffusion for Procedure Planning in Instructional Videos

In this paper, we study the problem of procedure planning in instruction...
research
07/02/2019

Procedure Planning in Instructional Videos

We propose a new challenging task: procedure planning in instructional v...
research
05/09/2023

Style-A-Video: Agile Diffusion for Arbitrary Text-based Video Style Transfer

Large-scale text-to-video diffusion models have demonstrated an exceptio...
research
06/13/2019

Learning Spatio-Temporal Representation with Local and Global Diffusion

Convolutional Neural Networks (CNN) have been regarded as a powerful cla...
research
12/18/2021

Tell me what you see: A zero-shot action recognition method based on natural language descriptions

Recently, several approaches have explored the detection and classificat...
research
08/17/2023

Event-Guided Procedure Planning from Instructional Videos with Text Supervision

In this work, we focus on the task of procedure planning from instructio...
research
03/28/2018

Who Let The Dogs Out? Modeling Dog Behavior From Visual Data

We introduce the task of directly modeling a visually intelligent agent....

Please sign up or login with your details

Forgot password? Click here to reset