Non-Sequential Graph Script Induction via Multimedia Grounding

by Yu Zhou, et al.

Online resources such as WikiHow compile a wide range of scripts for performing everyday tasks, which can assist models in learning to reason about procedures. However, the scripts are always presented in a linear manner, which does not reflect the flexibility displayed by people executing tasks in real life. For example, in the CrossTask Dataset, 64.5% of consecutive step pairs are also observed in the reverse order, suggesting their ordering is not fixed. In addition, each step has an average of 2.56 frequent next steps, demonstrating "branching". In this paper, we propose the new challenging task of non-sequential graph script induction, aiming to capture optional and interchangeable steps in procedural planning. To automate the induction of such graph scripts for given tasks, we propose to take advantage of loosely aligned videos of people performing the tasks. In particular, we design a multimodal framework to ground procedural videos to WikiHow textual steps and thus transform each video into an observed step path on the latent ground truth graph script. This key transformation enables us to train a script knowledge model capable of both generating explicit graph scripts for learnt tasks and predicting future steps given a partial step sequence. Our best model outperforms the strongest pure text/vision baselines by 17.52% on F1@3 for next step prediction and by 13.8% on sequence completion. Human evaluation shows our model outperforming the WikiHow linear baseline by 48.76% in capturing non-sequential step relationships.
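To make the two motivating statistics concrete, the following is a minimal sketch, not the paper's pipeline. It assumes each video has already been grounded to a list of step names (an "observed step path"), counts consecutive step pairs, and derives a simple graph by frequency thresholding. The function names, the `min_count` threshold, the exact definitions of the two statistics, and the toy pasta data are all illustrative assumptions.

```python
from collections import Counter


def transition_counts(paths):
    """Count how often step a is immediately followed by step b."""
    counts = Counter()
    for path in paths:
        for a, b in zip(path, path[1:]):
            counts[(a, b)] += 1
    return counts


def reverse_order_fraction(counts):
    """Fraction of distinct consecutive pairs (a, b) whose reverse (b, a)
    is also observed. This definition is an assumption for illustration."""
    pairs = [p for p in counts if p[0] != p[1]]
    if not pairs:
        return 0.0
    reversible = sum(1 for (a, b) in pairs if (b, a) in counts)
    return reversible / len(pairs)


def induce_graph(counts, min_count=2):
    """Keep transitions seen at least min_count times (hypothetical threshold)."""
    graph = {}
    for (a, b), n in counts.items():
        if n >= min_count and a != b:
            graph.setdefault(a, set()).add(b)
    return graph


def mean_branching(graph):
    """Average number of frequent next steps per step with outgoing edges."""
    if not graph:
        return 0.0
    return sum(len(nexts) for nexts in graph.values()) / len(graph)


# Toy observed step paths (made up; not taken from CrossTask).
paths = [
    ["boil water", "add pasta", "add salt", "drain"],
    ["boil water", "add salt", "add pasta", "drain"],
    ["boil water", "add pasta", "add salt", "drain"],
    ["boil water", "add salt", "add pasta", "drain"],
]

counts = transition_counts(paths)
graph = induce_graph(counts)
print(reverse_order_fraction(counts))
print(mean_branching(graph))
print(sorted(graph["boil water"]))
```

In this toy data, "add pasta" and "add salt" appear in both orders, so the induced graph branches after "boil water" instead of forcing a single linear order. The model described in the abstract learns such structure from grounded videos rather than from raw counts.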

