TL;DW? Summarizing Instructional Videos with Task Relevance Cross-Modal Saliency

by Medhini Narasimhan, et al.

YouTube users looking for instructions for a specific task may spend a long time browsing content to find the video that matches their needs. A visual summary (an abridged version of a video) gives viewers a quick overview and greatly reduces search time. In this work, we focus on summarizing instructional videos, an under-explored area of video summarization. Unlike generic videos, instructional videos can be parsed into semantically meaningful segments that correspond to important steps of the demonstrated task. Existing video summarization datasets rely on manual frame-level annotations, which makes them subjective and limited in size. To overcome this, we first automatically generate pseudo summaries for a corpus of instructional videos by exploiting two key assumptions: (i) relevant steps are likely to appear in multiple videos of the same task (Task Relevance), and (ii) they are more likely to be described verbally by the demonstrator (Cross-Modal Saliency). We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer. Using pseudo summaries as weak supervision, our network constructs a visual summary for an instructional video given only video and transcribed speech. To evaluate our model, we collect a high-quality test set, WikiHow Summaries, by scraping WikiHow articles that contain video demonstrations and visual depictions of steps, which allows us to obtain ground-truth summaries. We outperform several baselines and a state-of-the-art video summarization model on this new benchmark.
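The two assumptions behind the pseudo summaries can be illustrated with a small sketch. This is not the authors' procedure: the abstract does not say how the two signals are computed or combined, so the scoring rules below (best cosine match against other videos of the same task for Task Relevance, cosine similarity between a segment and its transcribed speech for Cross-Modal Saliency, an equal-weight average, and a top-k cut-off) are all assumptions, as are the function names and the idea that each segment already comes with a visual embedding and a speech embedding.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def task_relevance(segment, other_videos):
    """Assumed Task Relevance: mean, over the other videos of the same task,
    of the best cosine match between this segment and any of their segments."""
    best = [max(cosine(segment, s) for s in video) for video in other_videos if video]
    return sum(best) / len(best) if best else 0.0


def pseudo_summary(videos, speech, index, keep=2):
    """Score the segments of videos[index] and keep the `keep` best ones.

    videos: list of videos, each a list of segment embeddings (hypothetical input)
    speech: same shape, one speech embedding per segment (hypothetical input)
    Returns (kept segment indices in temporal order, per-segment scores).
    """
    others = [v for i, v in enumerate(videos) if i != index]
    scores = []
    for visual, spoken in zip(videos[index], speech[index]):
        relevance = task_relevance(visual, others)
        saliency = cosine(visual, spoken)  # assumed Cross-Modal Saliency
        scores.append(0.5 * (relevance + saliency))  # equal weighting is an assumption
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep]), scores
```

Calling `pseudo_summary(videos, speech, index=0, keep=2)` returns the indices of the two highest-scoring segments of the first video, in temporal order, together with every segment's score. Selections of this kind could then serve as the weak supervision the abstract describes.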


VideoXum: Cross-modal Visual and Textural Summarization of Videos

Video summarization aims to distill the most important information from ...

A Closer Look at Temporal Ordering in the Segmentation of Instructional Videos

Understanding the steps required to perform a task is an important skill...

Visual Summarization of Lecture Video Segments for Enhanced Navigation

Lecture videos are an increasingly important learning resource for highe...

Query-based Video Summarization with Pseudo Label Supervision

Existing datasets for manually labelled query-based video summarization ...

A Multi-stage deep architecture for summary generation of soccer videos

Video content is present in an ever-increasing number of fields, both sc...

Co-Regularized Deep Representations for Video Summarization

Compact keyframe-based video summaries are a popular way of generating v...

Textually Customized Video Summaries

The best summary of a long video differs among different people due to i...
