TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval

01/24/2020
by   Jie Lei, et al.
8

We introduce a new multimodal retrieval task - TV show Retrieval (TVR), in which a short video moment has to be localized from a large video (with subtitle) corpus, given a natural language query. Different from previous moment retrieval tasks dealing with videos only, TVR requires the system to understand both the video and the associated subtitle text, making it a more realistic task. To support the study of this new task, we have collected a large-scale, high-quality dataset consisting of 108,965 queries on 21,793 videos from 6 TV shows of diverse genres, where each query is associated with a tight temporal alignment. Strict qualification and post-annotation verification tests are applied to ensure the quality of the collected data. We present several baselines and a novel Cross-modal Moment Localization (XML) modular network for this new dataset and task. The proposed XML model surpasses all presented baselines by a large margin and with better efficiency, providing a strong starting point for future work. Extensive analysis experiments also show that incorporating both video and subtitle modules yields better performance than either alone. Lastly, we have also collected additional descriptions for each annotated moment in TVR to form a new multimodal captioning dataset with 262K captions, named the TV show Caption dataset (TVC). Here models need to jointly use the video and subtitle to generate a caption description. Both datasets are publicly available at https://tvr.cs.unc.edu

READ FULL TEXT

page 1

page 8

page 9

page 11

page 12

page 14

page 15

page 16

research
07/30/2021

MTVR: Multilingual Moment Retrieval in Videos

We introduce mTVR, a large-scale multilingual video moment retrieval dat...
research
07/30/2019

Temporal Localization of Moments in Video Collections with Natural Language

In this paper, we introduce the task of retrieving relevant video moment...
research
03/29/2023

Hierarchical Video-Moment Retrieval and Step-Captioning

There is growing interest in searching for information from large video ...
research
03/25/2020

VIOLIN: A Large-Scale Dataset for Video-and-Language Inference

We introduce a new task, Video-and-Language Inference, for joint multimo...
research
11/30/2021

AssistSR: Affordance-centric Question-driven Video Segment Retrieval

It is still a pipe dream that AI assistants on phone and AR glasses can ...
research
04/04/2021

FixMyPose: Pose Correctional Captioning and Retrieval

Interest in physical therapy and individual exercises such as yoga/dance...
research
05/23/2023

Faster Video Moment Retrieval with Point-Level Supervision

Video Moment Retrieval (VMR) aims at retrieving the most relevant events...

Please sign up or login with your details

Forgot password? Click here to reset