NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory

Searching long egocentric videos with natural language queries (NLQ) has compelling applications in augmented reality and robotics, where a fluid index into everything that a person (agent) has seen before could augment human memory and surface relevant information on demand. However, the structured nature of the learning problem (free-form text query inputs, localized video temporal window outputs) and its needle-in-a-haystack character make it both technically challenging and expensive to supervise. We introduce Narrations-as-Queries (NaQ), a data augmentation strategy that transforms standard video-text narrations into training data for a video query localization model. Validating our idea on the Ego4D benchmark, we find it has tremendous impact in practice. NaQ improves multiple top models by substantial margins (even doubling their accuracy), and yields the very best results to date on the Ego4D NLQ challenge, soundly outperforming all challenge winners in the CVPR and ECCV 2022 competitions and topping the current public leaderboard. Beyond achieving the state of the art for NLQ, we also demonstrate unique properties of our approach, such as gains on long-tail object queries and the ability to perform zero-shot and few-shot NLQ.
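For concreteness, the core NaQ idea can be sketched in a few lines of Python: each timestamped narration is converted into the same (query, temporal window) format that NLQ supervision uses, so it can be mixed directly into the localization model's training set. The Narration and NLQSample classes, the narration_to_nlq helper, and the window-width distribution below are illustrative assumptions for this sketch, not the paper's released implementation.

```python
# Hedged sketch of Narrations-as-Queries (NaQ): turn point-in-time
# narrations into (query, temporal window) training pairs for NLQ.
# All names and the jittering distributions here are assumptions.
from dataclasses import dataclass
import random

@dataclass
class Narration:
    video_id: str
    text: str         # e.g. "C opens the fridge"
    timestamp: float  # single time point (seconds) where the narration applies

@dataclass
class NLQSample:
    video_id: str
    query: str
    start: float
    end: float

def narration_to_nlq(n: Narration, mean_width: float = 4.0) -> NLQSample:
    """Expand a point-in-time narration into a localized response window.

    The true response window is unknown, so the single timestamp is
    jittered: sample a window width and a center offset around the
    narration time (the distributions are assumptions of this sketch).
    """
    width = max(1.0, random.gauss(mean_width, mean_width / 4))
    center = n.timestamp + random.uniform(-width / 4, width / 4)
    start = max(0.0, center - width / 2)
    return NLQSample(n.video_id, n.text, start, start + width)

# Usage: augment the small human-annotated NLQ training set with
# many narration-derived samples.
narrations = [Narration("v1", "C opens the fridge", 12.3)]
augmented = [narration_to_nlq(n) for n in narrations]
```

The point of the conversion is scale: narrations are plentiful and cheap relative to NLQ annotations, so even a noisy narration-derived window gives the localization model vastly more query-grounding supervision than the annotated set alone.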


Related Research

- Language-free Training for Zero-shot Video Grounding (10/24/2022)
  Given an untrimmed video and a language query depicting a specific tempo...

- Encode-Store-Retrieve: Enhancing Memory Augmentation through Language-Encoded Egocentric Perception (08/10/2023)
  We depend on our own memory to encode, store, and retrieve our experienc...

- MINOTAUR: Multi-task Video Grounding From Multimodal Queries (02/16/2023)
  Video understanding tasks take many forms, from action detection to visu...

- Learning Video Models from Text: Zero-Shot Anticipation for Procedural Actions (06/06/2021)
  Can we teach a robot to recognize and make predictions for activities th...

- CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning (03/21/2022)
  Gameplay videos contain rich information about how players interact with...

- Saying What You're Looking For: Linguistics Meets Video Search (09/20/2013)
  We present an approach to searching large video corpora for video clips ...

- Retro-Actions: Learning 'Close' by Time-Reversing 'Open' Videos (09/20/2019)
  We investigate video transforms that result in class-homogeneous label-t...
