Referring to Objects in Videos using Spatio-Temporal Identifying Descriptions

This paper presents a new task: grounding spatio-temporal identifying descriptions in videos. Previous work suggests potential bias in existing datasets and emphasizes the need for a new data creation schema that better models linguistic structure. We introduce a new data collection scheme based on grammatical constraints for surface realization, which enables us to investigate the problem of grounding spatio-temporal identifying descriptions in videos. We then propose a two-stream modular attention network that learns to ground spatio-temporal identifying descriptions based on appearance and motion. We show that the motion modules help to ground motion-related words and also improve learning in the appearance modules, because the modular design resolves task interference between modules. Finally, we identify a future challenge: building a system robust to replacing ground-truth visual annotations with an automatic video object detector and temporal event localizer.
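The two-stream idea can be illustrated with a minimal sketch: a soft gate routes each word of the description toward an appearance module or a motion module, each module matches its gated phrase embedding against the corresponding visual stream of every candidate object track, and the per-module scores are fused in proportion to the attention each stream received. This is a hypothetical illustration under assumed shapes and parameter names, not the paper's actual architecture or implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def two_stream_score(word_vecs, app_feats, mot_feats, W_gate, W_app, W_mot):
    """Score each candidate object track against a description.

    word_vecs: (T, d) word embeddings of the description
    app_feats: (N, d) appearance features, one per candidate track
    mot_feats: (N, d) motion features, one per candidate track
    W_gate:   (d, 2) soft-routes each word to the appearance/motion module
    W_app, W_mot: (d, d) per-module bilinear matching weights
    Returns an (N,) array of grounding scores.
    """
    # Soft routing: how strongly each word belongs to each module
    gate = softmax(word_vecs @ W_gate, axis=1)                      # (T, 2)
    # Per-module phrase embedding: gate-weighted average of word vectors
    app_phrase = (gate[:, :1] * word_vecs).sum(0) / (gate[:, 0].sum() + 1e-8)
    mot_phrase = (gate[:, 1:] * word_vecs).sum(0) / (gate[:, 1].sum() + 1e-8)
    # Each module matches its phrase embedding against its visual stream
    app_scores = app_feats @ (W_app @ app_phrase)                   # (N,)
    mot_scores = mot_feats @ (W_mot @ mot_phrase)                   # (N,)
    # Fuse module scores in proportion to total routed attention
    w = gate.sum(0) / gate.sum()
    return w[0] * app_scores + w[1] * mot_scores
```

Because the gate is computed per word, motion-heavy phrases ("the dog running left") lean on the motion stream while attribute-heavy phrases ("the brown dog") lean on appearance, which is one way to read the paper's claim that modularity resolves task interference.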
