STAGE: Spatio-Temporal Attention on Graph Entities for Video Action Detection

12/09/2019
by Matteo Tomei, et al.

Spatio-temporal action localization is a challenging yet fascinating task that aims to detect and classify human actions in video clips. In this paper, we develop a high-level video understanding module that encodes interactions between actors and objects in both space and time. In our formulation, spatio-temporal relationships are learned by performing self-attention operations on a graph structure connecting entities from consecutive clips. Notably, the use of graph learning is unprecedented for this task. From a computational point of view, the proposed module is backbone independent by design and does not need end-to-end training. When tested on the AVA dataset, it demonstrates a 10-16% relative mAP improvement over the baseline. Further, it can outperform, or achieve performance comparable to, state-of-the-art models that require heavy end-to-end and synchronized training on multiple GPUs. Code is publicly available at: https://github.com/aimagelab/STAGE_action_detection.
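The idea of self-attention over a graph of entities from consecutive clips can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that). It assumes each entity is already described by a feature vector and a clip index, links entities whose clips are at most one step apart, and uses the raw features as queries, keys, and values instead of learned projections. The names `build_adjacency`, `graph_self_attention`, and `max_gap` are hypothetical.

```python
import math


def build_adjacency(clip_ids, max_gap=1):
    """Link entities whose clips are at most max_gap apart (assumed rule).

    Every entity is also linked to itself, since the gap is then zero.
    """
    n = len(clip_ids)
    return [[abs(clip_ids[i] - clip_ids[j]) <= max_gap for j in range(n)]
            for i in range(n)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def graph_self_attention(features, adjacency):
    """One scaled dot-product attention pass restricted to graph neighbours.

    Simplification: features act as queries, keys and values directly;
    a trained module would apply learned linear projections first.
    """
    dim = len(features[0])
    scale = math.sqrt(dim)
    updated = []
    for i, query in enumerate(features):
        neighbours = [j for j, linked in enumerate(adjacency[i]) if linked]
        scores = [sum(q * k for q, k in zip(query, features[j])) / scale
                  for j in neighbours]
        weights = softmax(scores)
        # New feature = attention-weighted average of neighbour features.
        updated.append([sum(w * features[j][c]
                            for w, j in zip(weights, neighbours))
                        for c in range(dim)])
    return updated


# Toy input: four entities (actors or objects) spread over three clips.
clip_ids = [0, 0, 1, 2]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]

adjacency = build_adjacency(clip_ids)
updated = graph_self_attention(features, adjacency)
```

In this toy setup the entity in clip 2 attends only to itself and to the entity in clip 1, while entities in clips 0 and 2 never exchange information directly. Because the pass consumes precomputed features, a module of this kind can sit on top of any backbone, which is consistent with the backbone-independence claim in the abstract.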


