E2E-LOAD: End-to-End Long-form Online Action Detection

by Shuqiang Cao, et al.

Recently, there has been a growing trend toward feature-based approaches for Online Action Detection (OAD). However, these approaches are limited by their fixed backbone design, which ignores the potential of a trainable backbone. In this paper, we propose the first end-to-end OAD model, termed E2E-LOAD, designed to address the major challenges of OAD, namely long-term understanding and efficient online reasoning. Specifically, our approach adopts an initial spatial model that is shared by all frames and maintains a long sequence cache for inference at a low computational cost. We also advocate an asymmetric spatial-temporal model that handles long-form and short-form modeling effectively. Furthermore, we propose a novel and efficient inference mechanism that accelerates heavy spatial-temporal exploration. Extensive ablation studies and experiments demonstrate the effectiveness and efficiency of our proposed method. Notably, we achieve 17.3 (+12.6) FPS for end-to-end OAD with 72.4 THUMOS14, TVSeries, and HDD, respectively, which is 3x faster than previous approaches. The source code will be made publicly available.
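
The abstract mentions a spatial model shared by all frames together with a long sequence cache for low-cost inference. The idea can be illustrated with a minimal sketch in which each incoming frame is encoded exactly once and its feature is kept in a fixed-length buffer that later temporal modeling would read from. This is not the authors' implementation: the class `StreamingFeatureCache`, the toy `mean_encoder`, and the `max_len` parameter are all hypothetical names introduced here for illustration.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

Feature = List[float]


class StreamingFeatureCache:
    """Hypothetical helper: stores the most recent per-frame features."""

    def __init__(
        self,
        encode_frame: Callable[[Sequence[float]], Feature],
        max_len: int,
    ) -> None:
        # The same encoder is applied to every frame (a shared spatial model).
        self.encode_frame = encode_frame
        # A bounded deque silently drops the oldest feature when it is full.
        self.cache: Deque[Feature] = deque(maxlen=max_len)
        self.encode_calls = 0

    def step(self, frame: Sequence[float]) -> List[Feature]:
        """Encode only the new frame, then return the cached history."""
        self.cache.append(self.encode_frame(frame))
        self.encode_calls += 1
        return list(self.cache)


def mean_encoder(frame: Sequence[float]) -> Feature:
    # Toy stand-in for a spatial backbone (assumption, not the paper's model).
    return [sum(frame) / len(frame)]


cache = StreamingFeatureCache(mean_encoder, max_len=3)
for t in range(5):
    # Each "frame" is a tiny list of pixel values for demonstration purposes.
    history = cache.step([float(t), float(t + 2)])
```

In this sketch the encoder runs once per incoming frame, so the per-step cost does not grow with the length of the history; only the cached features are revisited.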
