Compressed Video Action Recognition

by Chao-Yuan Wu, et al.

Training robust deep video representations has proven to be much more challenging than learning deep image representations, and this has consequently hampered tasks like video action recognition. This is in part due to the enormous size of raw video streams, the associated amount of computation required, and the high temporal redundancy. The 'true' and interesting signal is often drowned in too much irrelevant data. Motivated by the fact that the superfluous information can be reduced by up to two orders of magnitude with video compression techniques (like H.264, HEVC, etc.), in this work, we propose to train a deep network directly on the compressed video, devoid of redundancy, rather than on the traditional highly redundant RGB stream. This representation has a higher information density, and we found the training to be easier. In addition, the signals in a compressed video provide free, albeit noisy, motion information. We propose novel techniques to use them effectively. Our approach is about 4.6 times faster than a state-of-the-art 3D-CNN model, 2.7 times faster than a ResNet-152, and very easy to implement. On the task of action recognition, our approach outperforms all the other methods on the UCF-101, HMDB-51, and Charades datasets.
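
As background on the "free" motion information mentioned above: codecs such as H.264 typically store occasional full frames and encode the remaining frames as motion vectors plus a residual relative to a reference frame. The NumPy sketch below illustrates that relationship under simplifying assumptions (one integer motion vector per pixel rather than per block, no sub-pixel interpolation). The function name `reconstruct_p_frame` and its array layout are hypothetical; this is not the paper's code or any codec's actual decoding routine.

```python
import numpy as np


def reconstruct_p_frame(reference, motion, residual):
    """Rebuild a predicted frame from a reference frame, motion vectors, and a residual.

    Simplified illustration (assumed layout, not a real codec API):
      reference: (H, W, C) array, the previously decoded frame.
      motion:    (H, W, 2) integer array; motion[y, x] = (dy, dx) is the offset
                 from pixel (y, x) to its source location in the reference.
      residual:  (H, W, C) array, the correction added after motion compensation.
    """
    h, w = reference.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Look up each pixel's source location, clamping at the frame border.
    src_y = np.clip(ys + motion[..., 0], 0, h - 1)
    src_x = np.clip(xs + motion[..., 1], 0, w - 1)
    # Motion-compensated prediction plus the stored residual.
    return reference[src_y, src_x] + residual


if __name__ == "__main__":
    ref = np.arange(16, dtype=np.int64).reshape(4, 4, 1)
    mv = np.zeros((4, 4, 2), dtype=np.int64)
    mv[..., 1] = -1  # every pixel copies its left neighbour: content shifts right
    frame = reconstruct_p_frame(ref, mv, np.zeros_like(ref))
    print(frame[..., 0])
```

In this toy setting the motion array and the residual together carry everything needed to rebuild the frame from its reference. That is why a network that reads these signals directly can skip most of the redundant RGB data.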




Flow-Distilled IP Two-Stream Networks for Compressed Video Action Recognition

Two-stream networks have achieved great success in video recognition. A ...

Faster and Accurate Compressed Video Action Recognition Straight from the Frequency Domain

Human action recognition has become one of the most active field of rese...

TEAM-Net: Multi-modal Learning for Video Action Recognition with Partial Decoding

Most of existing video action recognition models ingest raw RGB frames. ...

Speeding Up Action Recognition Using Dynamic Accumulation of Residuals in Compressed Domain

With the widespread use of installed cameras, video-based monitoring app...

Mimic The Raw Domain: Accelerating Action Recognition in the Compressed Domain

Video understanding usually requires expensive computation that prohibit...

T-RECS: Training for Rate-Invariant Embeddings by Controlling Speed for Action Recognition

An action should remain identifiable when modifying its speed: consider ...

Efficient Action Detection in Untrimmed Videos via Multi-Task Learning

This paper studies the joint learning of action recognition and temporal...
