Multi-Level Recurrent Residual Networks for Action Recognition

11/22/2017

Most existing Convolutional Neural Networks (CNNs) used for action recognition are either difficult to optimize or underuse crucial temporal information. Inspired by the fact that LSTM consistently achieves breakthroughs on sequence-related tasks, we propose a novel Multi-Level Recurrent Residual Networks (MRRN) model that incorporates three separate recognition streams. The proposed model captures spatiotemporal information by employing ResNets to learn spatial representations from static frames and stacked SRUs to learn temporal dynamics. The three models, each operating at a distinct representation level, are fused by averaging their softmax scores to obtain complementary video representations. They are trained end-to-end with greater efficiency than state-of-the-art models. Our contributions are as follows. First, we qualitatively analyze the effect of diverse hyper-parameter settings to illustrate the general tendency of performance. Second, we experiment with low-, mid-, and high-level representations of the video under various temporal pooling schemes, demonstrating experimentally how much each level of representation contributes to action recognition. Third, we compare the computational complexity of competitive methods to verify the efficiency of our model. Finally, a series of experiments is carried out on two standard video action benchmarks, the HMDB-51 and UCF-101 datasets. Experimental results show that MRRN exceeds the majority of models that take only RGB data as input and obtains performance comparable to the state of the art without additional data, achieving 51.3 and 81.9
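The fusion step described in the abstract, averaging the softmax scores of the three streams, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `softmax` and `fuse_streams` and the example logits are hypothetical, and equal weighting of the streams is assumed because the abstract only mentions averaging.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_streams(stream_logits):
    """Late fusion: average the per-stream softmax scores.

    stream_logits is a list with one list of class logits per stream.
    Returns the fused class scores and the index of the top class.
    """
    if not stream_logits:
        raise ValueError("need at least one stream")
    num_classes = len(stream_logits[0])
    if any(len(logits) != num_classes for logits in stream_logits):
        raise ValueError("all streams must score the same classes")

    scores = [softmax(logits) for logits in stream_logits]
    # Equal-weight average across streams, class by class (assumption).
    fused = [sum(column) / len(scores) for column in zip(*scores)]
    predicted = max(range(num_classes), key=fused.__getitem__)
    return fused, predicted


# Hypothetical logits for a 3-class problem, one list per stream
# (low-, mid-, and high-level); the values are made up for illustration.
low_level = [2.0, 1.0, 0.1]
mid_level = [0.5, 2.5, 0.3]
high_level = [0.2, 2.0, 1.5]

fused, predicted = fuse_streams([low_level, mid_level, high_level])
print(predicted, [round(score, 3) for score in fused])
```

In this toy example the low-level stream alone would pick class 0, but the averaged scores favor class 1, which is the kind of complementary behavior that score-level fusion is meant to exploit.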

Related Research

RGB-D Based Action Recognition with Light-weight 3D Convolutional Networks

AdaScan: Adaptive Scan Pooling in Deep Convolutional Neural Networks for Human Action Recognition in Videos

A Closer Look at Spatiotemporal Convolutions for Action Recognition

Recurrent Residual Learning for Action Recognition

Spatiotemporal Tile-based Attention-guided LSTMs for Traffic Video Prediction

Learning Multi-level Features For Sensor-based Human Action Recognition

Recurrence to the Rescue: Towards Causal Spatiotemporal Representations