Multi-Level Recurrent Residual Networks for Action Recognition
Most existing convolutional neural networks (CNNs) used for action recognition are either difficult to optimize or underuse crucial temporal information. Inspired by the consistent breakthroughs LSTMs have made on sequence-related tasks, we propose a novel Multi-Level Recurrent Residual Network (MRRN) that incorporates three separate recognition streams. The proposed model captures spatiotemporal information by employing ResNets to learn spatial representations from static frames and stacked SRUs to learn temporal dynamics. The three distinct-level models are fused by averaging their softmax scores to obtain complementary video representations, and they are trained end-to-end with greater efficiency than state-of-the-art models. Our contributions are as follows. First, we qualitatively analyze the effect of diverse hyper-parameter settings to illustrate general performance tendencies. Second, we experiment with low-, mid-, and high-level representations of the video under various temporal pooling schemes, demonstrating empirically how much each level of representation contributes to action recognition. Third, we compare the computational complexity of competitive methods to verify efficiency. Finally, we carry out a series of experiments on two standard action recognition benchmarks, HMDB-51 and UCF-101. Experimental results show that MRRN exceeds the majority of models taking only RGB data as input and obtains performance comparable to the state of the art without additional data, achieving 51.3% and 81.9%, respectively.
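The late-fusion step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each of the three streams outputs per-class logits for a clip, and the function and variable names (`fuse_streams`, `low`, `mid`, `high`) are hypothetical.

```python
import numpy as np

def softmax(logits):
    # numerically stable softmax over the last axis
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fuse_streams(low_logits, mid_logits, high_logits):
    """Fuse three recognition streams by averaging their softmax scores.

    Each argument holds per-class logits from one level of the model
    (argument names are illustrative, not taken from the paper).
    """
    scores = [softmax(l) for l in (low_logits, mid_logits, high_logits)]
    return np.mean(scores, axis=0)

# hypothetical per-class logits from the three streams for one video clip
low = np.array([2.0, 0.5, 0.1])
mid = np.array([1.5, 1.0, 0.2])
high = np.array([2.5, 0.3, 0.4])

fused = fuse_streams(low, mid, high)
pred = int(np.argmax(fused))  # predicted action class index
```

Averaging probabilities (rather than raw logits) keeps each stream's contribution on a common scale, so no one stream's logit magnitude dominates the fused prediction.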