Correlation Net: Spatiotemporal Multimodal Deep Learning
This paper describes a network that captures spatiotemporal correlations over arbitrary periods of time. The proposed scheme operates as a complementary, extended network over spatiotemporal regions. Multimodal fusion has recently been researched extensively in deep learning. For action recognition, the spatial and temporal streams are vital components of deep convolutional neural networks (CNNs), but reducing overfitting and fusing these two streams remain open problems. The existing fusion approach is to average the two streams. To address these problems, we propose a correlation network with a Shannon regularizer that learns on top of a CNN that has already been trained. Long-range video may contain spatiotemporal correlations over arbitrary periods of time. These correlations can be captured using simple fully connected layers, which form the correlation network. The correlation network is found to be complementary to existing network fusion methods. We evaluate our approach on the UCF-101 and HMDB-51 datasets, and the resulting improvement in accuracy demonstrates the importance of multimodal correlation.
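To make the contrast between averaging and a learned fusion concrete, the following is a minimal sketch and not the authors' implementation. It assumes each stream outputs a per-class probability vector, that the correlation network is a single fully connected layer over the concatenated stream outputs, and that the Shannon regularizer is the Shannon entropy of the fused prediction. The abstract does not specify any of these details; all function names and weights below are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def average_fusion(spatial_probs, temporal_probs):
    """Baseline fusion: element-wise mean of the two streams' class scores."""
    return [(s + t) / 2.0 for s, t in zip(spatial_probs, temporal_probs)]

def dense(inputs, weights, bias):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def correlation_fusion(spatial_probs, temporal_probs, weights, bias):
    """Assumed fusion: a dense layer over the concatenated stream outputs,
    so each fused class score can depend on every class of both streams."""
    joint = list(spatial_probs) + list(temporal_probs)
    return softmax(dense(joint, weights, bias))

def shannon_entropy(probs, eps=1e-12):
    """Shannon entropy in nats; assumed here to be the regularization term."""
    return -sum(p * math.log(p + eps) for p in probs)

if __name__ == "__main__":
    # Hypothetical stream outputs for a 3-class problem.
    spatial = softmax([2.0, 0.5, 0.1])
    temporal = softmax([0.3, 1.8, 0.2])

    # Hypothetical weights: each output sums the matching class of both streams.
    weights = [[1, 0, 0, 1, 0, 0],
               [0, 1, 0, 0, 1, 0],
               [0, 0, 1, 0, 0, 1]]
    bias = [0.0, 0.0, 0.0]

    fused = correlation_fusion(spatial, temporal, weights, bias)
    print("average fusion:    ", average_fusion(spatial, temporal))
    print("correlation fusion:", fused)
    print("entropy of fused:  ", shannon_entropy(fused))
```

In a trained model the weights would be learned with the stream CNNs held fixed or fine-tuned, and the entropy term would be added to the classification loss with some weighting factor; both choices are assumptions here rather than details given in the abstract.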