Skeleton-Based Action Recognition with Synchronous Local and Non-local Spatio-temporal Learning and Frequency Attention
Benefiting from its succinctness and robustness, skeleton-based human action recognition has recently attracted much attention. Most existing methods utilize local networks, such as recurrent networks, convolutional neural networks, and graph convolutional networks, to extract spatio-temporal dynamics hierarchically. As a consequence, the local and non-local dependencies, which respectively contain more details and semantics, are asynchronously captured in different level of layers. Moreover, limited to the spatio-temporal domain, these methods ignored patterns in the frequency domain. To better extract information from multi-domains, we propose a residual frequency attention (rFA) to focus on discriminative patterns in the frequency domain, and a synchronous local and non-local (SLnL) block to simultaneously capture the details and semantics in the spatio-temporal domain. To optimize the whole process, we also propose a soft-margin focal loss (SMFL), which can automatically conducts adaptive data selection and encourages intrinsic margins in classifiers. Extensive experiments are performed on several large-scale action recognition datasets and our approach significantly outperforms other state-of-the-art methods.
READ FULL TEXT 
  
  
     share
 share