Prompted Contrast with Masked Motion Modeling: Towards Versatile 3D Action Representation Learning

by Jiahang Zhang, et al.

Self-supervised learning has proved effective for skeleton-based human action understanding, which is an important yet challenging topic. Previous works mainly rely on either the contrastive learning or the masked motion modeling paradigm to model skeleton relations. However, these methods cannot handle sequence-level and joint-level representation learning effectively at the same time. As a result, the learned representations fail to generalize across different downstream tasks. Moreover, combining the two paradigms naively leaves the synergy between them untapped and can lead to interference during training. To address these problems, we propose Prompted Contrast with Masked Motion Modeling, PCM^3, for versatile 3D action representation learning. Our method integrates the contrastive learning and masked prediction tasks in a mutually beneficial manner, which substantially boosts the generalization capacity for various downstream tasks. Specifically, masked prediction provides novel training views for contrastive learning, which in turn guides masked prediction training with high-level semantic information. Moreover, we propose a dual-prompted multi-task pretraining strategy, which further improves the learned representations by reducing the interference caused by learning the two different pretext tasks. Extensive experiments on five downstream tasks across three large-scale datasets demonstrate the superior generalization capacity of PCM^3 compared to state-of-the-art methods. Our project is publicly available at: .
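
The two pretext tasks named in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the helper names (`mask_joints`, `masked_mse`, `info_nce`), the zero-fill masking, the cosine-similarity InfoNCE form, and the temperature value are all assumptions chosen for illustration, and the abstract does not specify any of them. A skeleton sequence is assumed to be a list of frames, each frame a list of (x, y, z) joint tuples.

```python
import math
import random

def mask_joints(seq, ratio, rng):
    """Zero out a random subset of (frame, joint) positions.

    Returns the masked copy and a boolean mask of the same layout.
    Hypothetical helper; zero-fill is an assumed masking scheme.
    """
    masked, mask = [], []
    for frame in seq:
        row = [rng.random() < ratio for _ in frame]
        masked.append([(0.0, 0.0, 0.0) if m else j for j, m in zip(frame, row)])
        mask.append(row)
    return masked, mask

def masked_mse(pred, target, mask):
    """Joint-level objective: squared error averaged over masked positions only."""
    total, count = 0.0, 0
    for p_frame, t_frame, row in zip(pred, target, mask):
        for p, t, m in zip(p_frame, t_frame, row):
            if m:
                total += sum((a - b) ** 2 for a, b in zip(p, t))
                count += 1
    return total / count if count else 0.0

def cosine(u, v):
    """Cosine similarity between two flat vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-12)

def info_nce(query, positive, negatives, temperature=0.1):
    """Sequence-level objective: InfoNCE loss with the positive at index 0."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    top = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum - logits[0]
```

In a multi-task setup of the kind the abstract describes, a weighted sum of the two losses would be minimized, and the masked or reconstructed sequence could serve as an additional view for the contrastive term. The weighting and the view construction are left open here because the abstract gives no details.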




