Self-supervised Video Representation Learning with Cascade Positive Retrieval

by Cheng-En Wu, et al.

Self-supervised video representation learning has been shown to effectively improve downstream tasks such as video retrieval and action recognition. In this paper, we present Cascade Positive Retrieval (CPR), which successively mines positive examples w.r.t. the query for contrastive learning in a cascade of stages. Specifically, CPR exploits multiple views of a query example in different modalities, where an alternative view may help find another positive example that is dissimilar in the query view. We explore the effects of possible CPR configurations in ablations, including the number of mining stages, the top similar example selection ratio in each stage, and progressive training with an incremental number of the final Top-k selection. The overall mining quality is measured to reflect the recall across training set classes. CPR reaches a median class mining recall of 83.3%. Implementation-wise, CPR is complementary to pretext tasks and can be easily applied to previous work. In the evaluation of pretraining on UCF101, CPR consistently improves existing work and even achieves a state-of-the-art R@1 of 56.7% in video retrieval, along with strong action recognition performance on UCF101 and HMDB51. For transfer from the large video dataset Kinetics400 to UCF101 and HMDB51, CPR benefits existing work, showing a competitive Top-1 accuracy of 85.1% in action recognition. The code will be released soon for reproducing the results.
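The cascade mining described above can be illustrated with a minimal sketch: each stage ranks the surviving candidates by their similarity to the query in one modality view, keeps the top fraction, and hands the survivors to the next stage; the final Top-k survivors are treated as positives. All function and parameter names here are hypothetical, and the similarity measure (cosine) and selection rule are simplifying assumptions, not the paper's exact implementation.

```python
import numpy as np

def cascade_positive_retrieval(query_views, candidate_views, ratios, top_k):
    """Hypothetical sketch of CPR-style cascade positive mining.

    query_views:     list of 1-D query feature vectors, one per modality view.
    candidate_views: list of 2-D arrays (num_candidates x dim), one per view,
                     with rows aligned across views.
    ratios:          fraction of surviving candidates kept at each stage.
    top_k:           number of final positive indices to return.
    """
    num_candidates = candidate_views[0].shape[0]
    survivors = np.arange(num_candidates)          # all candidates enter stage 1
    for q, feats_all, r in zip(query_views, candidate_views, ratios):
        feats = feats_all[survivors]
        # cosine similarity between the query and survivors in this view
        sims = feats @ q / (np.linalg.norm(feats, axis=1) * np.linalg.norm(q) + 1e-8)
        keep = max(top_k, int(np.ceil(r * len(survivors))))
        order = np.argsort(-sims)[:keep]           # most similar first
        survivors = survivors[order]               # pass survivors to next stage
    return survivors[:top_k]                       # mined positives for the query
```

A candidate that looks only moderately similar in the first view can still survive if the selection ratio is loose, and then be promoted by a later view in which it closely matches the query, which is the intuition behind using alternative modality views.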


