Unsupervised learning from videos using temporal coherency deep networks

by Carolina Redondo-Cabrera, et al.

In this work we address the challenging problem of unsupervised learning from videos. Existing methods use the spatio-temporal continuity of contiguous video frames as a regularizer for the learning process. Typically, the temporal coherence of nearby frames serves as a free form of annotation, encouraging the learned representations to differ only slightly between these frames. However, this type of approach fails to capture the dissimilarity between videos with different content, and therefore learns less discriminative features. Here we propose two Siamese architectures for Convolutional Neural Networks, together with their corresponding novel loss functions, to learn from unlabeled videos. These jointly exploit the local temporal coherence between contiguous frames and a global discriminative margin that separates the representations of different videos. We present an extensive experimental evaluation in which we validate the proposed models on various tasks. First, we show how the learned features can be used to discover actions and scenes in video collections. Second, we show the benefits of such unsupervised learning from unlabeled videos alone: the learned features can be used directly as a prior for the supervised recognition of actions and objects in images, where our results further show that they can even surpass a traditional, heavily supervised pre-training plus fine-tuning strategy.
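To make the combination of a local coherence term and a global margin more concrete, the following is a minimal sketch of a generic contrastive-style pair loss. It is not the loss proposed in the paper, whose exact form the abstract does not give. The function names `pair_loss` and `batch_loss`, the squared Euclidean distance, and the default `margin=1.0` are all assumptions chosen for illustration.

```python
import math


def pair_loss(feat_a, feat_b, same_video, margin=1.0):
    """Illustrative contrastive-style loss for one pair of frame features.

    Assumption: this is a generic formulation, not the paper's loss.
    feat_a, feat_b: equal-length sequences of floats (frame embeddings).
    same_video: True if the frames are contiguous frames of one video.
    """
    dist = math.dist(feat_a, feat_b)  # Euclidean distance between embeddings
    if same_video:
        # Local temporal coherence: pull contiguous frames together.
        return dist ** 2
    # Global margin: push frames of different videos at least `margin` apart.
    return max(0.0, margin - dist) ** 2


def batch_loss(pairs, margin=1.0):
    """Average the pair loss over (feat_a, feat_b, same_video) triples."""
    return sum(pair_loss(a, b, s, margin) for a, b, s in pairs) / len(pairs)


if __name__ == "__main__":
    pairs = [
        ([0.0, 0.0], [0.5, 0.0], True),   # contiguous frames, same video
        ([0.0, 0.0], [0.5, 0.0], False),  # frames from different videos
    ]
    print(batch_loss(pairs))
```

In this sketch, pairs from the same video are penalized for any distance between their embeddings, while pairs from different videos are penalized only when they fall closer than the margin. In a Siamese setup both embeddings would come from the same network with shared weights.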




