Optimizing ViViT Training: Time and Memory Reduction for Action Recognition

by Shreyank N Gowda, et al.

In this paper, we address the substantial training time and memory consumption of video transformers, using the ViViT (Video Vision Transformer) model, in particular its Factorised Encoder variant, as our baseline for action recognition. The Factorised Encoder variant follows the late-fusion approach adopted by many state-of-the-art methods. Although it stands out among the ViViT variants for its favorable speed/accuracy trade-off, its considerable training time and memory requirements still pose a significant barrier to entry. Our method lowers this barrier and is based on freezing the spatial transformer (the module that attends over regions within each input frame) during training. Done naively, this yields a low-accuracy model. But we show that by (1) appropriately initializing the temporal transformer (the module responsible for processing temporal information) and (2) introducing a compact adapter connecting the frozen spatial representations to the temporal transformer, we can enjoy the benefits of freezing the spatial transformer without sacrificing accuracy. Through extensive experiments on 6 benchmarks, we demonstrate that our training strategy substantially reduces training cost (by ∼50%) and memory consumption while maintaining or slightly improving performance, by up to 1.79%, compared to the baseline model. Our approach additionally makes it possible to use larger image transformer models as the spatial transformer and to process more frames within the same memory budget.
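To make the setup concrete, the sketch below shows the factorised-encoder structure with a frozen spatial transformer, a compact adapter, and a trainable temporal transformer, as described in the abstract. This is a minimal illustration, not the authors' implementation: the module sizes, the mean-pooling of tokens, and the bottleneck-MLP adapter design are assumptions made for the example.

```python
import torch
import torch.nn as nn


class FactorisedEncoder(nn.Module):
    """Illustrative factorised encoder: frozen spatial transformer,
    small trainable adapter, trainable temporal transformer."""

    def __init__(self, dim=64, n_heads=4):
        super().__init__()
        # Spatial transformer (stand-in for a large pretrained ViT):
        # frozen, so no gradients or optimizer state are kept for it.
        self.spatial = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, n_heads, batch_first=True),
            num_layers=1,
        )
        for p in self.spatial.parameters():
            p.requires_grad = False  # freeze spatial weights
        # Compact adapter bridging frozen spatial features to the
        # temporal transformer (bottleneck MLP; design is an assumption).
        self.adapter = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.GELU(), nn.Linear(dim // 4, dim)
        )
        # Temporal transformer: trainable, consumes one token per frame.
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, n_heads, batch_first=True),
            num_layers=1,
        )

    def forward(self, x):
        # x: (batch, frames, tokens, dim)
        b, t, n, d = x.shape
        s = self.spatial(x.reshape(b * t, n, d))       # per-frame spatial encoding
        frame_tok = s.mean(dim=1).reshape(b, t, d)     # pool tokens into a frame token
        frame_tok = frame_tok + self.adapter(frame_tok)  # residual adapter
        return self.temporal(frame_tok).mean(dim=1)    # clip-level representation


model = FactorisedEncoder().eval()
with torch.no_grad():
    clip_feat = model(torch.randn(2, 8, 5, 64))  # 2 clips, 8 frames, 5 tokens
```

Only the adapter and temporal transformer would receive gradient updates here, which is what reduces both training time and the memory held for activations and optimizer state.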


Video Swin Transformer

The vision community is witnessing a modeling shift from CNNs to Transfo...

Temporal Transformer Networks with Self-Supervision for Action Recognition

In recent years, 2D Convolutional Networks-based video action recognitio...

Performance Evaluation of Swin Vision Transformer Model using Gradient Accumulation Optimization Technique

Vision Transformers (ViTs) have emerged as a promising approach for visu...

More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation

Current state-of-the-art models for video action recognition are mostly ...

Space-time Mixing Attention for Video Transformer

This paper is on video recognition using Transformers. Very recent attem...

Video Transformer Network

This paper presents VTN, a transformer-based framework for video recogni...

Efficient Convolution and Transformer-Based Network for Video Frame Interpolation

Video frame interpolation is an increasingly important research task wit...
