Compressing Vision Transformers for Low-Resource Visual Learning

by Eric Youn, et al.

Vision transformers (ViTs) and their variants have swept through visual learning leaderboards and offer state-of-the-art accuracy in tasks such as image classification, object detection, and semantic segmentation by attending to different parts of the visual input and capturing long-range spatial dependencies. However, these models are large and computationally heavy. For instance, the recently proposed ViT-B model has 86M parameters, making it impractical for deployment on resource-constrained devices. As a result, their deployment in mobile and edge scenarios is limited. In our work, we aim to take a step toward bringing vision transformers to the edge by utilizing popular model compression techniques such as distillation, pruning, and quantization. Our chosen application environment is an unmanned aerial vehicle (UAV) that is battery-powered and memory-constrained, carrying a single-board computer on the scale of an NVIDIA Jetson Nano with 4GB of RAM. At the same time, the UAV requires accuracy close to that of state-of-the-art ViTs to ensure safe object avoidance in autonomous navigation or correct localization of humans in search-and-rescue. Inference latency should also be minimized given the application requirements. Hence, our target is to enable rapid inference of a vision transformer on an NVIDIA Jetson Nano (4GB) with minimal accuracy loss. This allows us to deploy ViTs on resource-constrained devices, opening up new possibilities in surveillance, environmental monitoring, etc. Our implementation is made available at
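To make the memory argument concrete, the following is a back-of-the-envelope sketch of how much storage the 86M ViT-B parameters quoted above would need at different numeric precisions. The function name `weight_memory_mb` and the chosen bit widths (32, 16, 8) are illustrative assumptions, not the authors' configuration, and the estimate covers weights only; activations, framework overhead, and the operating system also compete for the Jetson Nano's 4GB of RAM.

```python
def weight_memory_mb(num_params: int, bits_per_param: int) -> float:
    """Weight storage in megabytes (1 MB = 10**6 bytes), weights only."""
    return num_params * bits_per_param / 8 / 1e6


# Parameter count for ViT-B as quoted in the abstract.
VIT_B_PARAMS = 86_000_000

# Hypothetical precisions, chosen for illustration only.
PRECISIONS = {"fp32": 32, "fp16": 16, "int8": 8}

footprints = {
    name: weight_memory_mb(VIT_B_PARAMS, bits)
    for name, bits in PRECISIONS.items()
}

if __name__ == "__main__":
    for name, mb in footprints.items():
        print(f"{name}: {mb:.0f} MB")
```

Under these assumptions, moving from 32-bit to 8-bit weights cuts weight storage by a factor of four, which is one reason quantization is attractive on memory-constrained boards.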




Training Strategies for Vision Transformers for Object Detection

Vision-based Transformer have shown huge application in the perception m...

Improving the Efficiency of Transformers for Resource-Constrained Devices

Transformers provide promising accuracy and have become popular and used...

Separable Self-attention for Mobile Vision Transformers

Mobile vision transformers (MobileViT) can achieve state-of-the-art perf...

Q-HyViT: Post-Training Quantization for Hybrid Vision Transformer with Bridge Block Reconstruction

Recently, vision transformers (ViT) have replaced convolutional neural n...

A Unified Pruning Framework for Vision Transformers

Recently, vision transformer (ViT) and its variants have achieved promis...

PnP-DETR: Towards Efficient Visual Analysis with Transformers

Recently, DETR pioneered the solution of vision tasks with transformers,...

Auto-Compressing Subset Pruning for Semantic Image Segmentation

State-of-the-art semantic segmentation models are characterized by high ...
