DIME-FM: DIstilling Multimodal and Efficient Foundation Models

03/31/2023
by Ximeng Sun, et al.

Large Vision-Language Foundation Models (VLFMs), such as CLIP, ALIGN and Florence, are trained on large-scale datasets of image-caption pairs and achieve superior transferability and robustness on downstream tasks, but they are difficult to use in many practical applications due to their large size, high latency and fixed architectures. Unfortunately, recent work shows that training a small, custom VLFM for resource-limited applications on public, smaller-scale data is currently very difficult. In this paper, we introduce a new distillation mechanism (DIME-FM) that transfers the knowledge contained in large VLFMs to smaller, customized foundation models using a relatively small amount of inexpensive, unpaired images and sentences. We transfer the knowledge from the pre-trained CLIP-ViT-L/14 model to a ViT-B/32 model, using only 40M public images and 28.4M unpaired public sentences. The resulting model, "Distill-ViT-B/32", rivals the CLIP-ViT-B/32 model pre-trained on the private WiT dataset (400M image-text pairs): Distill-ViT-B/32 achieves similar zero-shot and linear-probing performance on both ImageNet and ELEVATER (a benchmark of 20 image-classification tasks), and it displays comparable robustness when evaluated on five datasets with natural distribution shifts from ImageNet.
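The abstract does not spell out the training objective, but the core idea, distilling a large VLFM into a small image encoder using unpaired images and text, can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's exact method: it assumes a frozen CLIP-style teacher exposing `encode_image`/`encode_text`, embeds a batch of unpaired sentences with the teacher's text tower, and trains the student to reproduce the teacher's image-to-text similarity distribution via a KL divergence. The function names, the temperature `tau`, and the choice of KL loss are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def teacher_embeddings(teacher, images, text_tokens):
    # Frozen teacher: L2-normalized image and text embeddings.
    t_img = F.normalize(teacher.encode_image(images), dim=-1)
    t_txt = F.normalize(teacher.encode_text(text_tokens), dim=-1)
    return t_img, t_txt

def distill_loss(student, teacher, images, text_tokens, tau=0.04):
    """KL divergence between the teacher's and the student's image-to-text
    similarity distributions over a batch of unpaired sentences.

    `images` and `text_tokens` need not be paired: the sentences only
    serve as anchor points in the teacher's embedding space.
    """
    t_img, t_txt = teacher_embeddings(teacher, images, text_tokens)
    s_img = F.normalize(student(images), dim=-1)
    # Rows index images; columns index the unpaired sentences.
    t_logits = t_img @ t_txt.t() / tau
    s_logits = s_img @ t_txt.t() / tau
    return F.kl_div(F.log_softmax(s_logits, dim=-1),
                    F.softmax(t_logits, dim=-1),
                    reduction="batchmean")

# Typical training step (optimizer and data loaders assumed):
#   loss = distill_loss(student, teacher, images, text_tokens)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```

Because the text embeddings come from the frozen teacher, the student's image features land in the teacher's joint embedding space, which is what preserves zero-shot classification after distillation.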


Related research

06/20/2023 · RS5M: A Large Scale Vision-Language Dataset for Remote Sensing Vision-Language Foundation Model
Pre-trained Vision-Language Foundation Models utilizing extensive image-...

11/22/2021 · Florence: A New Foundation Model for Computer Vision
Automated visual understanding of our diverse and open world demands com...

04/05/2023 · Towards Efficient Task-Driven Model Reprogramming with Foundation Models
Vision foundation models exhibit impressive power, benefiting from the e...

08/21/2023 · Foundation Model-oriented Robustness: Robust Image Model Evaluation with Pretrained Models
Machine learning has demonstrated remarkable performance over finite dat...

09/14/2023 · Efficiently Robustify Pre-trained Models
A recent trend in deep learning algorithms has been towards training lar...

11/19/2021 · Combined Scaling for Zero-shot Transfer Learning
We present a combined scaling method called BASIC that achieves 85.7 zer...

05/20/2023 · AnyPredict: Foundation Model for Tabular Prediction
Foundation models are pre-trained on massive data to perform well across...
