COPA: Efficient Vision-Language Pre-training Through Collaborative Object- and Patch-Text Alignment

08/07/2023
by   Chaoya Jiang, et al.

Vision-Language Pre-training (VLP) methods based on object detection benefit from rich fine-grained object-text alignment, but at the cost of computationally expensive inference. Recent Vision Transformer (ViT)-based approaches avoid this cost, yet struggle with long visual sequences that lack detailed cross-modal alignment information. This paper introduces a ViT-based VLP technique that efficiently incorporates object information through a novel patch-text alignment mechanism. Specifically, we convert object-level signals into patch-level ones and devise a Patch-Text Alignment pre-training task (PTA) to learn a text-aware patch detector. Using off-the-shelf object annotations for only 5% of the training images, we train PTA jointly with other conventional VLP objectives in an end-to-end manner, bypassing the high computational cost of object detection while yielding an effective patch detector that accurately identifies text-relevant patches. This considerably shortens the patch sequence and accelerates computation within the ViT backbone. Experiments on a variety of widely used benchmarks show that our method achieves a speedup of nearly 88% over prior VLP models while maintaining competitive or superior performance on downstream tasks at a similar model size and data scale.
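The core efficiency idea above — score each visual patch against the text and keep only the text-relevant ones before the expensive ViT layers — can be sketched as follows. This is a minimal illustration, not the paper's implementation: it stands in for the learned text-aware patch detector with a simple cosine-similarity score, and the function name, `keep_ratio` parameter, and shapes are assumptions for illustration.

```python
import numpy as np

def select_text_relevant_patches(patch_embeds, text_embed, keep_ratio=0.5):
    """Keep only the patches most similar to a pooled text embedding.

    patch_embeds: (batch, num_patches, dim) visual patch embeddings
    text_embed:   (batch, dim) pooled text representation
    Returns a shortened patch sequence of shape (batch, k, dim).
    """
    # Cosine similarity between every patch and the text embedding acts as
    # a stand-in score for the learned text-aware patch detector.
    p = patch_embeds / np.linalg.norm(patch_embeds, axis=-1, keepdims=True)
    t = text_embed / np.linalg.norm(text_embed, axis=-1, keepdims=True)
    scores = np.einsum("bnd,bd->bn", p, t)          # (batch, num_patches)

    # Keep the top-scoring fraction of patches, preserving spatial order.
    k = max(1, int(patch_embeds.shape[1] * keep_ratio))
    top = np.argsort(-scores, axis=1)[:, :k]
    top = np.sort(top, axis=1)
    return np.take_along_axis(patch_embeds, top[..., None], axis=1)

# Example: prune a 196-patch (14x14) sequence down to a quarter of its length.
pruned = select_text_relevant_patches(
    np.random.randn(2, 196, 64), np.random.randn(2, 64), keep_ratio=0.25
)
print(pruned.shape)  # (2, 49, 64)
```

Shortening the sequence from 196 to 49 patches cuts the quadratic self-attention cost in the remaining ViT layers, which is the source of the reported inference speedup.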


Related research

- Fine-Grained Semantically Aligned Vision-Language Pre-Training (08/04/2022): Large-scale vision-language pre-training has shown impressive advances i...
- BUS: Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization (07/17/2023): Vision Transformer (ViT) based Vision-Language Pre-training (VLP) models...
- GridCLIP: One-Stage Object Detection by Grid-Level CLIP Representation Learning (03/16/2023): A vision-language foundation model pretrained on very large-scale image-...
- Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training (10/14/2022): Large-scale vision-language pre-trained (VLP) models are prone to halluc...
- Replacement as a Self-supervision for Fine-grained Vision-language Pre-training (03/09/2023): Fine-grained supervision based on object annotations has been widely use...
- IDEA: Increasing Text Diversity via Online Multi-Label Recognition for Vision-Language Pre-training (07/12/2022): Vision-Language Pre-training (VLP) with large-scale image-text pairs has...
- DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment (04/10/2023): This paper presents DetCLIPv2, an efficient and scalable training framew...
