Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling

11/23/2021
by Zhengyuan Yang, et al.

In this paper, we propose UNICORN, a vision-language (VL) model that unifies text generation and bounding box prediction in a single architecture. Specifically, we quantize each box into four discrete box tokens and serialize them as a sequence that can be integrated with text tokens. We formulate all VL problems as a generation task in which the target sequence consists of the integrated text and box tokens, and we train a transformer encoder-decoder to predict the target auto-regressively. With this unified framework and input-output format, UNICORN achieves performance comparable to the task-specific state of the art on 7 VL benchmarks covering visual grounding, grounded captioning, visual question answering, and image captioning. When trained with multi-task finetuning, UNICORN can approach different VL tasks with a single set of parameters, thus crossing the downstream task boundary. We show that having a single model not only saves parameters but also further boosts performance on certain tasks. Finally, UNICORN shows the capability of generalizing to new tasks such as ImageNet object localization.
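
The box serialization described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the number of bins (`NUM_BINS`), the placement of box tokens after a text vocabulary of size `TEXT_VOCAB_SIZE`, the corner-coordinate box format, and all function names are assumptions chosen for illustration only.

```python
from typing import List, Sequence, Tuple

NUM_BINS = 1000          # assumption: number of discrete bins per coordinate
TEXT_VOCAB_SIZE = 30522  # assumption: box token ids are placed after the text vocabulary

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels, assumed format


def quantize_box(box: Box, width: float, height: float,
                 num_bins: int = NUM_BINS) -> List[int]:
    """Map each of the four box coordinates to an integer bin index."""
    def to_bin(value: float, size: float) -> int:
        # Normalize to [0, 1], clamp, then scale to [0, num_bins - 1].
        frac = min(max(value / size, 0.0), 1.0)
        return min(int(frac * num_bins), num_bins - 1)

    x1, y1, x2, y2 = box
    return [to_bin(x1, width), to_bin(y1, height),
            to_bin(x2, width), to_bin(y2, height)]


def box_to_token_ids(box: Box, width: float, height: float,
                     num_bins: int = NUM_BINS,
                     offset: int = TEXT_VOCAB_SIZE) -> List[int]:
    """Turn one box into four token ids that live in a shared vocabulary."""
    return [offset + b for b in quantize_box(box, width, height, num_bins)]


def token_ids_to_box(ids: Sequence[int], width: float, height: float,
                     num_bins: int = NUM_BINS,
                     offset: int = TEXT_VOCAB_SIZE) -> Box:
    """Approximately invert box_to_token_ids using the center of each bin."""
    sizes = (width, height, width, height)
    x1, y1, x2, y2 = ((i - offset + 0.5) / num_bins * s
                      for i, s in zip(ids, sizes))
    return (x1, y1, x2, y2)


def build_target(text_ids: Sequence[int], box: Box,
                 width: float, height: float) -> List[int]:
    """Serialize text tokens followed by the four box tokens as one target sequence."""
    return list(text_ids) + box_to_token_ids(box, width, height)
```

Under these assumptions, a decoder that emits ids from the combined vocabulary can produce text and boxes in the same output sequence, and decoding a box loses at most one bin width of precision per coordinate.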


