InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

05/11/2023
by   Wenliang Dai, et al.

General-purpose language models that can solve various language-domain tasks have emerged, driven by the pre-training and instruction-tuning pipeline. However, building general-purpose vision-language models is challenging due to the increased task discrepancy introduced by the additional visual input. Although vision-language pre-training has been widely studied, vision-language instruction tuning remains relatively less explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pre-trained BLIP-2 models. We gather a wide variety of 26 publicly available datasets, transform them into instruction tuning format, and categorize them into two clusters for held-in instruction tuning and held-out zero-shot evaluation. Additionally, we introduce instruction-aware visual feature extraction, a crucial method that enables the model to extract informative features tailored to the given instruction. The resulting InstructBLIP models achieve state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and the larger Flamingo. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA IMG). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models have been open-sourced at https://github.com/salesforce/LAVIS/tree/main/projects/instructblip.
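To make "instruction-aware visual feature extraction" concrete, below is a minimal PyTorch sketch of the idea: the learned query tokens attend jointly with the embedded instruction tokens through self-attention, and only the queries cross-attend to features from a frozen image encoder, so the extracted visual tokens depend on the instruction. The class name, dimensions, and layer counts here are illustrative assumptions, not the released architecture; the actual Q-Former (initialized from BLIP-2) is available in the LAVIS repository linked above.

```python
import torch
import torch.nn as nn


class InstructionAwareQFormer(nn.Module):
    """Schematic sketch: learned queries and instruction tokens share self-attention,
    and only the queries cross-attend to frozen image features."""

    def __init__(self, dim=768, num_queries=32, num_layers=2, num_heads=12):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.self_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_layers)]
        )
        self.cross_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_layers)]
        )
        self.ffn = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(num_layers)]
        )

    def forward(self, image_feats, instruction_embeds):
        # image_feats:        (B, N_img, dim) from a frozen image encoder (e.g., a ViT)
        # instruction_embeds: (B, N_txt, dim) embedded instruction tokens
        batch = image_feats.size(0)
        q = self.queries.expand(batch, -1, -1)
        for sa, ca, ffn in zip(self.self_attn, self.cross_attn, self.ffn):
            # Joint self-attention over [queries; instruction] conditions the
            # queries on the instruction before they look at the image.
            x = torch.cat([q, instruction_embeds], dim=1)
            x = sa(x, x, x)[0]
            q = x[:, : self.queries.size(1)]
            # Only the query tokens cross-attend to the frozen image features.
            q = q + ca(q, image_feats, image_feats)[0]
            q = q + ffn(q)
        # These instruction-aware query outputs are what gets projected into the LLM.
        return q


# Illustrative forward pass with random tensors.
qformer = InstructionAwareQFormer()
image_feats = torch.randn(2, 257, 768)        # ViT patch features for 2 images
instruction_embeds = torch.randn(2, 16, 768)  # 16 embedded instruction tokens
visual_tokens = qformer(image_feats, instruction_embeds)
print(visual_tokens.shape)  # torch.Size([2, 32, 768])
```

In InstructBLIP the image encoder and the LLM remain frozen during instruction tuning; only the Q-Former and its projection into the LLM are updated, which is why conditioning the query tokens on the instruction is where task information gets injected.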


Related research

04/17/2023 · Visual Instruction Tuning
Instruction tuning large language models (LLMs) using machine-generated ...

05/18/2023 · Aligning Instruction Tasks Unlocks Large Language Models as Zero-Shot Relation Extractors
Recent work has shown that fine-tuning large language models (LLMs) on l...

07/04/2023 · mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
Document understanding refers to automatically extract, analyze and comp...

07/31/2023 · Camoscio: an Italian Instruction-tuned LLaMA
In recent years Large Language Models (LLMs) have increased the state of...

05/24/2023 · PathAsst: Redefining Pathology through Generative Foundation AI Assistant for Pathology
As advances in large language models (LLMs) and multimodal techniques co...

08/08/2023 · Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions
Multimodal Large Language Models (MLLMs) have recently sparked significa...

01/31/2023 · The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
We study the design decisions of publicly available instruction tuning m...
