Generative Pretraining in Multimodality

07/11/2023
by Quan Sun, et al.

We present Emu, a Transformer-based multimodal foundation model, which can seamlessly generate images and texts in multimodal context. This omnivore model can take in any single-modality or multimodal data input indiscriminately (e.g., interleaved image, text and video) through a one-model-for-all autoregressive training process. First, visual signals are encoded into embeddings, and together with text tokens form an interleaved input sequence. Emu is then end-to-end trained with a unified objective of classifying the next text token or regressing the next visual embedding in the multimodal sequence. This versatile multimodality empowers the exploration of diverse pretraining data sources at scale, such as videos with interleaved frames and text, webpages with interleaved images and text, as well as web-scale image-text pairs and video-text pairs. Emu can serve as a generalist multimodal interface for both image-to-text and text-to-image tasks, and supports in-context image and text generation. Across a broad range of zero-shot/few-shot tasks including image captioning, visual question answering, video question answering and text-to-image generation, Emu demonstrates superb performance compared to state-of-the-art large multimodal models. Extended capabilities such as multimodal assistants via instruction tuning are also demonstrated with impressive performance.
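The abstract describes a single autoregressive objective over interleaved sequences: positions whose next element is a text token are trained with classification, while positions whose next element is a visual embedding are trained with regression. The following is a minimal, hypothetical sketch of such a unified loss; the module names, dimensions, mask convention, and the choice of an L2 regression term are illustrative assumptions, not Emu's actual implementation.

```python
# Hypothetical sketch of a unified autoregressive objective over an
# interleaved multimodal sequence: cross-entropy for next-text-token
# prediction, regression for next-visual-embedding prediction.
# All names and hyperparameters below are assumptions for illustration.
import torch
import torch.nn as nn

VOCAB_SIZE = 1000   # assumed text vocabulary size
HIDDEN_DIM = 512    # assumed Transformer hidden size

backbone = nn.TransformerEncoder(      # stand-in causal Transformer decoder
    nn.TransformerEncoderLayer(HIDDEN_DIM, nhead=8, batch_first=True),
    num_layers=2,
)
text_head = nn.Linear(HIDDEN_DIM, VOCAB_SIZE)    # next-token classification head
visual_head = nn.Linear(HIDDEN_DIM, HIDDEN_DIM)  # next-embedding regression head

def unified_loss(inputs, text_targets, visual_targets, predicts_text):
    """inputs: (B, T, H) interleaved text-token and visual embeddings.
    predicts_text: (B, T) bool mask, True where the *next* element is text.
    text_targets: (N_text,) token ids; visual_targets: (N_vis, H) embeddings."""
    T = inputs.size(1)
    # Causal mask: True entries are positions a query may NOT attend to.
    causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    hidden = backbone(inputs, mask=causal_mask)

    # Positions predicting a text token: classify over the vocabulary.
    text_logits = text_head(hidden[predicts_text])
    loss_text = nn.functional.cross_entropy(text_logits, text_targets)

    # Positions predicting a visual embedding: regress the embedding directly.
    visual_pred = visual_head(hidden[~predicts_text])
    loss_visual = nn.functional.mse_loss(visual_pred, visual_targets)

    return loss_text + loss_visual
```

Because both modalities contribute to one loss on one interleaved sequence, the same model can be trained end to end on image-text pairs, video frames with text, and interleaved webpages without modality-specific training branches.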


