GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation

03/17/2023
by Can Qin, et al.

Text-to-image (T2I) models based on diffusion processes have achieved remarkable success in controllable image generation from user-provided captions. However, the tight coupling between the text encoder and the image decoder in current T2I models makes either component challenging to replace or upgrade: such changes often require massive fine-tuning, or even training from scratch, at prohibitive expense. To address this problem, we propose GlueGen, which applies a newly proposed GlueNet model to align features from single-modal or multi-modal encoders with the latent space of an existing T2I model. The approach introduces a new training objective that leverages parallel corpora to align the representation spaces of different encoders. Empirical results show that GlueNet can be trained efficiently and enables capabilities beyond previous state-of-the-art models: 1) multilingual language models such as XLM-Roberta can be aligned with existing T2I models, allowing the generation of high-quality images from captions in languages beyond English; 2) GlueNet can align multi-modal encoders such as AudioCLIP with the Stable Diffusion model, enabling sound-to-image generation; 3) it can also upgrade the existing text encoder of a latent diffusion model to handle challenging generation cases. By aligning diverse feature representations, GlueNet allows flexible and efficient integration of new functionality into existing T2I models and sheds light on X-to-image (X2I) generation.
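The alignment idea lends itself to a compact sketch. Below is a minimal PyTorch illustration of a GlueNet-style translator trained on parallel features; the architecture, dimensions, and plain MSE objective are simplifying assumptions for exposition, not the paper's exact design. In practice the source features would come from an encoder such as XLM-Roberta (or AudioCLIP for sound), and the targets from the frozen CLIP text encoder that Stable Diffusion conditions on.

```python
# Minimal sketch of a GlueNet-style feature translator (assumes PyTorch).
# Random tensors stand in for encoder outputs; names and sizes here are
# illustrative, not the paper's exact architecture or objective.
import torch
import torch.nn as nn

class GlueNetTranslator(nn.Module):
    """Maps source-encoder token features into the target encoder's space."""
    def __init__(self, src_dim: int, tgt_dim: int, hidden_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(src_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, tgt_dim),
        )

    def forward(self, src_feats: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, src_dim) -> (batch, seq_len, tgt_dim)
        return self.net(src_feats)

translator = GlueNetTranslator(src_dim=768, tgt_dim=768)
optimizer = torch.optim.AdamW(translator.parameters(), lr=1e-4)

# One training step on a parallel pair: the same caption encoded by both
# encoders yields paired features, and an L2 loss pulls the translated
# source features toward the frozen target features.
src_feats = torch.randn(8, 77, 768)  # stand-in for XLM-Roberta features
tgt_feats = torch.randn(8, 77, 768)  # stand-in for frozen CLIP features

loss = nn.functional.mse_loss(translator(src_feats), tgt_feats)
loss.backward()
optimizer.step()
```

Once trained, the translated features can stand in for the diffusion model's native text features at inference time; with the diffusers library, for instance, they can be passed through the pipeline's prompt_embeds argument (the checkpoint id below is an example):

```python
from diffusers import StableDiffusionPipeline

# SD v1.x conditions on (batch, 77, 768) CLIP text embeddings, which the
# translator output above matches by construction.
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
with torch.no_grad():
    embeds = translator(src_feats[:1])
image = pipe(prompt_embeds=embeds).images[0]
```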

Related research

Align, Adapt and Inject: Sound-guided Unified Image Generation (06/20/2023)
Text-guided image generation has witnessed unprecedented progress due to...

SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models (05/09/2023)
Diffusion models, which have emerged to become popular text-to-image gen...

A Method for Training-free Person Image Picture Generation (05/16/2023)
The current state-of-the-art Diffusion model has demonstrated excellent ...

MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation (02/16/2023)
Recent advances in text-to-image generation with diffusion models presen...

Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback (07/10/2023)
The field of text-conditioned image generation has made unparalleled pro...

LAION-5B: An open large-scale dataset for training next generation image-text models (10/16/2022)
Groundbreaking language-vision architectures like CLIP and DALL-E proved...

Text encoders are performance bottlenecks in contrastive vision-language models (05/24/2023)
Performant vision-language (VL) models like CLIP represent captions usin...
