Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

05/23/2022
by   Chitwan Saharia, et al.

We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. Our key discovery is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model. Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO dataset, without ever training on COCO, and human raters find Imagen samples to be on par with the COCO data itself in image-text alignment. To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models. With DrawBench, we compare Imagen with recent methods including VQ-GAN+CLIP, Latent Diffusion Models, and DALL-E 2, and find that human raters prefer Imagen over other models in side-by-side comparisons, both in terms of sample quality and image-text alignment. See https://imagen.research.google/ for an overview of the results.
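For context, the FID (Fréchet Inception Distance) quoted above is conventionally defined as the Fréchet distance between two Gaussians fitted to Inception-network features of real and generated images; lower values indicate that the two feature distributions are closer. The abstract does not spell out the formula, so the symbols below are the standard ones rather than the authors' notation: \mu_r, \Sigma_r are the mean and covariance of features from reference images, and \mu_g, \Sigma_g are those of generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

The first term measures how far apart the feature means are, and the trace term measures the mismatch between the covariances; both vanish when the two Gaussians coincide.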

research
10/27/2022

ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts

Recent progress in diffusion models has revolutionized the popular techn...
research
01/23/2023

StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis

Text-to-image synthesis has recently seen significant progress thanks to...
research
06/22/2022

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

We present the Pathways Autoregressive Text-to-Image (Parti) model, whic...
research
05/18/2023

LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation

Existing automatic evaluation on text-to-image synthesis can only provid...
research
07/10/2023

Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback

The field of text-conditioned image generation has made unparalleled pro...
research
05/24/2023

I Spy a Metaphor: Large Language Models and Diffusion Models Co-Create Visual Metaphors

Visual metaphors are powerful rhetorical devices used to persuade or com...
research
11/14/2022

Extreme Generative Image Compression by Learning Text Embedding from Diffusion Models

Transferring large amount of high resolution images over limited bandwid...
