Vector Quantized Diffusion Model for Text-to-Image Synthesis

11/29/2021
by   Shuyang Gu, et al.

We present the vector quantized diffusion (VQ-Diffusion) model for text-to-image generation. The method is based on a vector quantized variational autoencoder (VQ-VAE) whose latent space is modeled by a conditional variant of the recently developed Denoising Diffusion Probabilistic Model (DDPM). We find that this latent-space method is well suited to text-to-image generation because it not only eliminates the unidirectional bias of existing methods but also lets us incorporate a mask-and-replace diffusion strategy that avoids the accumulation of errors, a serious problem with existing methods. Our experiments show that VQ-Diffusion produces significantly better text-to-image generation results than conventional autoregressive (AR) models with similar numbers of parameters. Compared with previous GAN-based text-to-image methods, VQ-Diffusion can handle more complex scenes and improves the synthesized image quality by a large margin. Finally, we show that image generation in our method can be made highly efficient by reparameterization. With traditional AR methods, the text-to-image generation time increases linearly with the output image resolution and is therefore quite time-consuming even for normal-size images. VQ-Diffusion allows us to achieve a better trade-off between quality and speed. Our experiments indicate that the VQ-Diffusion model with reparameterization is fifteen times faster than traditional AR methods while achieving better image quality.
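
The mask-and-replace strategy named in the abstract can be illustrated with a small sketch of one forward corruption step on discrete VQ-VAE codebook indices. This is not the authors' implementation: the function names, the `MASK` sentinel value, and the fixed per-step probabilities are assumptions made for illustration, and the abstract does not specify the probability schedule. The idea shown is that each unmasked token is kept, replaced by a random codebook index, or turned into a special mask token, and that masked tokens stay masked.

```python
import random

MASK = -1  # hypothetical id for the special [MASK] token (assumption)


def mask_and_replace_step(tokens, codebook_size, keep_prob, mask_prob, rng):
    """Apply one forward corruption step to a list of codebook indices.

    Each unmasked token is kept with probability keep_prob, masked with
    probability mask_prob, and otherwise replaced by a uniformly drawn
    codebook index. Tokens that are already masked stay masked.
    """
    out = []
    for tok in tokens:
        if tok == MASK:
            out.append(MASK)  # the mask state is absorbing
            continue
        u = rng.random()
        if u < keep_prob:
            out.append(tok)
        elif u < keep_prob + mask_prob:
            out.append(MASK)
        else:
            out.append(rng.randrange(codebook_size))
    return out


def corrupt(tokens, codebook_size, steps, keep_prob=0.9, mask_prob=0.08, seed=0):
    """Run several corruption steps with constant probabilities (a simplification)."""
    rng = random.Random(seed)
    for _ in range(steps):
        tokens = mask_and_replace_step(tokens, codebook_size, keep_prob, mask_prob, rng)
    return tokens


if __name__ == "__main__":
    latent = [3, 1, 4, 1, 5, 9, 2, 6]  # toy sequence of codebook indices
    print(corrupt(latent, codebook_size=16, steps=20))
```

In a trained model, a network would learn to reverse this corruption conditioned on the text; the mask token tells that network which positions carry no information, while replaced tokens force it to also reconsider positions that look valid.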

research
10/05/2022

Progressive Denoising Model for Fine-Grained Text-to-Image Generation

Recently, vector quantized autoregressive (VQ-AR) models have shown rema...
research
07/20/2022

Diffsound: Discrete Diffusion Model for Text-to-sound Generation

Generating sound effects that humans want is an important topic. However...
research
08/19/2022

Vector Quantized Diffusion Model with CodeUnet for Text-to-Sign Pose Sequences Generation

Sign Language Production (SLP) aims to translate spoken languages into s...
research
04/10/2023

Binary Latent Diffusion

In this paper, we show that a binary latent space can be explored for co...
research
07/31/2023

DiffProsody: Diffusion-based Latent Prosody Generation for Expressive Speech Synthesis with Prosody Conditional Adversarial Training

Expressive text-to-speech systems have undergone significant advancement...
research
12/01/2021

Translation-equivariant Image Quantizer for Bi-directional Image-Text Generation

Recently, vector-quantized image modeling has demonstrated impressive pe...
research
03/19/2022

ALAP-AE: As-Lite-as-Possible Auto-Encoder

We present a novel algorithm to reduce tensor compute required by a cond...
