Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models

by Shihao Zhao, et al.

Text-to-image diffusion models have made tremendous progress over the past two years, enabling the generation of highly realistic images from open-domain text descriptions. However, despite their success, text descriptions often struggle to adequately convey detailed controls, even when composed of long and complex sentences. Moreover, recent studies have shown that these models face challenges in understanding such complex texts and generating the corresponding images. Therefore, there is a growing need for control modes beyond text description. In this paper, we introduce Uni-ControlNet, a novel approach that allows for the simultaneous utilization of different local controls (e.g., edge maps, depth maps, segmentation masks) and global controls (e.g., CLIP image embeddings) in a flexible and composable manner within one model. Unlike existing methods, Uni-ControlNet only requires the fine-tuning of two additional adapters upon frozen pre-trained text-to-image diffusion models, eliminating the huge cost of training from scratch. Moreover, thanks to dedicated adapter designs, Uni-ControlNet requires only a constant number (i.e., 2) of adapters, regardless of the number of local or global controls used. This not only reduces the fine-tuning cost and model size, making it more suitable for real-world deployment, but also facilitates the composability of different conditions. Through both quantitative and qualitative comparisons, Uni-ControlNet demonstrates its superiority over existing methods in terms of controllability, generation quality and composability. Code is available at <https://github.com/ShihaoZhaoZSH/Uni-ControlNet>.
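The key design point of the abstract, that a single shared local adapter and a single shared global adapter suffice no matter how many conditions are supplied, can be sketched with a toy NumPy example. All shapes, the slot layout, and the weight names (`W_local`, `W_global`) are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

H, W, C = 8, 8, 4   # toy spatial size and feature width (illustrative)
N_SLOTS = 3         # supported local conditions, e.g. edges, depth, segmentation
rng = np.random.default_rng(0)

W_local = rng.normal(size=(N_SLOTS, C))   # the ONE shared local adapter
W_global = rng.normal(size=(16, C))       # the ONE shared global adapter

def local_features(conditions):
    """Stack the provided local condition maps into fixed slots
    (missing conditions stay zero) and project them with the single
    shared local adapter. Output shape is constant regardless of how
    many conditions were actually supplied."""
    stacked = np.zeros((N_SLOTS, H, W))
    for slot, cond in conditions.items():
        stacked[slot] = cond
    return np.einsum('nhw,nc->hwc', stacked, W_local)  # (H, W, C)

def global_tokens(embedding):
    """Project a CLIP-style image embedding into extra context features
    with the single shared global adapter."""
    return embedding @ W_global  # (C,)

edges = rng.random((H, W))
depth = rng.random((H, W))
feat_one = local_features({0: edges})            # one condition
feat_two = local_features({0: edges, 1: depth})  # two conditions, same adapter
assert feat_one.shape == feat_two.shape == (H, W, C)
```

Because the adapter input has fixed slots, adding another condition changes neither the parameter count nor the output shape, which mirrors the constant model-size claim; the linear toy adapter also composes conditions additively, loosely echoing the composability property.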


