Unleashing Text-to-Image Diffusion Models for Visual Perception

03/03/2023
by   Wenliang Zhao, et al.

Diffusion models (DMs) have become the new trend in generative modeling and have demonstrated powerful conditional synthesis abilities. Among these, text-to-image diffusion models pre-trained on large-scale image-text pairs are highly controllable through customizable prompts. Unlike unconditional generative models, which focus on low-level attributes and details, text-to-image diffusion models contain more high-level knowledge thanks to vision-language pre-training. In this paper, we propose VPD (Visual Perception with a pre-trained Diffusion model), a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model for visual perception tasks. Instead of using the pre-trained denoising autoencoder in a diffusion-based pipeline, we simply use it as a backbone and study how to take full advantage of the learned knowledge. Specifically, we prompt the denoising decoder with proper textual inputs and refine the text features with an adapter, leading to better alignment with the pre-training stage and enabling the visual contents to interact with the text prompts. We further propose to utilize the cross-attention maps between the visual features and the text features to provide explicit guidance. Compared with other pre-training methods, we show that vision-language pre-trained diffusion models can be adapted faster to downstream visual perception tasks using the proposed VPD. Extensive experiments on semantic segmentation, referring image segmentation, and depth estimation demonstrate the effectiveness of our method. Notably, VPD attains 0.254 RMSE on NYUv2 depth estimation and 73.3% oIoU on RefCOCO-val referring image segmentation, establishing new records on these two benchmarks. Code is available at https://github.com/wl-zhao/VPD
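
To make the pipeline concrete, below is a minimal sketch of the VPD idea using the Hugging Face diffusers and transformers libraries: a single denoising pass of a pre-trained Stable Diffusion UNet over clean image latents, conditioned on adapter-refined text features, with forward hooks harvesting multi-scale features for a downstream perception head. The checkpoint name, hook placement, and the small residual text adapter are illustrative assumptions here, not the authors' exact implementation (see the linked repository for that).

```python
# Minimal sketch of using a text-to-image diffusion UNet as a perception
# backbone. Assumes the `diffusers` / `transformers` packages and the
# "runwayml/stable-diffusion-v1-5" checkpoint (an assumption, not the
# paper's prescribed setup).
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "runwayml/stable-diffusion-v1-5"
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")

# Hypothetical text adapter: a small residual MLP that refines the CLIP
# text features so they align better with the perception task.
text_adapter = torch.nn.Sequential(
    torch.nn.Linear(768, 768), torch.nn.GELU(), torch.nn.Linear(768, 768)
)

features = {}  # multi-scale UNet feature maps, keyed by block name

def save_feature(name):
    def hook(module, inputs, output):
        # Some UNet blocks return tuples; keep the main hidden states.
        features[name] = output[0] if isinstance(output, tuple) else output
    return hook

# Hook the decoder (up) blocks; cross-attention maps could be captured
# analogously by hooking the attention modules inside these blocks.
for i, block in enumerate(unet.up_blocks):
    block.register_forward_hook(save_feature(f"up_{i}"))

@torch.no_grad()
def extract_features(image, prompt):
    """image: (B, 3, H, W) in [-1, 1]; prompt: list of B strings."""
    latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor
    ids = tokenizer(prompt, padding="max_length", truncation=True,
                    max_length=tokenizer.model_max_length,
                    return_tensors="pt").input_ids
    text_feat = text_encoder(ids).last_hidden_state   # (B, 77, 768)
    text_feat = text_feat + text_adapter(text_feat)   # residual refinement
    # One forward pass at t = 0: the UNet serves as a feature extractor,
    # not a denoiser, so no noise is added to the latents.
    unet(latents, timestep=0, encoder_hidden_states=text_feat)
    return dict(features)
```

A task head (e.g. for segmentation or depth) would then consume the multi-scale feature maps returned by extract_features; only the adapter and the head need task-specific training.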


Related research

12/02/2021 · DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting
Recent progress has shown that large-scale pre-training using contrastiv...

03/25/2023 · Freestyle Layout-to-Image Synthesis
Typical layout-to-image synthesis (LIS) models generate images for a clo...

03/17/2023 · DiffusionSeg: Adapting Diffusion Towards Unsupervised Object Discovery
Learning from a large corpus of data, pre-trained models have achieved i...

05/25/2023 · Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion Models
Text-to-image (T2I) research has grown explosively in the past year, owi...

03/25/2023 · IFSeg: Image-free Semantic Segmentation via Vision-Language Model
Vision-language (VL) pre-training has recently gained much attention for...

08/21/2023 · Diffusion Model as Representation Learner
Diffusion Probabilistic Models (DPMs) have recently demonstrated impress...

06/01/2023 · DeepFake-Adapter: Dual-Level Adapter for DeepFake Detection
Existing deepfake detection methods fail to generalize well to unseen or...
