Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts

by Zhi-Yi Chin, et al.
National Chiao Tung University

Text-to-image diffusion models, e.g., Stable Diffusion (SD), have lately shown remarkable ability in high-quality content generation and have become representative of the recent wave of transformative AI. Nevertheless, such advances come with intensifying concern about the misuse of this generative technology, especially for producing copyrighted or NSFW (i.e., not safe for work) images. Although efforts have been made to filter inappropriate images/prompts or to remove undesirable concepts/styles via model fine-tuning, the reliability of these safety mechanisms against diverse problematic prompts remains largely unexplored. In this work, we propose Prompting4Debugging (P4D), a debugging and red-teaming tool that automatically finds problematic prompts for diffusion models in order to test the reliability of deployed safety mechanisms. We demonstrate the efficacy of our P4D tool in uncovering new vulnerabilities of SD models equipped with safety mechanisms. In particular, our results show that around half of the prompts in existing safe-prompting benchmarks that were originally considered "safe" can actually be manipulated to bypass many deployed safety mechanisms, including concept removal, negative prompts, and safety guidance. Our findings suggest that, without comprehensive testing, evaluations on limited safe-prompting benchmarks can lead to a false sense of safety for text-to-image models.
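To make the core idea concrete, here is a minimal, hedged sketch of the kind of optimization such a red-teaming tool can perform. The assumption (not taken from this abstract) is that one optimizes a continuous prompt embedding by gradient descent so that the *safety-equipped* model's noise prediction matches the noise prediction the *unconstrained* model produces for a problematic prompt. Real P4D operates on Stable Diffusion's text encoder and U-Net; here both models are stand-in linear maps (`W_safe`, `W_unconstrained` are hypothetical) so the loop is self-contained and runnable.

```python
import random

DIM = 4  # toy embedding dimension; real prompt embeddings are much larger

def matvec(W, x):
    """Multiply a DIM x DIM matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

random.seed(0)
# Stand-ins for the unconstrained and the safety-equipped denoisers.
W_unconstrained = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]
W_safe = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]

# A "problematic" prompt embedding fed to the unconstrained model yields the
# target noise prediction that the safe model is supposed never to reproduce.
p_original = [1.0, -0.5, 0.3, 0.8]
target = matvec(W_unconstrained, p_original)

def loss(e):
    """Squared distance between the safe model's output and the target."""
    diff = [a - b for a, b in zip(matvec(W_safe, e), target)]
    return sum(d * d for d in diff)

# Optimize a fresh embedding AGAINST the safe model to recover the target,
# i.e., search for a prompt that bypasses the safety mechanism.
e = [0.0] * DIM
lr = 0.05
initial = loss(e)
for _ in range(500):
    diff = [a - b for a, b in zip(matvec(W_safe, e), target)]
    # Gradient of ||W_safe e - target||^2 w.r.t. e is 2 * W_safe^T diff.
    grad = [2 * sum(W_safe[r][c] * diff[r] for r in range(DIM)) for c in range(DIM)]
    e = [ei - lr * gi for ei, gi in zip(e, grad)]
final = loss(e)
```

If `final` drops far below `initial`, the optimized embedding drives the safety-equipped model toward the forbidden output; in the real setting the embedding is then projected back to discrete tokens to obtain a human-readable problematic prompt.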



