SneakyPrompt: Evaluating Robustness of Text-to-image Generative Models' Safety Filters

05/20/2023
by Yuchen Yang, et al.

Text-to-image generative models such as Stable Diffusion and DALL·E 2 have attracted much attention since their release due to their wide application in the real world. One challenging problem with text-to-image generative models is the generation of Not-Safe-for-Work (NSFW) content, e.g., content related to violence or adult themes. A common practice is therefore to deploy a so-called safety filter, which blocks NSFW content based on either text or image features. Prior works have studied possible bypasses of such safety filters. However, existing works are largely manual and specific to Stable Diffusion's official safety filter. Moreover, the bypass ratio against Stable Diffusion's safety filter is as low as 23.51%. In this paper, we propose the first automated attack framework, called SneakyPrompt, to evaluate the robustness of real-world safety filters in state-of-the-art text-to-image generative models. Our key insight is to search for alternative tokens in a prompt that generates NSFW images, so that the resulting prompt (called an adversarial prompt) bypasses existing safety filters. Specifically, SneakyPrompt utilizes reinforcement learning (RL) to guide an agent with positive rewards for semantic similarity and bypass success. Our evaluation shows that SneakyPrompt successfully generates NSFW content using the online model DALL·E 2 with its default, closed-box safety filter enabled. We also deploy several open-source state-of-the-art safety filters on a Stable Diffusion model and show that SneakyPrompt not only successfully generates NSFW content but also outperforms existing adversarial attacks in terms of the number of queries and image quality.
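The abstract describes SneakyPrompt's search loop only at a high level. As a rough illustration (not the authors' implementation), the sketch below shows an epsilon-greedy, bandit-style token substitution in Python. Here `semantic_similarity` and `query_safety_filter` are hypothetical stand-ins for a real text-embedding model and a real query against the target model's safety filter, and the single-slot greedy agent is a deliberate simplification of the paper's RL agent.

```python
import random

# --- Hypothetical stubs (assumptions, not the paper's code) ---------------
# A real attack would embed prompts with a text encoder and query the
# target text-to-image model's safety filter, respectively.

def semantic_similarity(prompt_a: str, prompt_b: str) -> float:
    """Placeholder similarity in [0, 1]; a real version would use embeddings."""
    a, b = set(prompt_a.split()), set(prompt_b.split())
    return len(a & b) / max(len(a | b), 1)

def query_safety_filter(prompt: str) -> bool:
    """Placeholder: True means the model generates an image (filter bypassed)."""
    return "forbidden" not in prompt  # stand-in for an online-model query

# --- Simplified RL-style search: epsilon-greedy bandit over one token slot -

def search_adversarial_prompt(prompt, slot, candidates,
                              max_queries=50, epsilon=0.2, bypass_bonus=1.0):
    tokens = prompt.split()
    q = {c: 0.0 for c in candidates}  # running mean reward per candidate
    n = {c: 0 for c in candidates}    # pull counts per candidate
    best = None
    for _ in range(max_queries):
        # Explore a random replacement token, or exploit the current best.
        c = random.choice(candidates) if random.random() < epsilon \
            else max(q, key=q.get)
        trial = tokens.copy()
        trial[slot] = c
        adv = " ".join(trial)
        bypassed = query_safety_filter(adv)
        # Reward = semantic similarity to the original prompt,
        # plus a bonus when the safety filter is bypassed.
        reward = semantic_similarity(prompt, adv) \
            + (bypass_bonus if bypassed else 0.0)
        n[c] += 1
        q[c] += (reward - q[c]) / n[c]  # incremental mean update
        if bypassed and (best is None or reward > best[0]):
            best = (reward, adv)
    return best[1] if best else None

if __name__ == "__main__":
    # Toy demo: replace the blocked token at slot 3 with a benign alias.
    adv = search_adversarial_prompt(
        "a photo of forbidden scene", slot=3,
        candidates=["mystery", "hidden", "secret"])
    print(adv)
```

The reward shaping mirrors the abstract's two objectives: the similarity term keeps the adversarial prompt close in meaning to the original, while the bypass bonus steers the search toward substitutions that slip past the filter.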

Related research

10/03/2022 · Red-Teaming the Stable Diffusion Safety Filter
Stable Diffusion is a recent open-source image generation model comparab...

03/29/2023 · A Pilot Study of Query-Free Adversarial Attack against Stable Diffusion
Despite the record-breaking performance in Text-to-Image (T2I) generatio...

06/09/2023 · Safety and Fairness for Content Moderation in Generative Models
With significant advances in generative AI, new technologies are rapidly...

05/22/2023 · Adversarial Nibbler: A Data-Centric Challenge for Improving the Safety of Text-to-Image Models
The generative AI revolution in recent years has been spurred by an expa...

09/06/2023 · Certifying LLM Safety against Adversarial Prompting
Large language models (LLMs) released for public use incorporate guardra...

10/17/2022 · Deepfake Text Detection: Limitations and Opportunities
Recent advances in generative models for language have enabled the creat...

08/08/2023 · FLIRT: Feedback Loop In-context Red Teaming
Warning: this paper contains content that may be inappropriate or offens...
