Visual Adversarial Examples Jailbreak Large Language Models

by Xiangyu Qi, et al.

Recently, there has been a surge of interest in introducing vision into Large Language Models (LLMs). The proliferation of large Visual Language Models (VLMs), such as Flamingo, BLIP-2, and GPT-4, signifies an exciting convergence of advancements in both visual and language foundation models. Yet, the risks associated with this integrative approach are largely unexamined. In this paper, we shed light on the security and safety implications of this trend. First, we underscore that the continuous and high-dimensional nature of the additional visual input space intrinsically makes it a fertile ground for adversarial attacks. This unavoidably expands the attack surfaces of LLMs. Second, we highlight that the broad functionality of LLMs also presents visual attackers with a wider array of achievable adversarial objectives, extending the implications of security failures beyond mere misclassification. To elucidate these risks, we study adversarial examples in the visual input space of a VLM. Specifically, against MiniGPT-4, which incorporates safety mechanisms that can refuse harmful instructions, we present visual adversarial examples that can circumvent the safety mechanisms and provoke harmful behaviors of the model. Remarkably, we discover that adversarial examples, even if optimized on a narrow, manually curated derogatory corpus against specific social groups, can universally jailbreak the model's safety mechanisms. A single such adversarial example can generally undermine MiniGPT-4's safety, enabling it to heed a wide range of harmful instructions and produce harmful content far beyond simply imitating the derogatory corpus used in optimization. Unveiling these risks, we accentuate the urgent need for comprehensive risk assessments, robust defense strategies, and the implementation of responsible practices for the secure and safe utilization of VLMs.


