Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models

07/17/2023
by   Huachuan Qiu, et al.
0

Researchers have invested considerable effort into ensuring that large language models (LLMs) align with human values, using various training techniques, such as instruction tuning and Reinforcement Learning from Human or AI Feedback (RLHF/RLAIF), to guard against text unsafety. However, these defenses remain incredibly vulnerable to some jailbreak attacks, which can cause the model to become overly defensive to sensitive topics or still generate harmful content, leaving the model performance particularly fragile. Therefore, to comprehensively study text safety and output robustness, we propose a latent jailbreak prompt dataset, each involving malicious instruction embedding. Specifically, we instruct the model to complete a regular task, such as translation, where the text to be translated contains malicious instructions. To further analyze the safety and robustness, we design a hierarchical annotation framework. We present a systematic analysis of the safety and robustness of LLMs concerning the position of explicit normal instructions, word replacement (verbs in explicit normal instructions, target groups in malicious instructions, cue words in malicious instructions), and instruction replacement (different explicit normal instructions). Our results show that current LLMs not only have a preference for certain instruction verbs, but also exhibit different jailbreak rates for different instruction verbs in explicit normal instructions. In other words, the probability of generating unsafe content by the model will be reinforced to varying degrees depending on the instruction verb in explicit normal instructions. Code and data are available at https://github.com/qiuhuachuan/latent-jailbreak.

READ FULL TEXT

page 3

page 7

research
09/14/2023

Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions

Training large language models to follow instructions makes them perform...
research
04/24/2023

WizardLM: Empowering Large Language Models to Follow Complex Instructions

Training large language models (LLM) with open-domain instruction follow...
research
08/17/2023

Do you really follow me? Adversarial Instructions for Evaluating the Robustness of Large Language Models

Large Language Models (LLMs) have shown remarkable proficiency in follow...
research
07/31/2023

Virtual Prompt Injection for Instruction-Tuned Large Language Models

We present Virtual Prompt Injection (VPI) for instruction-tuned Large La...
research
03/14/2022

GrIPS: Gradient-free, Edit-based Instruction Search for Prompting Large Language Models

Providing natural language instructions in prompts is a useful new parad...
research
04/20/2023

Safety Assessment of Chinese Large Language Models

With the rapid popularity of large language models such as ChatGPT and G...
research
05/19/2023

InstructIE: A Chinese Instruction-based Information Extraction Dataset

We introduce a new Information Extraction (IE) task dubbed Instruction-b...

Please sign up or login with your details

Forgot password? Click here to reset