LLM Censorship: A Machine Learning Challenge or a Computer Security Problem?

07/20/2023
by David Glukhov et al.

Large language models (LLMs) have exhibited impressive capabilities in comprehending and following complex instructions. However, their blind adherence to provided instructions has raised concerns about the risks of malicious use. Existing defence mechanisms, such as model fine-tuning or output censorship using LLMs, have proven fallible: LLMs can still generate problematic responses. Commonly employed censorship approaches treat the issue as a machine learning problem and rely on another LLM to detect undesirable content in LLM outputs. In this paper, we present the theoretical limitations of such semantic censorship approaches. Specifically, we demonstrate that semantic censorship is in general an undecidable problem, a consequence of LLMs' programmatic and instruction-following capabilities: for instance, a model can be instructed to encode or encrypt its output, so a censor must judge the semantic properties of arbitrarily transformed strings. Furthermore, we argue that the challenges extend beyond semantic censorship, as knowledgeable attackers can reconstruct an impermissible output from a collection of individually permissible ones. Consequently, we propose that the problem of censorship be reevaluated: it should be treated as a security problem, warranting the adaptation of security-based approaches to mitigate potential risks.
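
To make the abstract's two arguments concrete, here is a minimal, self-contained sketch (not code from the paper; the blocklist, the `keyword_censor` helper, and the example strings are hypothetical). It simulates a string-level output censor, then shows (1) how a model asked to transform its output (ROT13 here, standing in for arbitrary encodings up to real encryption) slips past the censor, and (2) how an impermissible output can be reconstructed from individually permissible pieces.

```python
import codecs

# Hypothetical blocklist of impermissible content.
BLOCKLIST = {"secret_exploit"}

def keyword_censor(text: str) -> bool:
    """Toy output censor: flag any text containing a blocked string."""
    return any(bad in text.lower() for bad in BLOCKLIST)

# 1) Encoding evasion: the user asks the model to ROT13-encode its answer.
#    ROT13 stands in for any transformation, up to encryption, that an
#    instruction-following LLM can apply on request.
impermissible = "secret_exploit payload"
encoded_output = codecs.encode(impermissible, "rot13")
assert not keyword_censor(encoded_output)                       # passes censor
assert codecs.decode(encoded_output, "rot13") == impermissible  # user decodes

# 2) Reconstruction from permissible outputs: each response passes the
#    censor on its own, but the attacker reassembles the whole.
pieces = ["secret_", "exploit", " payload"]
assert all(not keyword_censor(piece) for piece in pieces)
assert "".join(pieces) == impermissible
```

Both failure modes survive stronger censors than this keyword matcher: any censor that judges outputs one string at a time faces the same transformation and decomposition attacks, which is what motivates the paper's shift from a machine-learning framing to a security framing.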


Related research

08/17/2023
Do you really follow me? Adversarial Instructions for Evaluating the Robustness of Large Language Models
Large Language Models (LLMs) have shown remarkable proficiency in follow...

02/11/2023
Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks
Recent advances in instruction-following large language models (LLMs) ha...

09/14/2023
Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
Training large language models to follow instructions makes them perform...

04/27/2023
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions
Large language models (LLMs) with instruction finetuning demonstrate sup...

05/09/2023
Towards Building the Federated GPT: Federated Instruction Tuning
While "instruction-tuned" generative large language models (LLMs) have d...

04/18/2023
Stochastic Parrots Looking for Stochastic Parrots: LLMs are Easy to Fine-Tune and Hard to Detect with other LLMs
The self-attention revolution allowed generative language models to scal...

12/15/2021
Do You See What I See? Capabilities and Limits of Automated Multimedia Content Analysis
The ever-increasing amount of user-generated content online has led, in ...
