The Poison of Alignment

08/25/2023
by Aibek Bekbayev, et al.

From the perspective of content safety, alignment has been shown to limit large language models' (LLMs) harmful content generation. This intentional method of reinforcing models not to respond to certain user inputs seems to be present in many modern open-source instruction-tuning datasets such as OpenAssistant or Guanaco. We introduce a novel insight into how an instruction-tuned model's performance is affected by the presence of alignment in the supervised fine-tuning dataset. Specifically, we observe that alignment acts as if it is poisoning the instruction dataset. Experimentally, we demonstrate that aligned answers significantly worsen the performance of the resulting fine-tuned model on various reasoning benchmarks such as Big-Bench Hard (BBH), Massive Multitask Language Understanding (MMLU), HumanEval, and Discrete Reasoning Over Paragraphs (DROP), performing worse than the counterpart tuned without alignment by 4-33%.
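The abstract's observation suggests a simple data-cleaning remedy: identify refusal-style "aligned" answers in the supervised fine-tuning set and drop them before tuning. The Python sketch below illustrates that idea with a keyword heuristic; it is not the authors' actual pipeline, and the marker list and the "instruction"/"response" record schema are illustrative assumptions.

# Sketch: filter refusal-style "aligned" answers out of an instruction-tuning
# dataset before fine-tuning. The phrase list and record schema are
# assumptions, not the paper's exact method.
REFUSAL_MARKERS = [
    "i'm sorry, but",
    "as an ai language model",
    "i cannot assist with",
    "it is not appropriate",
]

def is_aligned_answer(response: str) -> bool:
    # Heuristically flag a response as a safety refusal.
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def filter_dataset(records: list[dict]) -> list[dict]:
    # Keep only instruction/response pairs whose answer is not a refusal.
    return [r for r in records if not is_aligned_answer(r["response"])]

if __name__ == "__main__":
    data = [
        {"instruction": "Summarize the abstract.",
         "response": "The paper argues that aligned answers degrade reasoning."},
        {"instruction": "How do I pick a lock?",
         "response": "I'm sorry, but I can't help with that."},
    ]
    print(f"kept {len(filter_dataset(data))} of {len(data)} examples")

In practice, keyword matching misses paraphrased refusals, so a classifier-based filter would be more robust; this sketch only illustrates the filtering step itself.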


Related research

AMR Parsing with Instruction Fine-tuned Pre-trained Language Models (04/24/2023)
Instruction fine-tuned language models on a collection of instruction an...

Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback (07/29/2023)
A key technology for the development of large language models (LLMs) inv...

Is GPT-3 a Psychopath? Evaluating Large Language Models from a Psychological Perspective (12/20/2022)
Are large language models (LLMs) like GPT-3 psychologically safe? In thi...

Making Large Language Models Better Reasoners with Alignment (09/05/2023)
Reasoning is a cognitive process of using evidence to reach a sound conc...

Poisoning Language Models During Instruction Tuning (05/01/2023)
Instruction-tuned LMs such as ChatGPT, FLAN, and InstructGPT are finetun...

Flesch or Fumble? Evaluating Readability Standard Alignment of Instruction-Tuned Language Models (09/11/2023)
Readability metrics and standards such as Flesch Kincaid Grade Level (FK...

Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment (08/18/2023)
Larger language models (LLMs) have taken the world by storm with their m...
