Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting

05/07/2023
by   Miles Turpin, et al.
0

Large Language Models (LLMs) can achieve strong performance on many tasks by producing step-by-step reasoning before giving a final output, often referred to as chain-of-thought reasoning (CoT). It is tempting to interpret these CoT explanations as the LLM's process for solving a task. However, we find that CoT explanations can systematically misrepresent the true reason for a model's prediction. We demonstrate that CoT explanations can be heavily influenced by adding biasing features to model inputs – e.g., by reordering the multiple-choice options in a few-shot prompt to make the answer always "(A)" – which models systematically fail to mention in their explanations. When we bias models toward incorrect answers, they frequently generate CoT explanations supporting those answers. This causes accuracy to drop by as much as 36 suite of 13 tasks from BIG-Bench Hard, when testing with GPT-3.5 from OpenAI and Claude 1.0 from Anthropic. On a social-bias task, model explanations justify giving answers in line with stereotypes without mentioning the influence of these social biases. Our findings indicate that CoT explanations can be plausible yet misleading, which risks increasing our trust in LLMs without guaranteeing their safety. CoT is promising for explainability, but our results highlight the need for targeted efforts to evaluate and improve explanation faithfulness.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
03/07/2023

CoTEVer: Chain of Thought Prompting Annotation Toolkit for Explanation Verification

Chain-of-thought (CoT) prompting enables large language models (LLMs) to...
research
09/20/2022

Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering

When answering a question, humans utilize the information available acro...
research
05/31/2023

Majority Rule: better patching via Self-Consistency

Large Language models (LLMs) can be induced to solve non-trivial problem...
research
10/13/2022

Explanations from Large Language Models Make Small Reasoners Better

Integrating free-text explanations to in-context learning of large langu...
research
04/05/2022

Can language models learn from explanations in context?

Large language models can perform new tasks by adapting to a few in-cont...
research
05/24/2023

Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective

Recent studies have discovered that Chain-of-Thought prompting (CoT) can...
research
06/09/2023

Using Foundation Models to Detect Policy Violations with Minimal Supervision

Foundation models, i.e. large neural networks pre-trained on large text ...

Please sign up or login with your details

Forgot password? Click here to reset