Humans in Humans Out: On GPT Converging Toward Common Sense in both Success and Failure

by Philipp Koralus, et al.

Increases in computational scale and fine-tuning have produced dramatic improvements in the quality of outputs from large language models (LLMs) such as GPT. Given that both GPT-3 and GPT-4 were trained on large quantities of human-generated text, we might ask to what extent their outputs reflect patterns of human thinking, in both correct and incorrect cases. The Erotetic Theory of Reason (ETR) provides a symbolic generative model of both human success and failure in thinking, across propositional, quantified, and probabilistic reasoning, as well as decision-making. We presented GPT-3, GPT-3.5, and GPT-4 with 61 central inference and judgment problems from a recent book-length presentation of ETR, consisting of experimentally verified data points on human judgment and extrapolated data points predicted by ETR, covering correct inference patterns as well as fallacies and framing effects (the ETR61 benchmark). ETR61 includes classics like Wason's card task, illusory inferences, the decoy effect, and opportunity-cost neglect, among others. GPT-3 showed evidence of ETR-predicted outputs for 59% of these examples, a rate that rose to 77% in later models, while the production of human-like fallacious judgments increased from 18% in GPT-3 to a markedly higher rate in GPT-4. This suggests that larger and more advanced LLMs may develop a tendency toward more human-like mistakes, as the relevant thought patterns are inherent in human-produced training data. According to ETR, the same fundamental patterns are involved in both successful and unsuccessful ordinary reasoning, so the "bad" cases could paradoxically be learned from the "good" cases. We further present preliminary evidence that ETR-inspired prompt engineering can reduce instances of these mistakes.
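To make one of the benchmark items concrete, here is a minimal sketch (not from the paper) of the normatively correct answer to Wason's card task, one of the classic problems included in ETR61. Four cards show A, K, 4, and 7; the rule is "if a card has a vowel on one side, it has an even number on the other." A card must be turned over exactly when its hidden side could falsify the rule. The card labels and the `must_turn` helper below are illustrative choices, not anything specified by the paper:

```python
VOWELS = set("AEIOU")

def must_turn(card: str) -> bool:
    """True iff the card's hidden side could falsify the rule
    'if vowel on one side, then even number on the other'."""
    if card.isalpha():
        # Hidden side is a number; only a vowel card can be falsified
        # (by an odd number underneath).
        return card in VOWELS
    # Hidden side is a letter; only an odd number can be falsified
    # (by a vowel underneath). Even numbers cannot violate the rule.
    return int(card) % 2 == 1

selection = [c for c in ["A", "K", "4", "7"] if must_turn(c)]
print(selection)  # → ['A', '7']
```

The logically correct selection is A and 7, yet people typically choose A and 4 (a matching bias); ETR predicts this characteristic error, and the paper reports GPT models increasingly reproducing such human-like patterns.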




