A Pilot Evaluation of ChatGPT and DALL-E 2 on Decision Making and Spatial Reasoning

by Zhisheng Tang, et al.

We conduct a pilot study selectively evaluating the cognitive abilities (decision making and spatial reasoning) of two recently released generative transformer models, ChatGPT and DALL-E 2. Input prompts were constructed following neutral a priori guidelines rather than with adversarial intent. Post hoc qualitative analysis of the outputs shows that DALL-E 2 is able to generate at least one correct image for each spatial reasoning prompt, but that most of the images generated are incorrect, even though the model seems to have a clear understanding of the objects mentioned in the prompt. Similarly, in evaluating ChatGPT on the rationality axioms developed under the classical von Neumann-Morgenstern utility theorem, we find that, although it demonstrates some level of rational decision making, many of its decisions violate at least one of the axioms, even under reasonable constructions of preferences, bets, and decision-making prompts. ChatGPT's outputs on such problems generally tended to be unpredictable: even as it made irrational decisions (or employed an incorrect reasoning process) for some simpler decision-making problems, it was able to draw correct conclusions for more complex bet structures. We briefly comment on the nuances and challenges involved in scaling up such a 'cognitive' evaluation, or conducting it with a closed set of answer keys ('ground truth'), given that these models are inherently generative and open-ended in responding to prompts.
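To make the axiom-checking idea concrete, the following is a minimal, hypothetical sketch of how a model's stated pairwise preferences over bets could be audited against the von Neumann-Morgenstern transitivity axiom. The bets (A, B, C), the preference table, and the helper functions are illustrative assumptions, not the paper's actual protocol or data.

```python
# Hypothetical sketch: auditing a model's pairwise bet preferences for
# violations of the VNM transitivity axiom. The preference table below
# is invented for illustration; it is NOT data from the paper.
from itertools import permutations

# Suppose we prompted a model to choose between each pair of bets and
# recorded which bet it said it preferred.
preferences = {
    ("A", "B"): "A",  # model preferred A over B
    ("B", "C"): "B",  # model preferred B over C
    ("A", "C"): "C",  # model preferred C over A -> cycle
}

def preferred(x, y):
    """Look up the preferred bet for a pair, ignoring pair order."""
    if (x, y) in preferences:
        return preferences[(x, y)]
    return preferences[(y, x)]

def transitivity_violations(bets):
    """Return triples (x, y, z) where x > y and y > z but not x > z."""
    violations = []
    for x, y, z in permutations(bets, 3):
        if preferred(x, y) == x and preferred(y, z) == y:
            if preferred(x, z) != x:
                violations.append((x, y, z))
    return violations

print(transitivity_violations(["A", "B", "C"]))
# The cyclic table above yields three violating triples.
```

In an actual evaluation, the preference table would be filled in from the model's responses to decision-making prompts, and analogous checks could be written for the other axioms (e.g., independence) under whatever bet constructions the evaluator chooses.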




