Performance of ChatGPT-3.5 and GPT-4 on the United States Medical Licensing Examination With and Without Distractions

by Myriam Safrai, et al.

Because Large Language Models (LLMs) are predictive models that build their responses from the words in the prompt, small talk and irrelevant information may alter the response and the suggestion given. This study therefore investigates how mixing medical data with small talk affects the accuracy of medical advice provided by ChatGPT. USMLE Step 3 questions were used as a model for relevant medical data. We used both multiple-choice and open-ended questions. We gathered small talk sentences from human participants using the Mechanical Turk platform. Both sets of USMLE questions were arranged in a pattern where each sentence from the original question was followed by a small talk sentence. ChatGPT-3.5 and ChatGPT-4 were asked to answer both sets of questions with and without the small talk sentences. A board-certified physician analyzed ChatGPT's answers and compared them to the formal correct answers. The results show that the ability of ChatGPT-3.5 to answer correctly was impaired when small talk was added to medical data, both for multiple-choice questions (72.1% vs. 68.9%) and for open questions (61.5% vs. 44.3%; p=0.01). In contrast, small talk phrases did not impair ChatGPT-4's ability on either type of question (83.6% and 66.2%, respectively). According to these results, ChatGPT-4 seems more accurate than the earlier 3.5 version, and small talk does not appear to impair its capability to provide medical recommendations. Our results are an important first step in understanding the potential and limitations of using ChatGPT and other LLMs for physician-patient interactions, which include casual conversations.
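The interleaving pattern described in the abstract, where each question sentence is followed by one small talk sentence, can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the naive regex sentence splitter, the cycling through the small talk pool, and the placeholder texts are all assumptions made for illustration.

```python
import re
from itertools import cycle


def split_sentences(text):
    # Naive splitter (assumption): break after ., ? or ! followed by whitespace.
    # Abbreviations such as "Dr." would be split incorrectly.
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]


def interleave_small_talk(question, small_talk):
    """Follow each sentence of `question` with one small talk sentence."""
    if not small_talk:
        raise ValueError("small_talk must contain at least one sentence")
    fillers = cycle(small_talk)  # reuse fillers if the question is longer
    mixed = []
    for sentence in split_sentences(question):
        mixed.append(sentence)
        mixed.append(next(fillers))
    return " ".join(mixed)


# Hypothetical placeholder texts, not taken from the study's data.
question = "First clinical sentence. Second clinical sentence. What is the next step?"
small_talk = ["The weather is nice today.", "I had coffee this morning."]
mixed_prompt = interleave_small_talk(question, small_talk)
print(mixed_prompt)
```

Each question would then be sent to the model twice, once as `question` and once as `mixed_prompt`, so the two sets of answers can be compared against the reference answer.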




