An Automatic Evaluation Framework for Multi-turn Medical Consultations Capabilities of Large Language Models

09/05/2023
by   Yusheng Liao, et al.

Large language models (LLMs) have achieved significant success in interacting with humans. However, recent studies have revealed that these models often suffer from hallucinations, leading to overly confident but incorrect judgments. This limits their application in the medical domain, where tasks require the utmost accuracy. This paper introduces an automated evaluation framework that assesses the practical capabilities of LLMs as virtual doctors during multi-turn consultations. Consultation tasks are designed to require LLMs to be aware of what they do not know, to inquire about missing medical information from patients, and to ultimately make diagnoses. To evaluate the performance of LLMs for these tasks, a benchmark is proposed by reformulating medical multiple-choice questions from the United States Medical Licensing Examinations (USMLE), and comprehensive evaluation metrics are developed and evaluated on three constructed test sets. A medical consultation training set is further constructed to improve the consultation ability of LLMs. The results of the experiments show that fine-tuning with the training set can alleviate hallucinations and improve LLMs' performance on the proposed benchmark. Extensive experiments and ablation studies are conducted to validate the effectiveness and robustness of the proposed framework.

research
08/28/2023

DISC-MedLLM: Bridging General Large Language Models and Real-World Medical Consultation

We propose DISC-MedLLM, a comprehensive solution that leverages Large La...
research
07/20/2023

IvyGPT: InteractiVe Chinese pathwaY language model in medical domain

General large language models (LLMs) such as ChatGPT have shown remarkab...
research
07/28/2023

Med-HALT: Medical Domain Hallucination Test for Large Language Models

This research paper focuses on the challenges posed by hallucinations in...
research
03/20/2023

Capabilities of GPT-4 on Medical Challenge Problems

Large language models (LLMs) have demonstrated remarkable capabilities i...
research
06/14/2021

Probing Pre-Trained Language Models for Disease Knowledge

Pre-trained language models such as ClinicalBERT have achieved impressiv...
research
09/19/2023

MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback

To solve complex tasks, large language models (LLMs) often require multi...
research
09/08/2023

The CALLA Dataset: Probing LLMs' Interactive Knowledge Acquisition from Chinese Medical Literature

The application of Large Language Models (LLMs) to the medical domain ha...
