An Investigation of Evaluation Metrics for Automated Medical Note Generation

by Asma Ben Abacha, et al.

Recent studies on automatic note generation have shown that doctors can save significant amounts of time when using automatic clinical note generation (Knoll et al., 2022). Summarization models have been used for this task to generate clinical notes as summaries of doctor-patient conversations (Krishna et al., 2021; Cai et al., 2022). However, assessing which model would best serve clinicians in their daily practice remains challenging because of the large set of possible correct summaries and the potential limitations of automatic evaluation metrics. In this paper, we study evaluation methods and metrics for the automatic generation of clinical notes from medical conversations. In particular, we propose new task-specific metrics and compare them to state-of-the-art (SOTA) evaluation metrics in text summarization and generation, including: (i) knowledge-graph embedding-based metrics, (ii) customized model-based metrics, (iii) domain-adapted/fine-tuned metrics, and (iv) ensemble metrics. To study the correlation between the automatic metrics and manual judgments, we evaluate automatically generated notes/summaries by comparing the system and reference facts and computing the factual correctness and the hallucination and omission rates for critical medical facts. This study relied on seven datasets manually annotated by domain experts. Our experiments show that automatic evaluation metrics can behave substantially differently on different types of clinical note datasets. However, the results highlight one stable subset of metrics as the most correlated with human judgments, with a relevant aggregation of different evaluation criteria.
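The fact-based evaluation described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact definitions, so everything here is an assumption made for illustration: facts are treated as exact-match strings, the hallucination rate is the share of system facts absent from the reference, the omission rate is the share of reference facts absent from the system note, and Pearson correlation stands in for whichever correlation measure the authors used. The function names `fact_rates` and `pearson` are hypothetical.

```python
from math import sqrt


def fact_rates(system_facts, reference_facts):
    """Compare system and reference facts (assumed: exact string match)."""
    system, reference = set(system_facts), set(reference_facts)
    hallucinated = system - reference   # stated by the system, not in the reference
    omitted = reference - system        # in the reference, missing from the system
    return {
        # Assumed definition: share of system facts not supported by the reference.
        "hallucination_rate": len(hallucinated) / len(system) if system else 0.0,
        # Assumed definition: share of reference facts missing from the system note.
        "omission_rate": len(omitted) / len(reference) if reference else 0.0,
    }


def pearson(xs, ys):
    """Pearson correlation between metric scores and human judgments."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy facts, invented purely for illustration.
    rates = fact_rates(
        ["fever", "cough", "prescribed amoxicillin"],
        ["fever", "cough", "penicillin allergy", "follow-up in 2 weeks"],
    )
    print(rates)
    # Toy scores: one automatic metric vs. one human rating per note.
    print(pearson([0.2, 0.5, 0.9], [1.0, 3.0, 5.0]))
```

In practice, matching facts would require expert annotation or a softer notion of equivalence than exact string comparison; the sketch only shows how the rates and the metric-to-human correlation relate to one another.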

