Enriched Annotations for Tumor Attribute Classification from Pathology Reports with Limited Labeled Data

12/15/2020
by Nick Altieri, et al.

Precision medicine has the potential to revolutionize healthcare, but much patient data is locked away in unstructured free text, limiting research and the delivery of effective personalized treatments. Generating large annotated datasets for information extraction from clinical notes is often challenging and expensive because high-quality annotations require a high level of expertise. To enable natural language processing on small datasets, we develop a novel enriched hierarchical annotation scheme and an algorithm, Supervised Line Attention (SLA), and apply it to predicting categorical tumor attributes from kidney and colon cancer pathology reports from the University of California San Francisco (UCSF). Whereas previous work annotated only document-level labels, we also ask annotators to highlight the line or lines relevant to the final label, which increases the annotation time required per document by 20%. With the enriched annotations, we develop a simple and interpretable machine learning algorithm that first predicts the relevant lines in the document and then predicts the tumor attribute. Our results show that, across small dataset sizes of 32, 64, 128, and 186 labeled documents per cancer, SLA requires only half as many labeled documents as state-of-the-art methods to achieve similar or better micro-F1 and macro-F1 scores in the vast majority of the comparisons we made. Accounting for the increased annotation time, this leads to a 40% reduction in total annotation time relative to the state of the art.
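
The annotation-time figure follows from the two numbers stated in the abstract: half as many labeled documents, each taking 20% longer to annotate. The short Python sketch below only reproduces that arithmetic; the function and parameter names are illustrative assumptions and do not come from the paper.

```python
def relative_annotation_time(doc_fraction: float, per_doc_overhead: float) -> float:
    """Total enriched-annotation time relative to a document-level-only baseline.

    doc_fraction: fraction of the baseline's labeled documents that is needed
                  (0.5 means half as many documents).
    per_doc_overhead: extra annotation time per document (0.20 means 20% more).
    Both names are hypothetical and used only for this illustration.
    """
    return doc_fraction * (1.0 + per_doc_overhead)


# Figures taken from the abstract: half the documents, 20% more time per document.
ratio = relative_annotation_time(doc_fraction=0.5, per_doc_overhead=0.20)
reduction = 1.0 - ratio

print(f"relative annotation time: {ratio:.2f}")
print(f"reduction in total annotation time: {reduction:.0%}")
```

Under these assumptions the enriched scheme costs 0.5 × 1.2 = 0.6 of the baseline annotation time, which matches the stated reduction.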

