PHICON: Improving Generalization of Clinical Text De-identification Models via Data Augmentation

10/11/2020
by Xiang Yue, et al.

De-identification is the task of identifying protected health information (PHI) in clinical text. Existing neural de-identification models often fail to generalize to a new dataset. We propose PHICON, a simple yet effective data augmentation method, to alleviate this generalization issue. PHICON consists of PHI augmentation and Context augmentation, which create augmented training corpora by replacing PHI entities with named entities sampled from external sources, and by changing the background context with synonym replacement or random word insertion, respectively. Experimental results on the i2b2 2006 and 2014 de-identification challenge datasets show that PHICON can help three selected de-identification models boost F1-score (by at most 8.6%) in the cross-dataset test setting. We also discuss how much augmentation to use and how each augmentation method influences the performance.
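
The two augmentation steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes BIO-tagged token sequences, and the `NAME_POOL` and `SYNONYMS` tables, the function names, and the `p_replace` and `n_insert` parameters are all hypothetical stand-ins for the external sources and settings the abstract mentions without specifying.

```python
import random

# Hypothetical stand-ins for external entity sources and a synonym lexicon.
NAME_POOL = {
    "PATIENT": [["John", "Smith"], ["Maria", "Lopez"]],
    "HOSPITAL": [["Mercy", "General", "Hospital"], ["St.", "Luke's"]],
}
SYNONYMS = {"admitted": ["hospitalized"], "pain": ["ache", "discomfort"]}


def phi_augment(tokens, labels, rng):
    """Replace each BIO-tagged PHI span with a sampled entity of the same type."""
    out_tokens, out_labels = [], []
    i = 0
    while i < len(tokens):
        label = labels[i]
        if label.startswith("B-"):
            phi_type = label[2:]
            # Find the end of the span that starts at position i.
            j = i + 1
            while j < len(tokens) and labels[j] == "I-" + phi_type:
                j += 1
            pool = NAME_POOL.get(phi_type)
            # Keep the original span when no replacement pool exists.
            new_span = rng.choice(pool) if pool else tokens[i:j]
            out_tokens.extend(new_span)
            out_labels.extend(
                ["B-" + phi_type] + ["I-" + phi_type] * (len(new_span) - 1)
            )
            i = j
        else:
            out_tokens.append(tokens[i])
            out_labels.append(label)
            i += 1
    return out_tokens, out_labels


def context_augment(tokens, labels, rng, p_replace=0.2, n_insert=1):
    """Perturb non-PHI context by synonym replacement and random insertion."""
    out_tokens, out_labels = list(tokens), list(labels)
    # Synonym replacement: only tokens labeled "O" are ever changed.
    for i, (tok, lab) in enumerate(zip(tokens, labels)):
        if lab == "O" and tok.lower() in SYNONYMS and rng.random() < p_replace:
            out_tokens[i] = rng.choice(SYNONYMS[tok.lower()])
    # Random insertion: insert a synonym of a random context word.
    candidates = [
        t.lower() for t, l in zip(tokens, labels)
        if l == "O" and t.lower() in SYNONYMS
    ]
    for _ in range(n_insert):
        if not candidates:
            break
        word = rng.choice(SYNONYMS[rng.choice(candidates)])
        # Never insert directly before an "I-" label, which would split a span.
        valid = [
            p for p in range(len(out_tokens) + 1)
            if p == len(out_tokens) or not out_labels[p].startswith("I-")
        ]
        pos = rng.choice(valid)
        out_tokens.insert(pos, word)
        out_labels.insert(pos, "O")
    return out_tokens, out_labels


rng = random.Random(0)
tokens = ["Jane", "Doe", "was", "admitted", "to", "Mercy", "Clinic", "with", "pain"]
labels = ["B-PATIENT", "I-PATIENT", "O", "O", "O",
          "B-HOSPITAL", "I-HOSPITAL", "O", "O"]
print(phi_augment(tokens, labels, rng))
print(context_augment(tokens, labels, rng, p_replace=0.5))
```

In this sketch the labels are regenerated alongside the tokens, so a replacement entity of a different length still yields a consistent tag sequence, and context edits never touch or split a PHI span.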


research
06/10/2023

Medical Data Augmentation via ChatGPT: A Case Study on Medication Identification and Medication Event Classification

The identification of key factors such as medications, diseases, and rel...
research
12/12/2019

Training without training data: Improving the generalizability of automated medical abbreviation disambiguation

Abbreviation disambiguation is important for automated clinical note pro...
research
06/25/2022

ConcreteGraph: A Data Augmentation Method Leveraging the Properties of Concept Relatedness Estimation

The concept relatedness estimation (CRE) task is to determine whether tw...
research
06/06/2023

Augmenting Reddit Posts to Determine Wellness Dimensions impacting Mental Health

Amid ongoing health crisis, there is a growing necessity to discern poss...
research
06/07/2021

CAiRE in DialDoc21: Data Augmentation for Information-Seeking Dialogue System

Information-seeking dialogue systems, including knowledge identification...
research
02/10/2023

Cross-Corpora Spoken Language Identification with Domain Diversification and Generalization

This work addresses the cross-corpora generalization issue for the low-r...
research
04/30/2021

Adapting Coreference Resolution for Processing Violent Death Narratives

Coreference resolution is an important component in analyzing narrative ...
