This repository contains the source code for the paper "PHICON: Improving Generalization of Clinical Text De-identification Models via Data Augmentation" (accepted at the EMNLP 2020 3rd Clinical Natural Language Processing Workshop). PHICON is a simple yet effective data augmentation method that alleviates the generalization issue in de-identification. PHICON consists of PHI augmentation and Context augmentation (as shown in Figure 1): the former creates augmented training corpora by replacing PHI entities with named entities sampled from external sources, and the latter changes the background context through synonym replacement or random word insertion.
Figure 1: Toy examples of our PHICON data augmentation. SR: synonym replacement. RI: random insertion.

- Download the Stanford Parser and update the corresponding path in the rule_modules.py file
- Install the spaCy package
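
To make the two augmentation operations concrete, here is a minimal illustrative Python sketch (not the code used in this repository); the surrogate name list, the synonym choice, and the inserted word are all hypothetical.

```python
import random

# PHI augmentation: replace a PHI entity with a named entity sampled from an
# external source (here, a tiny hypothetical surrogate list).
surrogate_names = ["John Carter", "Maria Lopez", "Wei Chen"]  # hypothetical
sentence = "Patient Mary Smith was admitted to the hospital yesterday ."
phi_augmented = sentence.replace("Mary Smith", random.choice(surrogate_names))

# Context augmentation: synonym replacement (SR) on a background word and
# random insertion (RI) of an extra word; both choices are illustrative.
sr_augmented = phi_augmented.replace("hospital", "clinic")
tokens = sr_augmented.split()
tokens.insert(random.randrange(1, len(tokens)), "recently")  # RI
print(" ".join(tokens))
```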
The i2b2 2006 and i2b2 2014 de-identification datasets can be accessed from: https://portal.dbmi.hms.harvard.edu.
The data processing mainly follows the guidance from:
https://github.com/juand-r/entity-recognition-datasets/tree/master/data/i2b2_2006
https://github.com/juand-r/entity-recognition-datasets/tree/master/data/i2b2_2014
We also provide detailed steps for data processing and PHI augmentation in the following two notebooks:
PHI augmentation-i2b2-2006 dataset.ipynb
PHI augmentation-i2b2-2014 dataset.ipynb
If users already have a de-identification dataset in BIO format, they can directly conduct PHI Augmentation by following the guidance in this file:
PHI augmentation-your-own-dataset.ipynb
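
For orientation, the sketch below shows one plausible way to apply PHI Augmentation to BIO-tagged data; the tag set (e.g. `B-NAME`/`I-NAME`), the `token<TAB>tag` file layout, and the surrogate pools are assumptions, and blank sentence-separator lines are not handled, so please follow the notebook above for the exact procedure.

```python
import random

# Hypothetical surrogate pools keyed by PHI type; in PHICON these would be
# named entities sampled from external sources.
SURROGATES = {
    "NAME": [["John", "Carter"], ["Maria", "Lopez"]],
    "LOCATION": [["Boston"], ["Springfield", "General", "Hospital"]],
}

def phi_augment_bio(lines):
    """Replace each PHI span in CoNLL-style 'token<TAB>tag' lines with a
    sampled surrogate of the same type, re-emitting consistent BIO tags."""
    out, i = [], 0
    while i < len(lines):
        token, tag = lines[i].split("\t")
        if tag.startswith("B-") and tag[2:] in SURROGATES:
            phi_type = tag[2:]
            # Skip the remaining tokens of the original span (its I- tags).
            i += 1
            while i < len(lines) and lines[i].split("\t")[1] == f"I-{phi_type}":
                i += 1
            # Emit a sampled surrogate span with fresh BIO tags.
            surrogate = random.choice(SURROGATES[phi_type])
            for j, tok in enumerate(surrogate):
                out.append(f"{tok}\tB-{phi_type}" if j == 0 else f"{tok}\tI-{phi_type}")
        else:
            out.append(f"{token}\t{tag}")
            i += 1
    return out

# Usage on a toy example (one token and tag per line, separated by a tab):
example = ["Patient\tO", "Mary\tB-NAME", "Smith\tI-NAME", "was\tO", "admitted\tO", ".\tO"]
print("\n".join(phi_augment_bio(example)))
```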
To apply Context Augmentation to the training data, run:

python Context_Aug.py
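
As a rough reference for what this step does, the following is a hedged sketch of synonym replacement and random insertion over a token list using NLTK WordNet; the actual Context_Aug.py may choose words, parameters, and synonym sources differently (and should leave PHI tokens untouched), so treat it only as an illustration.

```python
import random
from nltk.corpus import wordnet  # requires: nltk.download('wordnet')

def get_synonyms(word):
    """Collect single-word WordNet synonyms of `word` (excluding itself)."""
    synonyms = set()
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            name = lemma.name().replace("_", " ")
            if name.lower() != word.lower() and " " not in name:
                synonyms.add(name)
    return list(synonyms)

def synonym_replacement(tokens, n):
    """SR: replace up to n randomly chosen words that have WordNet synonyms."""
    new_tokens = tokens[:]
    candidates = [i for i, t in enumerate(tokens) if t.isalpha()]
    random.shuffle(candidates)
    replaced = 0
    for i in candidates:
        syns = get_synonyms(tokens[i])
        if syns:
            new_tokens[i] = random.choice(syns)
            replaced += 1
        if replaced >= n:
            break
    return new_tokens

def random_insertion(tokens, n):
    """RI: insert a synonym of a random word at a random position, n times."""
    new_tokens = tokens[:]
    for _ in range(n):
        candidates = [t for t in new_tokens if t.isalpha()]
        random.shuffle(candidates)
        for word in candidates:
            syns = get_synonyms(word)
            if syns:
                new_tokens.insert(random.randrange(len(new_tokens) + 1), random.choice(syns))
                break
    return new_tokens

tokens = "the patient was admitted with severe chest pain".split()
print(" ".join(synonym_replacement(tokens, 2)))
print(" ".join(random_insertion(tokens, 1)))
```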
Please kindly cite the paper if you use the code or any resources in this repo:
@inproceedings{yue2020phicon,
title={PHICON: Improving Generalization of Clinical Text De-identification Models via Data Augmentation},
author={Xiang Yue and Shuang Zhou},
booktitle={Proceedings of the 3rd Clinical Natural Language Processing Workshop},
year={2020}
}