clinspell

This repository contains source code for the paper 'Unsupervised Context-Sensitive Spelling Correction of English and Dutch Clinical Free-Text with Word and Character N-gram Embeddings', which is published in Volume 7 of CLIN Journal. A shorter paper, which focuses exclusively on our English experiments, was presented at the BioNLP 2017 workshop at ACL: 'Unsupervised Context-Sensitive Spelling Correction of Clinical Free-Text with Word and Character N-gram Embeddings.' The source code offered here contains scripts to extract our manually annotated MIMIC-III data, and to run the experiments described in our paper.

License

MIT

Requirements

Python 3
Python 2.7 (used to run extract_test.py)
Numpy
pyxdameraulevenshtein
Facebook fastText
fasttext, a Python interface for Facebook fastText

All packages are available from pip, except fastText. To install these requirements, just run

pip install -r requirements.txt

from inside the cloned repository.

In order to build fastText, use the following:

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ make

To extract our manually annotated MIMIC-III test data, you should have access to the MIMIC-III database.

Usage

Demo

To demo the context-sensitive spelling correction model with the best parameters from the experiments, go to the demo directory and follow the instructions in the README.

Extracting the English test data

To extract the annotated test data, run

python2.7 extract_test.py [path to NOTEEVENTS.csv file from the MIMIC-III database]

This script preprocesses the NOTEEVENTS.csv data and stores the result in the file mimic_preprocessed.txt. It then extracts the annotated test data and stores it in the file testcorpus.json as four lists: correct replacements, misspellings, misspelling contexts, and line indices.
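As a rough illustration of how this output might be read back, the sketch below loads testcorpus.json and unpacks the four lists in the order listed above. It assumes the file holds a top-level list of four lists; the function name `load_testcorpus` and the variable names are hypothetical and not part of the repository.

```python
import json


def load_testcorpus(path="testcorpus.json"):
    """Load the extracted test data and unpack its four lists.

    Assumes the JSON file holds a top-level list of four lists, in the
    order described above. The names used here are illustrative only.
    """
    with open(path, "r") as f:
        corrections, misspellings, contexts, line_indices = json.load(f)
    return corrections, misspellings, contexts, line_indices
```

The first three lists are the ones passed on as the test corpus in the ranking example further down.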

Extracting development data and other resources

Preprocessing

To generate development corpora as described in the paper, the data has to be preprocessed. To preprocess English data, run

python3 preprocess.py [path to raw data] [path to created preprocessed data]

This script uses the source code of the English tokenizer from Pattern.

To preprocess Dutch data, you can use the Ucto tokenizer and, for every line, retain every token which matches

r'(^[^\d\W])[^\d\W]*(-[^\d\W]*)*([^\d\W]$)'
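A minimal sketch of that filtering step in Python, assuming each line has already been tokenized (for example by Ucto) into whitespace-separated tokens. The pattern is copied from above; the helper names `filter_tokens` and `filter_line` are hypothetical.

```python
import re

# Pattern copied from the text above: a token must start and end with a
# non-digit word character and may contain internal hyphens.
TOKEN_PATTERN = re.compile(r'(^[^\d\W])[^\d\W]*(-[^\d\W]*)*([^\d\W]$)')


def filter_tokens(tokens):
    """Keep only the tokens that match TOKEN_PATTERN."""
    return [token for token in tokens if TOKEN_PATTERN.match(token)]


def filter_line(line):
    """Filter one whitespace-tokenized line and rejoin the kept tokens."""
    return " ".join(filter_tokens(line.split()))
```

Note that the pattern requires a separate first and last character, so single-character tokens are dropped along with numbers, punctuation, and tokens that end in a hyphen.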

Generating frequency lists and neural embeddings

To extract a frequency list from the preprocessed data, run

python3 frequencies.py [path to preprocessed data] [language]

The [language] argument must be either en for English or nl for Dutch.
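For readers unfamiliar with the term, a frequency list simply maps each token to the number of times it occurs in the corpus. The sketch below shows the idea on whitespace-tokenized lines; it is not the implementation of frequencies.py, whose output format is not described here, and `count_frequencies` is a hypothetical name.

```python
from collections import Counter


def count_frequencies(lines):
    """Count token frequencies over an iterable of whitespace-tokenized lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts
```

Because the function accepts any iterable of strings, an open file handle on the preprocessed data can be passed in directly without reading the whole file into memory.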

To train the fastText vectors as we do, place the preprocessed data in the cloned fastText directory and run

./fasttext skipgram -input [path to preprocessed data] -output ../data/embeddings_[language] -dim 300

This creates an embeddings_[language].vec and an embeddings_[language].bin file in the data directory. Only the embeddings_[language].bin file is used by the code.

Generating development corpora

To create a development corpus from preprocessed data, run

python3 make_devcorpus.py [path to preprocessed data] [language] [path to created devcorpus] [window size] [allow oov] [samplesize]

The [window size] argument specifies the minimal token window size on each side of a generated development instance. The [allow oov] argument should be False for development setup 1 or 2 from the paper, and True for development setup 3. The [samplesize] argument should contain the number of lines to sample from the data.
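One way to read the [window size] constraint is that a token can only serve as a development instance if at least that many tokens surround it on each side. The sketch below illustrates that reading only; it is an assumption about the constraint, not the logic of make_devcorpus.py, and `eligible_positions` is a hypothetical helper.

```python
def eligible_positions(tokens, window_size):
    """Return the indices that have at least `window_size` tokens on both sides."""
    last_index = len(tokens) - 1
    return [
        i for i in range(len(tokens))
        if i >= window_size and last_index - i >= window_size
    ]
```

Under this reading, a line shorter than 2 * [window size] + 1 tokens yields no eligible positions at all.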

Conducting experiments

Generating candidates

To generate candidates for a created development corpus, run

python3 candidates.py [path to preprocessed data] 2 [name of output] [language]

To generate candidates for our extracted test data or other empirically observed data, run

python3 candidates.py [path to preprocessed data] all [name of output] [language]

Ranking experiments

The Development class in ranking_experiments.py contains all functions to conduct the experiments.

Example:

import json

from ranking_experiments import Development

# load devcorpus for setup 1, 2 and 3
with open('devcorpus_setup1.json', 'r') as f:
    corpusfiles_setup1 = json.load(f)
devcorpus_setup1 = corpusfiles_setup1[:3]

with open('devcorpus_setup2.json', 'r') as f:
    corpusfiles_setup2 = json.load(f)
devcorpus_setup2 = corpusfiles_setup2[:3]

with open('devcorpus_setup3.json', 'r') as f:
    corpusfiles_setup3 = json.load(f)
devcorpus_setup3 = corpusfiles_setup3[:3]

# load candidates for setup 1, 2 and 3
with open('candidates_devcorpus_setup1.json', 'r') as f:
    candidates_setup1 = json.load(f)
with open('candidates_devcorpus_setup2.json', 'r') as f:
    candidates_setup2 = json.load(f)
with open('candidates_devcorpus_setup3.json', 'r') as f:
    candidates_setup3 = json.load(f)

# perform grid search
scores_setup1 = Development.grid_search(devcorpus_setup1, candidates_setup1, language='en')
scores_setup2 = Development.grid_search(devcorpus_setup2, candidates_setup2, language='en')

# search for best averaged parameters
best_parameters = Development.define_best_parameters(iv=[scores_setup1, scores_setup2])

# perform grid search for oov penalty
oov_scores_setup1 = Development.tune_oov(devcorpus_setup1, candidates_setup1, best_parameters, language='en')
oov_scores_setup2 = Development.tune_oov(devcorpus_setup2, candidates_setup2, best_parameters, language='en')
oov_scores_setup3 = Development.tune_oov(devcorpus_setup3, candidates_setup3, best_parameters, language='en')

# search for best averaged oov penalty
best_oov = Development.define_best_parameters(iv=[oov_scores_setup1, oov_scores_setup2], oov=oov_scores_setup3)

# store best parameters
best_parameters['oov_penalty'] = best_oov
with open('parameters.json', 'w') as f:
    json.dump(best_parameters, f)

# conduct ranking experiments with best parameters on test data
with open('testcorpus.json', 'r') as f:
    testfiles = json.load(f)
testcorpus = testfiles[:3]

with open('testcandidates.json', 'r') as f:
    testcandidates = json.load(f)

# ranking experiment and analysis per frequency scenario for our context-sensitive model, noisy channel model, and majority frequency

best_parameters['ranking_method'] = 'context'
dev = Development(best_parameters, language='en')
accuracy_context, correction_list_context = dev.conduct_experiment(testcorpus, testcandidates)
frequency_analysis_context = dev.frequency_analysis()

best_parameters['ranking_method'] = 'noisy_channel'
dev = Development(best_parameters, language='en')
accuracy_noisychannel, correction_list_noisychannel = dev.conduct_experiment(testcorpus, testcandidates)
frequency_analysis_noisychannel = dev.frequency_analysis()

best_parameters['ranking_method'] = 'frequency'
dev = Development(best_parameters, language='en')
accuracy_frequency, correction_list_frequency = dev.conduct_experiment(testcorpus, testcandidates)
