Medinify is a general tool for medical text classification that also includes functionality for collecting drug review sentiment datasets from multiple online drug forums.
- Python 3.6
Ensure that you have git installed.
In a Mac or Linux terminal:
git clone https://github.com/NLPatVCU/medinify.git
Use a virtual environment to manage dependencies and configuration. Make sure your current directory is the project installation directory, then enter:
python3 -m venv venv
source venv/bin/activate
pip install -e .
All datasets used with Medinify must be in .csv format, and have a text column (containing the text being labelled) and a label column (containing the text labels).
If the data in the .csv file requires extra processing (e.g., if the labels are non-numeric, or if certain texts need to be removed), that functionality can be added by subclassing the Dataset class. SentimentDataset is an example of this: it transforms star rating labels into sentiment labels.
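As a rough illustration of the kind of label transformation described above, the sketch below maps star ratings to binary sentiment labels in plain Python. It is not Medinify's SentimentDataset implementation: the names `stars_to_sentiment` and `relabel`, the thresholds (4 and above is positive, 2 and below is negative), and the choice to drop neutral ratings are all assumptions made for this example.

```python
from typing import Iterable, List, Optional, Tuple


def stars_to_sentiment(rating: float,
                       pos_threshold: float = 4.0,
                       neg_threshold: float = 2.0) -> Optional[int]:
    """Map a star rating to 1 (positive), 0 (negative), or None (neutral).

    The thresholds are illustrative assumptions, not Medinify's defaults.
    """
    if rating >= pos_threshold:
        return 1
    if rating <= neg_threshold:
        return 0
    return None


def relabel(rows: Iterable[Tuple[str, float]]) -> List[Tuple[str, int]]:
    """Convert (text, rating) pairs to (text, label) pairs, dropping neutral rows."""
    labelled = []
    for text, rating in rows:
        label = stars_to_sentiment(rating)
        if label is not None:
            labelled.append((text, label))
    return labelled


reviews = [("Worked well", 5), ("No effect", 3), ("Bad side effects", 1)]
print(relabel(reviews))  # [('Worked well', 1), ('Bad side effects', 0)]
```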
If you want to use Medinify's scraping functionality to collect drug review sentiment datasets:
Datasets can be collected from three sources: a url, a .txt file containing a list of urls, or a .txt file containing a list of drug names. The collection method for each is shown below:
from medinify.datasets import SentimentDataset
dataset = SentimentDataset()
"""
The default scraper is 'webmd', but the 'scraper' argument could also be set to 'everydayhealth', 'drugs', or 'drugratingz'.
There are also 'collect_user_id' and 'collect_urls' arguments, both of which default to False.
"""
# For collecting from url
dataset.collect('<valid_url>')
# For collecting from .txt url file
dataset.collect_from_urls(urls_file='path/to/urls/file')
# For collecting from .txt drug names file
dataset.collect_from_drug_names(drug_names_file='path/to/drug/names/file')
# To save .csv file
dataset.write_file('output_file_name.csv')
To load a .csv file into a dataset, the text column and label column must be specified:
from medinify.datasets import Dataset
dataset = Dataset('path/to/csv', text_column='<text column name>', label_column='<label column name>')
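Because the column names must match the .csv header, it can help to check the file before loading it. The helper below is a minimal sketch using only the standard library; `check_columns` is a hypothetical name introduced for this example and is not part of Medinify.

```python
import csv


def check_columns(csv_path, text_column, label_column):
    """Raise ValueError if either column name is missing from the CSV header.

    Hypothetical helper for illustration; not a Medinify function.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # Read only the first row, which is assumed to be the header.
        header = next(csv.reader(handle), [])
    missing = [name for name in (text_column, label_column) if name not in header]
    if missing:
        raise ValueError(f"{csv_path} is missing column(s): {', '.join(missing)}")
```

Calling `check_columns('path/to/csv', '<text column name>', '<label column name>')` before constructing the Dataset would surface a misspelled column name early.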
Medinify provides functionality for training, evaluating, and classifying with Naive Bayes, Random Forest, Support Vector Machine, and Convolutional Neural Network classification models.
from medinify.datasets import SentimentDataset
from medinify.classifiers import Classifier
# create a Dataset object and load data from .csv file
dataset = SentimentDataset('path/to/csv/file')
# create classifier
clf = Classifier()
"""
For classifiers, both 'learner' and 'representation' arguments can be specified
(they are 'nb' (Naive Bayes) and 'bow' (bag-of-words) by default)
All learners have a default representation that produces the best results.
Another representation ('embeddings', 'bow', or 'matrix') can be specified, but be careful
because it may be incompatible with the learner
"""
# fit model
model = clf.fit(dataset)
# evaluate model
eval_dataset = SentimentDataset('path/to/eval/dataset')
clf.evaluate(eval_dataset, trained_model=model)
# classify using model
classification_dataset = SentimentDataset('path/to/dataset')
clf.classify(classification_dataset, output_file='output_file.txt', trained_model=model)
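For readers unfamiliar with the 'bow' representation mentioned above, the following sketch shows the general idea of a bag-of-words vector: each text becomes a list of token counts over a shared vocabulary. This is not Medinify's internal code; the function names `build_vocabulary` and `bag_of_words`, and the lowercase whitespace tokenization, are assumptions chosen to keep the example short.

```python
from collections import Counter


def build_vocabulary(texts):
    """Assign each distinct token an index in order of first appearance."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def bag_of_words(text, vocab):
    """Return a count vector for `text`; tokens outside the vocabulary are ignored."""
    counts = Counter(text.lower().split())
    vector = [0] * len(vocab)
    for token, count in counts.items():
        if token in vocab:
            vector[vocab[token]] = count
    return vector


texts = ["great drug great results", "bad results"]
vocab = build_vocabulary(texts)
vectors = [bag_of_words(text, vocab) for text in texts]
print(vocab)    # {'great': 0, 'drug': 1, 'results': 2, 'bad': 3}
print(vectors)  # [[2, 1, 1, 0], [0, 0, 1, 1]]
```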
Trained models can be saved to and loaded from pickle files:
from medinify.datasets import SentimentDataset
from medinify.classifiers import Classifier
dataset = SentimentDataset('path/to/csv/file')
clf = Classifier()
model = clf.fit(dataset)
# save model
clf.save(model, 'path/to/save/model')
# load saved model
model2 = clf.load('saved/model/path')
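The save and load calls above rely on pickle files. As a sketch of what a pickle round trip looks like with the standard library alone, the example below serializes a stand-in object and restores it. The dictionary used as a "model" and the helper names `save_model` and `load_model` are hypothetical; Medinify's own save and load methods may do more than this.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialize `model` to `path` in binary pickle format."""
    with open(path, "wb") as handle:
        pickle.dump(model, handle)


def load_model(path):
    """Restore an object previously written by save_model."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Stand-in for a trained model (an assumption for illustration).
stand_in_model = {"learner": "nb", "weights": [0.1, 0.9]}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")
    save_model(stand_in_model, model_path)
    restored = load_model(model_path)

print(restored == stand_in_model)  # True
```

Pickle files can execute arbitrary code when loaded, so only load model files from sources you trust.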
Medinify also supports k-fold cross-validation:
from medinify.datasets import SentimentDataset
from medinify.classifiers import Classifier
dataset = SentimentDataset('path/to/csv/file')
clf = Classifier()
clf.validate(dataset, k_folds=5)
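To make the k_folds argument concrete, here is a minimal sketch of how k-fold cross-validation partitions a dataset: the samples are split into k folds, and each fold is held out once for testing while the rest are used for training. The function `k_fold_indices` is a hypothetical name for this example, it does not shuffle the data, and it is not a description of how Medinify's validate method is implemented.

```python
def k_fold_indices(n_samples, k_folds):
    """Yield (train_indices, test_indices) for each of `k_folds` contiguous folds."""
    if not 2 <= k_folds <= n_samples:
        raise ValueError("k_folds must be between 2 and n_samples")
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k_folds)
    start = 0
    for fold in range(k_folds):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train_idx, test_idx in k_fold_indices(6, 3):
    print(train_idx, test_idx)
# [2, 3, 4, 5] [0, 1]
# [0, 1, 4, 5] [2, 3]
# [0, 1, 2, 3] [4, 5]
```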
- Changes made, committed, and pushed in a new branch
- Changes not far behind develop
- Added comments and documentation to code
- Made sure styling matches Google style guide: http://google.github.io/styleguide/pyguide.html
- README updated if relevant changes made
- Copy the URL from the Medinify repository and use Git to clone the repo:
# Clone the repo into current directory
git clone https://github.com/NanoNLP/medinify.git
# Navigate to the newly cloned directory
cd medinify
- Create a branch off of develop to contain your changes:
git checkout -b <new-branch-name>
- After making changes to files or adding new files to the project, stage your changes:
git add <filename>
- Next, commit the changes with a message describing them so that others can understand what was done:
git commit -m "Description of changes made"
- After committing, make sure everything looks good with:
git status
You will receive output similar to this:
On branch <new-branch-name>
Your branch is ahead of 'origin/<new-branch-name>' by 1 commit.
  (use "git push" to publish your local commits)
nothing to commit, working directory clean
- Finally, push the changes to the new branch on origin:
# If the branch doesn't exist on GitHub yet
git push --set-upstream origin <new-branch-name>
# If the branch already exists
git push
After following the steps above, you can make a pull request directly on the Medinify GitHub. It should be a pull request to merge your new branch into develop.
Add a title and a description, then press the “Create pull request” button. If the pull request closes an issue, include the text "closes #14" (for issue 14, for example).
Navigate to the reviewers tab and request a reviewer to review the PR.
@article{gurdin2020analysis,
title={Analysis of Inter-Domain and Cross-Domain Drug Review Polarity Classification},
author={Gurdin, Gabrielle and Vargas, Jorge A and Maffey, Luke G and Olex, Amy L and Lewinski, Nastassja A and McInnes, Bridget T},
journal={AMIA Summits on Translational Science Proceedings},
volume={2020},
pages={201},
year={2020},
publisher={American Medical Informatics Association}
}
Bridget McInnes, Jorge Vargas, Gabby Gurdin, Nathan West, Ishaan Thakur, Mark Groves