Dysbiosis in Metabolic Genes of the Gut Microbiomes of Patients with an Ileo-anal Pouch Resembles That Observed in Crohn's Disease
This is the code used in the manuscript to build a machine learning model, a pouchitis classifier, that distinguishes between the clinical phenotypes of patients with a pouch (normal pouch vs. pouchitis) based on bacterial species, metabolic pathway or enzyme profiles from shotgun metagenomic data. After training on the discovery cohort (patients with a normal pouch or with pouchitis, N=208 samples), the classifier can be used to predict which patients whose samples were defined as the recurrent-acute pouchitis phenotype (the validation cohort, N=42 samples) will go on to have a normal pouch (disease improvement) or pouchitis (disease aggravation) at follow-up clinic visits. The prediction performance was: accuracy of ~76.2%, sensitivity of 88.9% and specificity of 53.3%. The classifier is built with xgboost, an implementation of gradient boosted trees (GBT). You can swap in another algorithm, for example random forest, but the code is written specifically for the xgboost package. For more information about xgboost, including a nice introduction to boosted trees, go to https://xgboost.readthedocs.io/en/latest/tutorials/model.html
You need to have Python version >=3.0 and the following modules installed:
sklearn
pandas
numpy
matplotlib
In addition, you need to install the XGBoost (eXtreme Gradient Boosting) module. If you usually install Python modules with pip, use:
pip3 install xgboost
If you are working with Conda, use:
conda install -c conda-forge xgboost
Three scripts are provided: pouchitis_classifiers_train_model.py, pouchitis_classifiers_plot_mean_ROC.py and pouchitis_classifiers_test_validation.py. The first one is used to build the model with defined hyperparameters, to train it on the discovery cohort using repeated k-fold cross validation and, finally, to obtain performance metrics and plot the highest-scoring features used in the classification.
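As a rough illustration of what repeated k-fold cross validation with a gradient boosting model looks like, here is a minimal sketch. It is not the code from pouchitis_classifiers_train_model.py: the data are synthetic, the hyperparameters are placeholders rather than the tuned values, and the use of stratified folds (5 splits, 3 repeats) is an assumption. If xgboost is not installed, the sketch falls back to scikit-learn's GradientBoostingClassifier as a stand-in.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

try:
    from xgboost import XGBClassifier as Model
except ImportError:
    # Stand-in so the sketch still runs without xgboost installed.
    from sklearn.ensemble import GradientBoostingClassifier as Model

# Synthetic stand-in for a samples x features profile table (208 samples).
X, y = make_classification(
    n_samples=208, n_features=50, n_informative=8, random_state=0
)

# Placeholder hyperparameters, NOT the values tuned in the manuscript.
model = Model(n_estimators=100, max_depth=3, learning_rate=0.1)

# Repeated k-fold: 5 folds repeated 3 times gives 15 ROC AUC estimates.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"ROC AUC: {auc_scores.mean():.3f} +/- {auc_scores.std():.3f}")

# Fit on all data and list the ten highest-scoring features.
model.fit(X, y)
top = np.argsort(model.feature_importances_)[::-1][:10]
for idx in top:
    print(f"feature_{idx}: {model.feature_importances_[idx]:.4f}")
```

With real data, X would be the species, pathway or enzyme profile table and y the clinical phenotype labels.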
The second script generates the mean ROC curve and AUC, with standard deviation, across the k folds. This shows the variance of the curve when the training set is split into different subsets, that is, how the classifier's predictions are affected by changes in the training data and how much the splits generated by k-fold cross-validation differ from one another.
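One common way to compute a mean ROC curve with its standard deviation is to interpolate each fold's curve onto a shared false-positive-rate grid and then average. The sketch below shows that idea. It is an assumption about the general technique, not a copy of pouchitis_classifiers_plot_mean_ROC.py; the data are synthetic and LogisticRegression is only a stand-in for any classifier with predict_proba (such as XGBClassifier).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import StratifiedKFold

# Synthetic data and a stand-in classifier (both are illustrative assumptions).
X, y = make_classification(n_samples=208, n_features=20, random_state=0)
clf = LogisticRegression(max_iter=1000)

mean_fpr = np.linspace(0, 1, 101)  # common FPR grid shared by all folds
tprs, aucs = [], []

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in folds.split(X, y):
    clf.fit(X[train_idx], y[train_idx])
    scores = clf.predict_proba(X[test_idx])[:, 1]
    fpr, tpr, _ = roc_curve(y[test_idx], scores)
    # Interpolate this fold's TPR onto the common grid.
    interp_tpr = np.interp(mean_fpr, fpr, tpr)
    interp_tpr[0] = 0.0  # force the curve to start at (0, 0)
    tprs.append(interp_tpr)
    aucs.append(auc(fpr, tpr))

mean_tpr = np.mean(tprs, axis=0)
mean_tpr[-1] = 1.0  # force the curve to end at (1, 1)
std_tpr = np.std(tprs, axis=0)
mean_auc = auc(mean_fpr, mean_tpr)
std_auc = np.std(aucs)
print(f"Mean ROC AUC = {mean_auc:.3f} +/- {std_auc:.3f}")
```

To draw the figure, mean_tpr can be plotted against mean_fpr with matplotlib, with the band between mean_tpr - std_tpr and mean_tpr + std_tpr shaded using fill_between.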
The third script tests the model on the validation set (the cohort of patients with the recurrent-acute pouchitis phenotype) and generates performance metrics, including ROC AUC and a confusion matrix.
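For readers unfamiliar with these metrics, the sketch below shows how accuracy, sensitivity and specificity follow from a confusion matrix, and how ROC AUC is computed from predicted probabilities. The labels and probabilities are toy values made up for illustration, not data from the validation cohort, and the encoding (1 = pouchitis, 0 = normal pouch) and the 0.5 threshold are assumptions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Toy values for illustration only. Assumed encoding: 1 = pouchitis, 0 = normal pouch.
y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.3, 0.7, 0.4, 0.2, 0.1])
y_pred = (y_prob >= 0.5).astype(int)  # assumed decision threshold of 0.5

# With labels=[0, 1], ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate
roc_auc = roc_auc_score(y_true, y_prob)

print(f"accuracy={accuracy:.3f} sensitivity={sensitivity:.3f} "
      f"specificity={specificity:.3f} ROC AUC={roc_auc:.3f}")
```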
The best way to run the code is to run the three provided scripts sequentially, in the order listed above.
There are several ways to tune the hyperparameters of a model. In the manuscript, the hyperparameters were tuned empirically using grid search. In the code, the model is set up with the hyperparameters that gave the highest ROC AUC scores for the metagenomic dataset used in the analysis. Feel free to experiment with the parameters; different combinations may give different results even for the same dataset. For more information about xgboost tunable parameters, see https://xgboost.readthedocs.io/en/latest/parameter.html#general-parameters
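If you want to run your own grid search, one possible approach is scikit-learn's GridSearchCV scored by ROC AUC, sketched below. This is an assumption about how such a search could be set up, not the procedure used in the manuscript. The grid values and the synthetic data are placeholders, and the sketch falls back to scikit-learn's GradientBoostingClassifier if xgboost is not installed.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold

try:
    from xgboost import XGBClassifier as Model
except ImportError:
    # Stand-in so the sketch still runs without xgboost installed.
    from sklearn.ensemble import GradientBoostingClassifier as Model

# Synthetic data; replace with your own feature table and labels.
X, y = make_classification(n_samples=120, n_features=20, random_state=0)

# Placeholder grid, NOT the grid or the final values from the manuscript.
param_grid = {
    "max_depth": [2, 4],
    "learning_rate": [0.05, 0.3],
    "n_estimators": [50, 100],
}

search = GridSearchCV(
    Model(),
    param_grid,
    scoring="roc_auc",  # pick the combination with the highest ROC AUC
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The grid is kept deliberately small here; the number of model fits grows with the product of the list lengths times the number of folds.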
Dubinsky, V., Reshef, L., Rabinowitz, K., Yadgar, K., Godny, L., Zonensain, K., Wasserberg, N., Dotan, I. and Gophna, U., 2021. Dysbiosis in metabolic genes of the gut microbiomes of patients with an ileo-anal pouch resembles that observed in Crohn's disease. mSystems, 6(2).