Comparison of molecular representations for classifying the activity of molecules (given in SMILES notation) in drug discovery experiments, and a search for the optimal combination of physical and structural descriptors.
Descriptors:
- MACCS (maccs)
- morgan/ECFP (morgan)
- RDKit (rdkit)
- mordred (mordred)
- spectrophore (spectrophore)
Models:
- KNN (knn)
- Logistic regression (lr)
- RandomForestClassifier (rf)
- SVC (svc)
- XGBClassifier (xgb)
- Isolation Forest (if)
- FCNN (fcnn)
- MLP (mlp, mlp_sklearn)
Splits:
- random
- stratified
- scaffold
- cluster
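To illustrate what a stratified split means, here is a minimal sketch in plain Python. It is not the project's implementation: the function name `stratified_split` and its parameters are hypothetical, and in practice scikit-learn's `train_test_split` with the `stratify` argument is the usual tool. The idea is only that each class keeps roughly the same share in the train and test parts.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.1, seed=0):
    """Return (train_idx, test_idx) so that every class keeps its proportion.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # At least one sample per class goes to the test part.
        n_test = max(1, round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# 80 inactive and 20 active molecules (made-up labels).
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels, test_size=0.1)
```

A random split would shuffle all indices together instead of per class. Scaffold and cluster splits group molecules by structure first, which needs a cheminformatics library and is not shown here.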
- workbench for tables of experiments with large number of easily accessible parameters and hyperparameters
- featurization, effective work with processed datasets, feature combinations
- feature importance
- feature selection
- Seq2Seq for transfer learning
- Results
- Install
- Usage
- Input
- Output
- Datasets
- Data config
- Model config
- Single experiment
- Experiments table
- Utilities
- Citation
- It is a good idea to use: git clone https://github.com/DentonJC/virtual_screening.git --depth=1
- Linux
- Python 3.6+ (Python 2.7 unstable)
- source env.sh
- It is better to use the Theano backend.
- sh setup.sh
or
- conda install --file requirements
- conda install -c conda-forge xgboost
- conda install -c openbabel openbabel
- conda install -c rdkit rdkit
- conda install -c mordred-descriptor mordred
- Python3: pip install configparser
- Python2: pip install ConfigParser
- pip install argparse
- pip install git+git://github.com/DentonJC/virtual_screening
or
- Packages from requirements
- pip install xgboost
- pip install mordred
- Python3: pip install configparser
- Python2: pip install ConfigParser
- pip install argparse
usage: model data section [-h] [--select_model SELECT_MODEL]
                          [--data_config DATA_CONFIG] [--section SECTION]
                          [--load_model LOAD_MODEL]
                          [--descriptors DESCRIPTORS] [--output OUTPUT]
                          [--model_config MODEL_CONFIG] [--n_bits N_BITS]
                          [--n_cv N_CV] [--n_iter N_ITER] [--n_jobs N_JOBS]
                          [--patience PATIENCE] [--gridsearch]
                          [--metric {accuracy,roc_auc,f1,matthews}]
                          [--split_type {stratified,scaffold,random,cluster}]
                          [--split_size SPLIT_SIZE] [--targets TARGETS]
                          [--experiments_file EXPERIMENTS_FILE]

optional arguments:
  -h, --help            show this help message and exit
  --select_model SELECT_MODEL
                        name of the model, select from list in README
  --data_config DATA_CONFIG
                        path to dataset config file
  --section SECTION     name of section in model config file
  --load_model LOAD_MODEL
                        path to model .sav
  --descriptors DESCRIPTORS
                        descriptor of molecules
  --output OUTPUT       path to output directory
  --model_config MODEL_CONFIG
                        path to config file
  --n_bits N_BITS       number of bits in Morgan fingerprint
  --n_cv N_CV           number of splits in RandomizedSearchCV
  --n_iter N_ITER       number of iterations in RandomizedSearchCV
  --n_jobs N_JOBS       number of jobs
  --patience PATIENCE, -p PATIENCE
                        patience of fit
  --gridsearch, -g      use gridsearch
  --metric {accuracy,roc_auc,f1,matthews}
                        metric for RandomizedSearchCV
  --split_type {stratified,scaffold,random,cluster}
                        type of train-test split
  --split_size SPLIT_SIZE
                        size of test and valid splits
  --targets TARGETS, -t TARGETS
                        set number of target column
  --experiments_file EXPERIMENTS_FILE, -e EXPERIMENTS_FILE
                        where to write results of experiments
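The options above can be mirrored with a small `argparse` sketch. This is an illustration, not the project's parser: it covers only a subset of the flags, the default values are assumptions, and the helper `parse_descriptors` is hypothetical. It shows one way a descriptor list passed as a single quoted string could be turned into a Python list.

```python
import argparse
import ast


def parse_descriptors(text):
    """Turn the string "['rdkit', 'maccs']" into the list ['rdkit', 'maccs'].

    Hypothetical helper; a bare name such as 'morgan' is also accepted.
    """
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        value = text  # treat an unquoted bare word as a single descriptor
    if isinstance(value, str):
        value = [value]
    if not isinstance(value, (list, tuple)) or not all(isinstance(v, str) for v in value):
        raise argparse.ArgumentTypeError("expected a list of descriptor names")
    return list(value)


def build_parser():
    # Subset of the documented options; all defaults here are assumptions.
    parser = argparse.ArgumentParser(description="sketch of the command-line options")
    parser.add_argument("--descriptors", type=parse_descriptors, default=["morgan"])
    parser.add_argument("--n_bits", type=int, default=2048)
    parser.add_argument("--metric", default="accuracy",
                        choices=["accuracy", "roc_auc", "f1", "matthews"])
    parser.add_argument("--split_type", default="stratified",
                        choices=["stratified", "scaffold", "random", "cluster"])
    parser.add_argument("--split_size", type=float, default=0.1)
    parser.add_argument("--gridsearch", "-g", action="store_true")
    return parser


args = build_parser().parse_args(
    ["--descriptors", "['rdkit', 'maccs']", "--metric", "roc_auc", "-g"]
)
```

Passing the list as one quoted argument matters on the command line: without quotes the shell splits it at the spaces before the parser ever sees it.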
- Create or use a script from /moloi/bin/
- Run script.py with Python
Attention! Nested parallelization!
- Default set:
- run.py: n_jobs = 1
- experiments_table.csv: n_jobs = -1
- Only for evaluation:
- run.py: n_jobs = -1
- experiments_table.csv: n_jobs = 1
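The warning about nested parallelization can be made concrete with a little arithmetic. The sketch below assumes the common joblib convention that `n_jobs = -1` means "use all CPUs"; the function names are hypothetical. When two levels both run in parallel, the number of simultaneous workers is roughly the product of the two settings.

```python
def resolve_n_jobs(n_jobs, n_cpus):
    """Translate an n_jobs setting into a worker count (-1 means all CPUs)."""
    if n_jobs == -1:
        return n_cpus
    if n_jobs < 1:
        raise ValueError("this sketch only handles -1 and positive values")
    return n_jobs


def total_workers(outer_n_jobs, inner_n_jobs, n_cpus):
    """Workers that may run at once when two parallel levels are nested."""
    return resolve_n_jobs(outer_n_jobs, n_cpus) * resolve_n_jobs(inner_n_jobs, n_cpus)


# On an 8-core machine (made-up number):
both_parallel = total_workers(-1, -1, n_cpus=8)  # both levels parallel
one_parallel = total_workers(1, -1, n_cpus=8)    # only one level parallel
```

With both levels set to -1 the machine is oversubscribed many times over, which is why the settings above keep one of the two levels at 1.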
- RDKit and Mordred descriptors cannot be computed for some molecules. If you want to use them in later experiments, run the first experiment with RDKit and Mordred descriptors so that the failed molecules are excluded from the dataset and from the other descriptors.
- Fill in the table with the experiment parameters (examples in /etc, False = empty cell). Save the table as UTF-8.
- Run run.py with Python.
- Experiments are performed line by line, using the parameters from the filled columns and writing the output to the result columns.
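The line-by-line processing of the table can be sketched as follows. The column names, the choice of which columns are result columns, and the helper `row_to_args` are all assumptions made for illustration; the real table layout is shown in the examples in /etc. The sketch only demonstrates the rule that an empty cell means False (the option is left out).

```python
import csv
import io

# Made-up table: parameter columns first, one (empty) result column last.
TABLE = """select_model,split_type,n_bits,gridsearch,accuracy_test
rf,scaffold,2048,True,
lr,random,,,
"""

RESULT_COLUMNS = {"accuracy_test"}  # filled in after the run (assumption)
FLAG_COLUMNS = {"gridsearch"}       # options without a value (assumption)


def row_to_args(row):
    """Convert one table row into a list of command-line arguments."""
    args = []
    for column, cell in row.items():
        cell = (cell or "").strip()
        if column in RESULT_COLUMNS or not cell:
            continue  # empty cell = False: the option is omitted
        if column in FLAG_COLUMNS:
            if cell.lower() == "true":
                args.append("--" + column)
            continue
        args.extend(["--" + column, cell])
    return args


rows = list(csv.DictReader(io.StringIO(TABLE)))
commands = [row_to_args(row) for row in rows]
```

Each entry of `commands` could then be handed to the experiment script, one row at a time.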
python moloi/moloi.py --model_config '/data/model_configs/configs.ini' --descriptors "['rdkit', 'morgan', 'mordred', 'maccs']" --n_bits 2048 --n_cv 5 -p 100 -g --n_iter 300 --metric 'roc_auc' --split_type 'scaffold' --split_size 0.1 --select_model 'rf' --data_config '/data/data_configs/bace.ini' --section 'RF' -e 'etc/experiments_bace.csv' -t 0
Script adderss: run.py
Descriptors: ['rdkit', 'morgan', 'mordred', 'maccs']
n_bits: 2048
Config file: /data/model_configs/configs.ini
Section: RF
Grid search
Load train data
Load test data
Load val data
Data loaded
x_train shape: (1207, 4239)
x_test shape: (152, 4239)
x_val shape: (154, 4239)
y_train shape: (1207, 1)
y_test shape: (152, 1)
y_val shape: (154, 1)
GRID SEARCH
GRIDSEARCH FIT
MODEL FIT
EVALUATE
Accuracy test: 70.39%
0:07:37.644208
Creating report
Report complete, you can see it in the results folder
Results path: /tmp/2018-05-27_15:45:04_RF_['rdkit','morgan','mordred','maccs']70.395/
Done
After the first experiment is run, the /tmp folder is created with one subfolder per experiment. Each experiment folder contains:
- models/: copies of the folder with models
- run.py: copy of the experiment script
- results/: folder with model checkpoints (if Keras model)
- log: log of the experiment
- model.sav: the model
- addresses: text file with the path to the model; its content can be used to load the model from the experiments table
- n_cv: a text file with cross-validation indices
- gridsearch.csv: history of gridsearch (if gridsearch)
- y_pred_test.csv, y_pred_val.csv: predicted test and validation values
- img/: ROC AUC plot (if possible)
- img/: feature importance plot
- img/: gridsearch plots (one picture for each hyperparameter)
- img/: result plots: 3 t-SNE plots showing correctly and incorrectly classified points (for all classes, for the positive class, and for the negative class)
- report 70.39.pdf (accuracy in name): report with information about the experiment
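A saved model.sav can be reused later, for example through --load_model. The sketch below assumes the file was written with Python's `pickle` module; the README does not state the serialization format, so treat this as an assumption and check how the project saves models. A plain dictionary stands in for a trained model, and the function names are hypothetical.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Write a model object to disk (assumes pickle serialization)."""
    with open(path, "wb") as handle:
        pickle.dump(model, handle)


def load_model(path):
    """Read a model object back. Only unpickle files you trust."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Stand-in for a trained model; a real one would be e.g. a fitted classifier.
stand_in_model = {"name": "rf", "n_estimators": 100}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "model.sav")
    save_model(stand_in_model, path)
    restored = load_model(path)
```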
- Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
- Stéfan van der Walt, S. Chris Colbert and Gaël Varoquaux. The NumPy Array: A Structure for Efficient Numerical Computation, Computing in Science & Engineering, 13, 22-30 (2011), DOI:10.1109/MCSE.2011.37
- Travis E. Oliphant. Python for Scientific Computing, Computing in Science & Engineering, 9, 10-20 (2007), DOI:10.1109/MCSE.2007.58
- K. Jarrod Millman and Michael Aivazis. Python for Scientists and Engineers, Computing in Science & Engineering, 13, 9-12 (2011), DOI:10.1109/MCSE.2011.36
- Fernando Pérez and Brian E. Granger. IPython: A System for Interactive Scientific Computing, Computing in Science & Engineering, 9, 21-29 (2007), DOI:10.1109/MCSE.2007.53
- John D. Hunter. Matplotlib: A 2D Graphics Environment, Computing in Science & Engineering, 9, 90-95 (2007), DOI:10.1109/MCSE.2007.55
- Wes McKinney. Data Structures for Statistical Computing in Python, Proceedings of the 9th Python in Science Conference, 51-56 (2010)
- O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.
- https://github.com/gmum/ananas/blob/master/fingerprints/_desc_rdkit.py
- RDKit: Open-source cheminformatics; http://www.rdkit.org
- Keras (2015), Chollet et al., https://github.com/fchollet/keras