The provided code snippet defines several functions used to download and process data from the ChEMBL database. Here's a breakdown of the process:
1. Function Definitions:
target_retrieval
: This function retrieves target information from ChEMBL based on a search query. It utilizes thechembl_webresource_client
library to interact with the ChEMBL web services.target_info_retrieval
: This function retrieves activity information for a specific target and assay type. It filters the results based on the provided target ID and assay type.clean
: This function cleans the downloaded activity data. It removes rows with missing values in specific columns and removes duplicates based on the unique SMILES (Simplified Molecular Input Line Entry System) string, a chemical structure representation.norm_value
: This function normalizes the standard activity values in the data. It handles potential errors like non-numeric values and caps the values at a specific limit.Negative_log
: This function calculates the negative log of the normalized activity values, which is a common way to convert potency values into a more usable format (pIC50).retriving_downloading_cleaning_preprocessing
: This function is the core function that retrieves, downloads, cleans, and preprocesses data for a given target and assay type. It performs the following steps:- Checks if a directory structure already exists for the target data.
- If the desired preprocessed data file (pIC50) doesn't exist, it performs the following sub-steps:
- Retrieves target information using
target_retrieval
. - Downloads assay information using
target_info_retrieval
. - Saves the raw assay data to a CSV file.
- Cleans the data using the
clean
function and saves the cleaned data to a CSV file. - Normalizes the data using the
norm_value
function and saves the normalized data to a CSV file. - Calculates the potency (pIC50) using the
Negative_log
function and saves the final preprocessed data (pIC50) to a CSV file.
- Retrieves target information using
- If the preprocessed data file already exists, it reads the data and description from the existing files.
download_multiple_targets
: This function downloads and processes data for a list of target IDs using theretriving_downloading_cleaning_preprocessing
function. It creates a summary table containing information about each processed target, including its ID, description, raw data length, and cleaned data length. Finally, it saves the summary table to a CSV file.
2. Overall Process:
The code defines helper functions for specific tasks like data retrieval, cleaning, and normalization. The retriving_downloading_cleaning_preprocessing
function serves as the workhorse, handling the entire data download and preprocessing workflow for a single target. Finally, the download_multiple_targets
function automates the process for a list of targets.
1. Importing Libraries:
pandas (pd)
: for data manipulationnumpy (np)
: for numerical computationsrdkit
: a cheminformatics library for working with moleculesword2vec
: (optional) for word embedding (if used)
2. Fingerprint Generation Functions:
generate_morgan_fingerprint
: Takes a molecule SMILES string and generates a Morgan fingerprint, a binary representation of the molecule's structure.generate_maccs_keys
: Takes a molecule SMILES string and generates MACCS keys, another fingerprint type.
3. Preprocessing Functions (with fingerprinting):
preprocessing_mf
: Applies thegenerate_morgan_fingerprint
function to a DataFrame containing molecule SMILES. It removes rows with missing SMILES and returns a NumPy array of fingerprints.preprocessing_mk
: Similar topreprocessing_mf
but uses thegenerate_maccs_keys
function for fingerprint generation.
4. Word Embedding Function (optional):
- This section includes commented-out functions for generating word embeddings from SMILES strings. It appears to use a pre-trained word2vec model but might not be currently active.
5. ADME Class:
- This class defines functions to calculate various properties of a molecule relevant to drug discovery, including:
- Lipinski's rule of five
- Egan druglikeness
- Ghose druglikeness
- Muegge filter
- Veber filter
- Brenk filter (checks for potentially problematic substructures)
- PAINS filter (checks for promiscuous compounds)
- Additionally, it provides functions to calculate several molecular descriptors like:
- Topological polar surface area (TPSA)
- LogP (partition coefficient)
- Molecular weight
- Molar refractivity
- Number of atoms, rings, carbons, heteroatoms, and rotatable bonds
- Hydrogen bond donors and acceptors
6. Molecular Descriptor Generation Function:
generate_molecular_descriptors
: Takes a molecule SMILES string, calculates various descriptors using the ADME class, and returns a NumPy array containing these values.
7. Preprocessing Function (with descriptors):
preprocessing_md
: Applies thegenerate_molecular_descriptors
function to a DataFrame containing molecule SMILES. It removes rows with missing SMILES and returns a NumPy array of the calculated descriptors.
The provided code snippet consists of several functions designed for training and evaluating machine learning models for predicting pIC50 values (potency) of chemical compounds. Here's a breakdown of the key parts:
1. Data Splitting and Cleaning:
splitting_and_post_cleaning
: This function splits the data into training and testing sets (80%/20% by default) and handles potential issues like infinite, negative infinite, or NaN values in the pIC50 labels. It replaces these problematic values with 0 and clips the remaining values to a maximum value.
2. Model Training and Evaluation:
train_and_evaluate_model
: This function trains a provided machine learning model on the training data and then evaluates its performance on the testing data. It calculates the R-squared (r2), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE) as evaluation metrics.train_and_evaluate_model_kfold
: This function performs K-fold cross-validation for a more robust evaluation. It combines the training and testing sets, creates folds using KFold, and trains the model on each fold while evaluating on the remaining folds. It returns the average r2, RMSE, and MAE across all folds.
3. Training All Models:
train_and_evaluate_all_models
: This function iterates over a list of chemical datasets (defined elsewhere) and trains various machine learning models (KNN, PLS, SVR, etc.) on each dataset. It allows choosing the chemical representation type (e.g., Morgan Fingerprints (mf), ADME descriptors (md), MACCS keys (mk), or Word2vec embeddings (we)) for the training process. For each model-dataset combination, it attempts training and evaluation using K-fold cross-validation. The results (r2, RMSE, MAE) are stored in dictionaries and later converted to DataFrames. Finally, the DataFrames are saved to CSV files for each evaluation metric (r2, rmse, mae) and chosen chemical representation type.
4. Example Usage:
The provided comments showcase how to call the train_and_evaluate_all_models
function with a list of chemical targets (CHEMBL IDs) and the chemical representation type (in this case, "MACCS"). This would train and evaluate all the defined models using MACCS fingerprint representation for each chemical target in the list.