HEAL2 (Hierarchical Estimate for Agnostic Learning) is a deep learning pipeline for analyzing rare genetic variants. It combines a graph neural network (GNN) with an attention-readout mechanism and a sparse autoencoder (SAE) for interpretability. It is tailored to binary phenotype classification in large-scale whole-genome sequencing (WGS) studies and supports model training, evaluation, and gene/feature prioritization.
- Python 3
- PyTorch
- DGL
- pandas
- NumPy
- scikit-learn
- Clone the repository
- Create conda environment using the provided file:
conda env create -f environment.yml
- Activate the environment:
conda activate heal2
- Processed mutational burden per sample and gene
- Population data file (for population filtering)
- Phenotype file
- Gene-gene interaction (GGI) data
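The four inputs above have to agree on sample and gene identifiers before any model can be trained. The following is a minimal sketch of that alignment step using pandas. The column names, sample IDs, gene names, and the `align_inputs` helper are hypothetical and chosen only for illustration; the actual file formats expected by the HEAL2 scripts may differ.

```python
import pandas as pd

# Hypothetical toy inputs; real HEAL2 file layouts and column names may differ.
burden = pd.DataFrame(
    {"GENE_A": [0, 2, 1, 0], "GENE_B": [1, 0, 0, 3]},
    index=["s1", "s2", "s3", "s4"],
)  # rows: samples, columns: genes, values: mutational burden
phenotype = pd.Series({"s1": 1, "s2": 0, "s3": 1, "s5": 0}, name="label")
population = pd.Series({"s1": "EUR", "s2": "EUR", "s3": "AFR", "s4": "EUR"}, name="pop")
ggi = pd.DataFrame({"gene1": ["GENE_A", "GENE_A"], "gene2": ["GENE_B", "GENE_C"]})


def align_inputs(burden, phenotype, population, keep_population=None):
    """Keep samples present in all inputs, optionally restricted to one population."""
    shared = burden.index.intersection(phenotype.index).intersection(population.index)
    if keep_population is not None:
        mask = (population.loc[shared] == keep_population).to_numpy()
        shared = shared[mask]
    return burden.loc[shared], phenotype.loc[shared]


X, y = align_inputs(burden, phenotype, population, keep_population="EUR")

# Keep only interactions whose two genes both appear in the burden matrix.
edges = ggi[ggi["gene1"].isin(X.columns) & ggi["gene2"].isin(X.columns)]
```

In this toy example only samples `s1` and `s2` survive, because `s3` belongs to a different population, `s4` has no phenotype, and `s5` has no burden data.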
The pipeline consists of two main components:
- Linear Model (HEAL)
- Graph Neural Network Model (HEAL2)
python scripts/HEAL.py \
--data_path <path_to_data> \
--dataset <dataset_name> \
[--af <allele_frequency>] \
[--covariates <covariates_file>] \
[--logo] \
[--stratified_kfold] \
[--output <output_directory>]
python scripts/HEAL2.py \
--data_path <path_to_data> \
--dataset <dataset_name> \
[--af <allele_frequency>] \
[--covariates <covariates_file>] \
[--logo] \
[--stratified_kfold] \
[--output <output_directory>]
python scripts/HEAL2_attention.py \
--data_path <path_to_data> \
--dataset <dataset_name> \
[--af <allele_frequency>] \
[--covariates <covariates_file>] \
[--output <output_directory>]
scripts/HEAL.py:
- Implements the linear model baseline
- Supports several cross-validation strategies (see --logo and --stratified_kfold)
- Includes feature importance analysis
scripts/HEAL2.py:
- Implements the graph neural network model
- Supports several training configurations
- Reports evaluation metrics such as AUROC and AUPRC
scripts/HEAL2_attention.py:
- Specialized script for attention score analysis; runs on the full cohort
- Focuses on model interpretability and gene attention weights
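To make the idea of gene attention weights concrete, here is a minimal NumPy sketch of a softmax attention readout: each gene node receives a score, the scores are normalized across genes, and the sample embedding is the attention-weighted sum of node embeddings. This illustrates the general technique only and is not the HEAL2 implementation; the `attention_readout` function, the single linear gate, the gene names, and the dimensions are all assumptions.

```python
import numpy as np


def attention_readout(node_embeddings, gate_weights):
    """Pool per-gene embeddings into one sample embedding with softmax attention.

    node_embeddings: array of shape (n_genes, hidden_dim)
    gate_weights:    array of shape (hidden_dim,), a hypothetical linear gate
    """
    scores = node_embeddings @ gate_weights            # one raw score per gene
    scores = scores - scores.max()                     # stabilize the softmax
    attention = np.exp(scores) / np.exp(scores).sum()  # weights sum to 1 over genes
    pooled = attention @ node_embeddings               # weighted sum, (hidden_dim,)
    return pooled, attention


rng = np.random.default_rng(0)
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]  # hypothetical gene names
h = rng.normal(size=(len(genes), 8))              # stand-in for GNN node embeddings
w = rng.normal(size=8)                            # stand-in for learned gate weights

pooled, attention = attention_readout(h, w)

# Rank genes from highest to lowest attention weight.
ranking = [genes[i] for i in np.argsort(-attention)]
```

Because the weights are normalized per sample, they can be averaged across a cohort to obtain a gene-level ranking, which is one common way attention scores are used for prioritization.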
Both HEAL and HEAL2 models support:
- Leave-one-group-out (LOGO) validation
- Stratified k-fold cross-validation
- Covariate inclusion
- Multiple evaluation metrics (AUROC, AUPRC)
- Feature importance analysis
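The two validation schemes and both metrics listed above are available in scikit-learn. The sketch below shows how they fit together on synthetic data. It is an illustration under stated assumptions, not the code in `scripts/HEAL.py`: the logistic regression stands in for the linear baseline, and the data, the group labels, and the `cross_validate` helper are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut, StratifiedKFold

# Synthetic burden matrix: 120 samples x 10 genes (illustrative only).
rng = np.random.default_rng(0)
n_samples, n_genes = 120, 10
X = rng.poisson(1.0, size=(n_samples, n_genes)).astype(float)
logits = 0.8 * X[:, 0] + 0.8 * X[:, 1] - 1.6 + rng.normal(0, 0.5, n_samples)
y = (logits > 0).astype(int)
groups = np.repeat(np.arange(4), n_samples // 4)  # e.g. four hypothetical cohorts


def cross_validate(X, y, splitter, groups=None):
    """Return one (AUROC, AUPRC) row per held-out fold."""
    scores = []
    for train_idx, test_idx in splitter.split(X, y, groups):
        model = LogisticRegression(max_iter=1000)
        model.fit(X[train_idx], y[train_idx])
        prob = model.predict_proba(X[test_idx])[:, 1]
        scores.append(
            (roc_auc_score(y[test_idx], prob),
             average_precision_score(y[test_idx], prob))
        )
    return np.array(scores)


# Leave-one-group-out: each cohort is held out once.
logo_scores = cross_validate(X, y, LeaveOneGroupOut(), groups)

# Stratified k-fold: each fold preserves the case/control ratio.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
skf_scores = cross_validate(X, y, skf)
```

LOGO is the stricter test when groups correspond to cohorts or sequencing batches, because the model is always evaluated on a group it has never seen. Stratified k-fold is the usual choice when no such grouping exists.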
The pipeline generates several output files:
- Model predictions
- Performance metrics
- Attention scores (HEAL2)
- Feature importance scores
- Validation results
- All scripts support extensive command-line arguments for customization
- Use --help with any script for detailed parameter information
- GNN analysis requires additional gene-gene interaction data