11/11/2022 - This package is still being actively updated! For more details, please check out our recent paper: A comparative atlas of single-cell chromatin accessibility in the human brain. Science 382(6667):eadf7044 (2023)
To further understand how risk variants contribute to the function of regulatory elements, we used deep learning (DL) models to predict chromatin accessibility from DNA sequences. Our model was inspired by Enformer, which adapts the attention-based Transformer to better capture syntactic forms (e.g., order and combination of words in a sentence) in order to predict gene expression.
Here, our model uses a similar framework to predict epigenomic signals such as chromatin accessibility. Called Epiformer, our model takes (1) one-hot-encoded DNA sequences (A = [1, 0, 0, 0], C = [0, 1, 0, 0], G = [0, 0, 1, 0], T = [0, 0, 0, 1]) and (2) conservation scores (phastCons, ranging from 0 to 1) calculated from multiple-species alignments as inputs, and predicts chromatin accessibility over a ~100 kb genomic region at 128 bp resolution as output.
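As a quick illustration of this input encoding, here is a minimal sketch. The helper names are ours, and whether the conservation track is concatenated as a fifth input channel is our assumption, not a confirmed detail of the package:

import numpy as np

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_encode(seq):
    # Encode a DNA sequence as a (len(seq), 4) one-hot matrix;
    # ambiguous bases (e.g., N) become all-zero rows.
    onehot = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = BASE_INDEX.get(base)
        if j is not None:
            onehot[i, j] = 1.0
    return onehot

seq = "ACGTN"
x = one_hot_encode(seq)  # shape (5, 4)
cons = np.array([0.1, 0.9, 0.8, 0.2, 0.0], dtype=np.float32)  # per-base phastCons, in [0, 1]
inputs = np.concatenate([x, cons[:, None]], axis=1)  # (5, 5): 4 one-hot channels + 1 conservation channel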
Epiformer is built on the open-source machine learning framework PyTorch and consists of three major parts:
- A Conv Tower of 6 residual convolutional blocks with max pooling. We used a kernel size of (4, 15) for the first residual convolutional block and a kernel size of (4, 5), with padding, for the remaining blocks; these two kernel sizes allow us to extract informative sequence features at lower and higher resolution, respectively. Each convolutional layer is interleaved with batch normalization and a Gaussian Error Linear Unit (GELU) activation. The convolutional blocks with pooling reduce the spatial dimension from 98,304 bp to 768 positions, so that each position vector represents a 128 bp bin (a minimal sketch of one residual block follows this list).
- Transformer blocks with 4 multi-head attention layers follow the Conv Tower and capture long-range combinations and orders of sequence features. To inject positional information, we add relative positional encodings. We used a dropout rate of 0.4 in the attention layers to avoid potential overfitting. In particular, we used SwiGLU, a GLU variant, as the activation function.
- A pointwise convolution block aggregates the outputs of the last transformer layer and produces the predicted signals. Here we used GELU and SwiGLU activation functions with a dropout rate of 0.1.
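Below is a minimal sketch of one residual convolutional block with pooling, in the spirit of the Conv Tower described above. The package applies (4, k) kernels over the one-hot axis; for brevity the sketch uses an equivalent Conv1d formulation, and the class name and exact wiring are illustrative rather than the package's actual implementation:

import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    # One pre-activation residual block: BatchNorm -> GELU -> Conv1d,
    # plus a skip connection, followed by max pooling.
    def __init__(self, in_ch, out_ch, kernel_size=5):
        super().__init__()
        self.conv = nn.Sequential(
            nn.BatchNorm1d(in_ch),
            nn.GELU(),
            nn.Conv1d(in_ch, out_ch, kernel_size, padding=kernel_size // 2),
        )
        # 1x1 conv matches channel counts so the skip connection can be added
        self.skip = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()
        self.pool = nn.MaxPool1d(2)  # halves the spatial dimension

    def forward(self, x):
        return self.pool(self.conv(x) + self.skip(x))

# Each pooled block halves the sequence length. Note that 6 halvings take
# 98,304 down to 1,536, so reaching 768 bins of 128 bp implies one additional
# pooling step (e.g., in a stem layer, as in Enformer) - a detail we have not
# verified here.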
conda create -n epiformer python==3.9
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
conda install -c bioconda bedtools
pip install -r requirements.txt
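After installation, a quick check (illustrative) that the CUDA-enabled PyTorch build can see the GPUs:

import torch

print(torch.__version__)          # expect 1.11.0
print(torch.cuda.is_available())  # True if the CUDA 11.3 runtime and driver are set up
print(torch.cuda.device_count())  # number of visible GPUs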
- Create a few directories to hold the various datasets, models, and predictions.
- Download the human genome (hg38) in FASTA format, chromosome sizes, ENCODE blacklist regions, and conservation score tracks from multiple-species alignments (e.g., phastCons). Keep these files under the directory ./genome.
- Generate ATAC-seq signal tracks using software such as deepTools, and put the files under the directory ./targets.
- Optionally, provide peak calls under the directory ./targets.
- Generate one tab-delimited text file targets.txt listing the path to every target file, such as:
index identifier file clip scale sum_stat description
0 ASCT ASCT.norm_cov.bw 128 1 mean ASCT
1 OPC OPC.norm_cov.bw 128 1 mean OPC
2 OGC OGC.norm_cov.bw 128 1 mean OGC
3 MGC MGC.norm_cov.bw 128 1 mean MGC
4 VIP VIP.norm_cov.bw 128 1 mean VIP
5 LAMP5 LAMP5.norm_cov.bw 128 1 mean LAMP5
...
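Our reading of the columns: file points to a normalized coverage bigWig, sum_stat mean averages the signal within each 128 bp bin, and the result is multiplied by scale and then clipped at clip. A minimal sketch of that interpretation using pyBigWig (not verified against generate_datasets.py):

import numpy as np
import pyBigWig

# Mean bigWig signal per 128 bp bin, scaled then clipped, per our reading of
# one targets.txt row; the actual logic lives in generate_datasets.py.
bw = pyBigWig.open("targets/MGC.norm_cov.bw")
chrom, start, seq_len, bin_size = "chr2", 127060000, 98304, 128
n_bins = seq_len // bin_size  # 768

vals = bw.stats(chrom, start, start + seq_len, type="mean", nBins=n_bins)
vals = np.array([v if v is not None else 0.0 for v in vals], dtype=np.float32)
vals = np.clip(vals * 1.0, 0, 128)  # scale = 1, clip = 128, as in the table above
bw.close()

With targets.txt in place, generate the training datasets: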
python bin/generate_datasets.py \
--target targets/human.targets.txt \
--seqcons genome/human.phastCons.bw \
--fa genome/human.fa \
--gsize genome/human.chrom.sizes.lite \
--blacklist genome/human.blacklist.bed.gz \
--seqlen 98304 \
--nchunk 5000 \
--binsize 128 \
-o datasets/human.mean100kSeg128.multitype.phastCons
Before training, please modify the hyperparameters in the script train_signals.py. The current models were trained on 4 NVIDIA GeForce RTX 3090 (24GB) GPUs.
#--------------------------#
# Training hyperparameters #
num_epoches = 100
learning_rate = 1e-4
min_lr = 1e-6
batch_size = 8 # using 4 NVIDIA GeForce RTX 3090 GPUs
warmup_epoches = 10
max_iter_epoches = 100
accum_iter = 8 # gradient accumulation
#-----------------------#
# Model hyperparameters #
seq_len = 98304
bin_size = 128
num_cnn_layer = 6
max_len = 384 * 2
add_positional_encoding = True
embed_size = 384 * 2
num_heads = 4
att_dropout = 0.4
dim_feedforward = 2048
enc_dropout = 0.4
batch_first = True
num_encoder_layers = 8
crop_size = 64
ptw_dropout = 0.1
multiout_dim = 1 # 20
#recycle_count = 3
if_augment = True # data augmentation
if_cons = True # to use conservation score or not
penalty = 0 # give penalty to loss function or not
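The warmup_epoches, min_lr, and accum_iter settings imply learning-rate warmup followed by decay, plus gradient accumulation. A minimal sketch of how these could interact, assuming cosine decay (check train_signals.py for the actual schedule):

import math

def lr_at_epoch(epoch, warmup_epoches=10, max_iter_epoches=100,
                learning_rate=1e-4, min_lr=1e-6):
    # Linear warmup to learning_rate, then cosine decay to min_lr.
    # Cosine decay is our assumption; the real schedule is in train_signals.py.
    if epoch < warmup_epoches:
        return learning_rate * (epoch + 1) / warmup_epoches
    t = (epoch - warmup_epoches) / (max_iter_epoches - warmup_epoches)
    return min_lr + 0.5 * (learning_rate - min_lr) * (1 + math.cos(math.pi * t))

# Gradient accumulation: with batch_size = 8 and accum_iter = 8, gradients are
# accumulated over 8 micro-batches before each optimizer step, for an
# effective batch size of 64:
#   loss = criterion(model(x), y) / accum_iter
#   loss.backward()
#   if (step + 1) % accum_iter == 0:
#       optimizer.step(); optimizer.zero_grad()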
Here, we demonstrate the model training for microglia (MGC).
celltype_list=("ASCT" "OPC" "OGC" "MGC" "VIP" "LAMP5" "PVALB" "SST" "MSN" "FOXP2" "ITL23" "ITL4" "ITL5" "ITL6" "ITL6_2" "CT" "L6B" "NP")
i=3
typename=${celltype_list[${i}]}
path2dataset=datasets/
outprfx=model/${typename}
CUDA_VISIBLE_DEVICES=0,1,2,3 python bin/train_signals.py \
-i $path2dataset \
--index ${i} \
-o $outprfx &> $outprfx.train.log
Three models will be saved from checkpoints: one with the best correlation, one with the best loss, and one from the last epoch.
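To inspect a saved checkpoint, a sketch assuming the .pth.tar files are standard torch.save archives (the exact keys depend on train_signals.py):

import torch

# Load a checkpoint on CPU for inspection; the key names inside the archive
# are determined by train_signals.py and are not guaranteed by this sketch.
ckpt = torch.load("model/MGC.model-1663637197.best_loss.pth.tar", map_location="cpu")
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))  # e.g., model state_dict, optimizer state, epoch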
Here, we demonstrate prediction centered at the TSS of the gene BIN1 in microglia (MGC).
celltype_list=("ASCT" "OPC" "OGC" "MGC" "VIP" "LAMP5" "PVALB" "SST" "MSN" "FOXP2" "ITL23" "ITL4" "ITL5" "ITL6" "ITL6_2" "CT" "L6B" "NP")
i=3
typename=${celltype_list[${i}]}
seq_len=98304
bin_size=128
cell_index=${i} # MGC
species="human"
region="chr2:127068747-127150006" # centered at TSS of gene ***BIN1***
path2model=model/MGC.model-1663637197.best_loss.pth.tar
path2pred=pred/MGC
python epiformer/pred_signals.py -m $path2model \
-s $species -r $region \
-d ./ --index $cell_index --seqlen $seq_len --binsize ${bin_size} \
-o $path2pred.${species} &> $path2pred.${species}.pred_signals.log
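For reference, the arithmetic relating the region string to the prediction bins (a sketch; centering the 98,304 bp window on the region midpoint is our assumption about pred_signals.py):

# Parse the region and derive the coordinates of the 768 predicted bins.
region = "chr2:127068747-127150006"  # centered at the TSS of BIN1
seq_len, bin_size = 98304, 128

chrom, coords = region.split(":")
start, end = map(int, coords.split("-"))
mid = (start + end) // 2
win_start = mid - seq_len // 2
n_bins = seq_len // bin_size  # 768
bin_starts = [win_start + i * bin_size for i in range(n_bins)]
print(chrom, win_start, win_start + seq_len, n_bins)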
The models pre-trained in this study are available at http://catlas.org/catlas_downloads/humanbrain/epiformer_data/.
We will keep updating these cell-type-specific models.