Supplementary code for "RaptGen: A variational autoencoder with profile hidden Markov model for generative aptamer discovery"
- Ubuntu == 18.04
- python == 3.7
- pytorch == 1.5.0
- cuda == 10.2
For other requirements, see Pipfile. Also We verified that the codes are runnable in the provided Docker environment (see Dockerfile). Built image is available at natuski/raptgen-gpu
on docker hub. The requirements are installable using pipenv with;
% pipenv install
The install time was about 10 minutes on MacbookPro 2020 Core i5 16G. You may also need to install cairo
library to generate profile hmm image. For mac OS X, it can be installed by brew install cairo && brew install pango
. For Ubuntu, sudo apt-get install -y libcairo2
would work.
All scripts have --help
command that prints the options and the arguments if required. For example,
% python scripts/multiple.py --help
Usage: multiple.py [OPTIONS]
run experiment with multiple motif
Options:
--n-motif INTEGER the number of motifs [default: 10]
--n-seq INTEGER the number of the sequence to generate [default:
10000]
--seed INTEGER seed for seqeunce generation reproduction [default:
0]
--error-rate FLOAT the ratio to modify sequence [default: 0.1]
--epochs INTEGER the number of training epochs [default: 1000]
--threshold INTEGER the number of epochs with no loss update to stop
training [default: 50]
--use-cuda / --no-cuda use cuda if available [default: True]
--cuda-id INTEGER the device id of cuda to run [default: 0]
--save-dir PATH path to save results [default:
out/simlulation/multiple]
--reg-epochs INTEGER the number of epochs to conduct state transition
regularization [default: 50]
--help Show this message and exit. [default: False]
Visualized train logs look like;
% python3 scripts/real.py data/sample/sample.fasta
saving to /Users/niwn/raptgen/out/real
reading fasta format sequence
adapter info not provided. estimating value
estimated forward adapter len is 5 : AAAAA
estimated reverse adapter len is 5 : GGGGG
filtering with : AAAAA(5N)-20N-GGGGG(5N)
experiment name : 20211128_210830338899
# of sequences -> 100
[1] 139 itr 26.2 <-> 26.9 (25.8+ 1.1) of _vae.mdl..: 14%|█ | 13/100 [02:38<16:16, 11s/it]
The last line indicates the training status. The loss, iteration number, estimated time for training, etc., are shown.
[1] 139 itr 26.2 <-> 26.9 (25.8+ 1.1) of _vae.mdl..: 14%|█ | 13/100 [02:38<16:16, 11s/it]
^^^ ^^^^ ^^^^^^^^^^^^^^^^ ^^^^^^^^^^ ^^^ ^^^^^^ ^^^^^^^^^^^ ^^^^^^
(1) (2) (3) (4) (5) (6) (7) (8)
- the number of epochs with no validation loss update.
- train loss
- valid (recon+norm.) loss
- model name
- training progress
- number of iteration
- elapsed time / estimate time of training
- training speed
To run raptgen with your sequence files, you have to run real.py
, which trains the model which encodes sequence into representation vector.
To run the experiment with sequence files, run real.py
;
% python3 scripts/real.py data/sample/sample.fasta
help of real.py
Usage: real.py [OPTIONS] SEQPATH
run experiment with real data
Options:
--epochs INTEGER the number of training epochs [default: 1000]
--threshold INTEGER the number of epochs with no loss update to stop
training [default: 50]
--use-cuda / --no-cuda use cuda if available [default: True]
--cuda-id INTEGER the device id of cuda to run [default: 0]
--save-dir PATH path to save results [default: out/real]
--fwd TEXT forward adapter
--rev TEXT reverse adapter
--min-count INTEGER minimum duplication count to pass sequence for
training [default: 1]
--multi INTEGER the number of training for multiple times [default:
1]
--reg-epochs INTEGER the number of epochs to conduct state transition
regularization [default: 50]
--embed-size INTEGER the number of embedding dimension of raptgen model
[default: 2]
--fast / --normal [experimental] use fast calculation of probability
estimation. Output of the decoder shape is different
and the visualizers are not implemented. [default:
False]
--help Show this message and exit. [default: False]
.fa
, .fasta
, and .fastq
files are automatically processed. The default saving folder is out/simlulation/real
. The runtime depends on the sequence length and number of unique sequences. The output of this procedure is the followings;
- trained model :
[MODEL_NAME].mdl
, such ascnn_phmm_vae.mdl
- model loss transition:
[MODEL_NAME].csv
, such ascnn_phmm_var.csv
To embed the sequence, use encode.py
, which input sequences and trained model and output sequences' representation vector. While the VAE model encodes the sequence into the latent space in the form of distribution, the output representation vector is the center of this distribution.
Run;
% python3 scripts/encode.py \
data/sample/sample.fasta \
results/simulation/multiple/cnn_phmm_vae.mdl \
help of encode.py
Usage: encode.py [OPTIONS] SEQPATH MODELPATH
achieve sequence vector in embedded space.
Options:
--use-cuda / --no-cuda use cuda if available [default: True]
--cuda-id INTEGER the device id of cuda to run [default: 0]
--fwd TEXT forward adapter
--rev TEXT reverse adapter
--save-dir PATH path to save results [default: out/encode]
--fast / --normal [experimental] use fast calculation of probability
estimation. Output of the decoder shape is different
and the visualizers are not implemented. [default:
False]
--help Show this message and exit. [default: False]
This will output sequences' representation vector in the following format;
index,seq,dim1,dim2
0,CGACATGGGCCGCCCAAGGA,0.14,0.08
1,GCGTACCGTAAATCTGTCGG,0.10,0.03
...
The default saving path is out/encode/embed_seq.csv
.
To reconstruct sequence from the latent space, use decode.py
. Given the model parameters and data points, the raptgen model would sample the most probable sequence from the derived profile HMM. Note that the model length has to be explicitly passed to the script to initialize the model.
% python3 scripts/decode.py \
out/encode/embed_seq.csv \
results/simulation/multiple/cnn_phmm_vae.mdl \
20
help of decode.py
Usage: decode.py [OPTIONS] POS_PATH MODEL_PATH TARGET_LEN
achieve sequence vector in embedded space.
Options:
--use-cuda / --no-cuda use cuda if available [default: True]
--cuda-id INTEGER the device id of cuda to run [default: 0]
--save-dir PATH path to save results [default: out/decode]
--embed-dim INTEGER the embedding dimension of raptgen model [default:
2]
--eval-max INTEGER the maximum number of sequence to evaluate most
probable sequence [default: 256]
--help Show this message and exit. [default: False]
This will input csv with the identifier columns followed by dimension info;
index,dim1,dim2
0,0.14,0.08
1,0.1,0.03
...
and output reconstructed model and log probability of the sequence in the following format;
index,dim1,dim2,pattern,maximum_probable_sequence,log_proba
0,0.14,0.08,*C*T*ATCCCGCCCC,ACGTGATCCCGCCCC,-17.602188110351562
1,0.1,0.03,*C*T*ATCCCGCTGC,ACATGATCCCGCTGC,-16.477264404296875
...
The default saving path is out/decode/decode_output.csv
.
To select the center of the GMM populations, run;
% python3 scripts/gmm.py \
data/sample/sample.fasta \
data/sample/cnn_phmm_vae.mdl
help of gmm.py
Usage: gmm.py [OPTIONS] SEQPATH MODELPATH
select gmm center with trained model
Options:
--use-cuda / --no-cuda use cuda if available [default: True]
--cuda-id INTEGER the device id of cuda to run [default: 0]
--save-dir PATH path to save results [default: out/gmm]
--fwd TEXT forward adapter
--rev TEXT reverse adapter
--help Show this message and exit. [default: False]
This will output the top 10 sequences to a specified directory (default out/gmm/gmm_seq.csv).
To conduct multipoint Bayesian optimization, run;
% python3 scripts/bo.py \
data/real/A_4R.fastq \
results/real/A_best.mdl \
results/real/A_evaled.csv
help of bo.py
Usage: bo.py [OPTIONS] SEQPATH MODELPATH EVALPATH
run Bayesian optimization with trained model and evaluated results
Options:
--use-cuda / --no-cuda use cuda if available [default: True]
--cuda-id INTEGER the device id of cuda to run [default: 0]
--fwd TEXT forward adapter
--rev TEXT reverse adapter
--save-dir PATH path to save results [default: out/bo]
--help Show this message and exit. [default: False]
The evaluates sequences should only hold the random region, and each row should be written in [string],[value]
format.
AACGAGAGATGGTAGACCTATCTTTTAGCC,79.0
GTAGAGATTCTGAGGGTTCTCCTGCTATA,107.1
TTTTATAAAAAAGTGTTTAAAAAAGATTCA,-3.6
...
The result contains:
- The sequence is to be evaluated.
- The position of the motif embedding.
- The embedding of the most probable sequence (
re_
).
% cat out/bo/bo_seq.csv
bo_index,seq,x,y,re_x,re_y
0,GTAGAGATTCTGAGGGTTCTCCTGTTGACC,1.53,-0.13,1.60,-0.50
1,GTAGAGATTCTGAGGGTTCTCCTGTTGCCA,1.56,-0.58,1.62,-0.47
To run the experiment with multiple sequence motifs, run;
% python3 scripts/multiple.py
help of multiple.py
Usage: multiple.py [OPTIONS]
run experiment with multiple motif
Options:
--n-motif INTEGER the number of motifs [default: 10]
--n-seq INTEGER the number of the sequence to generate [default:
10000]
--seed INTEGER seed for seqeunce generation reproduction
[default: 0]
--error-rate FLOAT the ratio to modify sequence [default: 0.1]
--epochs INTEGER the number of training epochs [default: 1000]
--threshold INTEGER the number of epochs with no loss update to stop
training [default: 50]
--use-cuda / --no-cuda use cuda if available [default: True]
--cuda-id INTEGER the device id of cuda to run [default: 0]
--save-dir PATH path to save results [default: /home/natsuki-
iwano/raptgen-xilorole/out/simlulation/multiple]
--reg-epochs INTEGER the number of epochs to conduct state transition
regularization [default: 50]
--multi INTEGER the number of training for multiple times
[default: 1]
--only-cnn / --all-models train all encoder types or not [default: False]
--help Show this message and exit. [default: False]
This outputs models ([MODEL_NAME].mdl
) and its training result ([MODEL_NAME].csv
) into specified folder (default is out/simlulation/multiple). A single run takes approximately 20 hours on Tesla V100 GPU.
To run the experiment with paired sequence motifs, run;
% python3 scripts/paired.py
help of paired.py
Usage: paired.py [OPTIONS]
run experiment with paired motif
Options:
--n-seq INTEGER the number of the sequence to generate [default:
5000]
--seed INTEGER seed for seqeunce generation reproduction
[default: 0]
--epochs INTEGER the number of training epochs [default: 1000]
--threshold INTEGER the number of epochs with no loss update to stop
training [default: 50]
--use-cuda / --no-cuda use cuda if available [default: True]
--cuda-id INTEGER the device id of cuda to run [default: 0]
--save-dir PATH path to save results [default: /home/natsuki-
iwano/raptgen-xilorole/out/simlulation/paired]
--reg-epochs INTEGER the number of epochs to conduct state transition
regularization [default: 50]
--multi INTEGER the number of training for multiple times
[default: 1]
--only-cnn / --all-models train all encoder types or not [default: False]
--help Show this message and exit. [default: False]
The default saving folder is out/simlulation/paired. A single run takes approximately 10 hours on Tesla V100 GPU.
.
├── data
│ ├── real
│ ├── sample
│ └── simulation
│ ├── multiple
│ └── paired
├── results
│ ├── real
│ └── simulation
│ ├── multiple
│ └── paired
├── scripts
└── src
├── data
├── models
└── visualization