This repo contains the code for our paper "Informative RNA-base embedding for functional RNA clustering and structural alignment". Please contact me at akiyama@dna.bio.keio.ac.jp with any questions. Please cite this paper if you use our code or system output.
In this package, we provide the following resources: the source code of the RNABERT model, pre-trained weights, and a prediction module.
Our code is written in Python 3.6.5 and requires PyTorch >= 1.4.0, Biopython >= 1.76, and a C++17-compatible compiler. To install PyTorch, please follow the instructions here: https://github.com/pytorch/pytorch#installation. Also, please make sure you have at least one NVIDIA GPU.
(Required)
git clone https://github.com//RNABERT
cd RNABERT
python setup.py install
Pre-training consists of two tasks: masked language modeling (MLM) and structural alignment learning (SAL). The SAL task uses family-specific multiple alignments for training. If you want to train with your own data, see the template data at /sample/mlm/ for the MLM task and /sample/sal/ for the SAL task. RNABERT requires that RNA sequences be represented in FASTA format. All nucleotides are represented by A, U (T), G, C. You can download the data I used for the experiment from the link below.
For the MLM task, "--maskrate" sets the fraction of nucleotides to be masked and "--mag" sets the number of mask patterns. Adjust the batch size according to the memory size of your GPU.
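The input constraints above (FASTA format, nucleotides limited to A, U (T), G, C) can be checked before training. The following is a minimal sketch using only the standard library. The helper names `read_fasta` and `invalid_records` are hypothetical and are not part of RNABERT, and converting sequences to upper case is an assumption made for this illustration.

```python
from typing import Dict, Iterable, Set

ALLOWED = set("AUTGC")  # alphabet stated in this README: A, U (T), G, C


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {header: sequence}, joining wrapped sequence lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            # Assumption: upper-case the sequence before checking the alphabet.
            records[header] += line.upper()
    return records


def invalid_records(records: Dict[str, str]) -> Dict[str, Set[str]]:
    """Return {header: unexpected characters} for sequences outside the alphabet."""
    bad: Dict[str, Set[str]] = {}
    for name, seq in records.items():
        extra = set(seq) - ALLOWED
        if extra:
            bad[name] = extra
    return bad


example = [">seq1", "GGCUAGC", "UAGC", ">seq2", "ACGNNU"]
print(invalid_records(read_fasta(example)))  # {'seq2': {'N'}}
```

To check a real file, pass an open file handle instead of the example list, for instance `read_fasta(open("sample/mlm/sample.fa"))`. Note that this sketch keeps only the last record when two headers are identical.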
export TRAIN_FILE=sample/mlm/sample.fa
export PRE_WEIGHT= #optional
export OUTPUT_WEIGHT=/path/to/output/weight
python MLM_SFP.py \
--pretraining ${PRE_WEIGHT} \
--outputweight ${OUTPUT_WEIGHT} \
--data_mlm ${TRAIN_FILE} \
--epoch 10 \
--batch 40 \
--mag 3 \
--maskrate 0.2
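To illustrate how the mask rate and the number of mask patterns interact, here is a minimal sketch of the idea. It is not the masking code used in MLM_SFP.py, whose actual logic may differ. The function name `mask_patterns`, the mask symbol "M", and the `seed` argument are assumptions made for this example.

```python
import random
from typing import List

MASK = "M"  # hypothetical mask symbol, chosen only for this illustration


def mask_patterns(seq: str, maskrate: float, mag: int, seed: int = 0) -> List[str]:
    """Return `mag` copies of `seq`, each with its own random subset of positions masked."""
    rng = random.Random(seed)
    # Mask at least one position, but never more positions than the sequence has.
    n_mask = min(len(seq), max(1, round(len(seq) * maskrate)))
    patterns = []
    for _ in range(mag):
        positions = set(rng.sample(range(len(seq)), n_mask))
        masked = "".join(MASK if i in positions else base for i, base in enumerate(seq))
        patterns.append(masked)
    return patterns


# One 10-nt sequence with maskrate 0.2 and mag 3 yields three training
# examples, each with two masked positions.
for pattern in mask_patterns("GGCUAGCUAG", maskrate=0.2, mag=3):
    print(pattern)
```

Under these assumptions, a larger "--mag" multiplies the number of training examples derived from each sequence, which is one reason the batch size may need adjusting.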
The SAL task takes multiple alignments per family as input, and "--mag" can be used to specify how many pairwise alignments should be generated for a single sequence.
export TRAIN_FILE=sample/sal/sample.afa.txt
export PRE_WEIGHT= #optional
export OUTPUT_WEIGHT=/path/to/output/weight
python MLM_SFP.py \
--pretraining ${PRE_WEIGHT} \
--outputweight ${OUTPUT_WEIGHT} \
--data_mul ${TRAIN_FILE} \
--epoch 10 \
--batch 40 \
--mag 5
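The way pairwise alignments can be derived from a family's multiple alignment is sketched below. This is an illustration only and not the repository's implementation: the function name `pairwise_from_msa`, the gap character "-", and sampling partners without replacement are all assumptions.

```python
import random
from typing import Dict, List, Tuple

GAP = "-"  # assumed gap character in the aligned FASTA input


def pairwise_from_msa(
    msa: Dict[str, str], mag: int, seed: int = 0
) -> List[Tuple[str, str, str, str]]:
    """For each sequence, sample up to `mag` partners and project the MSA onto each pair."""
    rng = random.Random(seed)
    names = list(msa)
    pairs = []
    for name in names:
        others = [n for n in names if n != name]
        for partner in rng.sample(others, min(mag, len(others))):
            # Keep only the columns where at least one of the two rows has a nucleotide.
            cols = [
                (a, b)
                for a, b in zip(msa[name], msa[partner])
                if not (a == GAP and b == GAP)
            ]
            row_a = "".join(a for a, _ in cols)
            row_b = "".join(b for _, b in cols)
            pairs.append((name, partner, row_a, row_b))
    return pairs


family = {"s1": "GG-CU-A", "s2": "G--CUUA", "s3": "GGACU-A"}
for name, partner, row_a, row_b in pairwise_from_msa(family, mag=1):
    print(name, partner, row_a, row_b)
```

Dropping the columns that are gaps in both rows keeps each pairwise alignment consistent with the family alignment while removing positions that carry no information for that pair.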
Download the pre-trained model into a directory. This model was trained on the full Rfam 14.3 dataset (~400nt).
After the model is fine-tuned, we can get predictions by running
export PRED_FILE=sample/aln/sample.raw.fa
export PRE_WEIGHT=/path/to/pretrained/weight
python MLM_SFP.py \
--pretraining ${PRE_WEIGHT} \
--data_alignment ${PRED_FILE} \
--batch 40 \
--show_aln
To obtain the embedding vectors for the RNA sequences, set OUTPUT_FILE to the path where the embeddings should be written and run
python MLM_SFP.py \
--pretraining ${PRE_WEIGHT} \
--data_embedding ${PRED_FILE} \
--embedding_output ${OUTPUT_FILE} \
--batch 40