This is the official codebase for the BERT-backboned model in BIOFORMERS: A SCALABLE FRAMEWORK FOR EXPLORING BIOSTATES USING TRANSFORMERS. The model is trained for the gene expression modeling task using the PBMC 4k + 8k datasets and the Adamson Perturbation dataset.
We recommend using `venv` and `pip` to install the required packages for Bioformers-BERT:

- Create a Python >= 3.9 virtual environment and activate it;
- Clone the repository and `cd` inside;
- Install the required packages with `pip3 install -r requirements.txt`.
Before running the scripts, adjust `settings.json` to configure the run:

- To use the Adamson Perturbation dataset, set `"dataset_name": "adamson"` and `"log_transform": false`.
- To use the PBMC datasets, set `"dataset_name": "PBMC"` and `"log_transform": true`.
- Other settings, such as normalization, tokenization binning, the nonzero gene ratio in the mask, model dimensions, and training details, can be adjusted by editing the remaining variables. All results reported in the paper for the BERT-backboned model are reproducible with these settings.
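As a concrete illustration, a minimal `settings.json` for an Adamson run might look like the fragment below. Only `dataset_name` and `log_transform` are documented above; the names of the remaining keys (normalization, binning, masking, model, and training options) are defined in the repository's own `settings.json` and are not reproduced here.

```json
{
  "dataset_name": "adamson",
  "log_transform": false
}
```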
Then run the following commands for preprocessing, training, and evaluation:

```shell
python3 data-processing.py
python3 train-random-mask.py
python3 eval-random-mask.py /path/to/saved/checkpoint
```
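To make the "nonzero gene ratio in the mask" setting concrete, here is a minimal sketch of BERT-style random masking over a gene-expression vector, where a chosen fraction of the masked positions is drawn from genes with nonzero expression. The function name, parameters, and logic below are illustrative assumptions, not the actual API of `train-random-mask.py`.

```python
import numpy as np

def random_expression_mask(expr, mask_ratio=0.15, nonzero_ratio=0.5, seed=0):
    """Sketch: pick mask positions for one cell's expression vector.

    Roughly `nonzero_ratio` of the masked positions come from genes with
    nonzero expression; the remainder come from zero-expression genes.
    Illustrative only -- not the repository's actual implementation.
    """
    rng = np.random.default_rng(seed)
    expr = np.asarray(expr)
    n_mask = max(1, int(round(mask_ratio * expr.size)))

    nonzero = np.flatnonzero(expr != 0)  # indices of expressed genes
    zero = np.flatnonzero(expr == 0)     # indices of unexpressed genes

    # Split the mask budget between the two groups, capped by availability.
    n_nz = min(len(nonzero), int(round(nonzero_ratio * n_mask)))
    n_z = min(len(zero), n_mask - n_nz)

    picked = np.concatenate([
        rng.choice(nonzero, size=n_nz, replace=False),
        rng.choice(zero, size=n_z, replace=False),
    ])
    mask = np.zeros(expr.size, dtype=bool)
    mask[picked] = True
    return mask
```

During training, positions where the mask is `True` would have their expression values replaced by a mask token, and the model is trained to reconstruct them.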
We would like to express our gratitude to the developers of the open-source projects we utilized. If you use this code, please cite:
```bibtex
@article{Amara-Belgadi2023.11.29.569320,
  author = {Siham Amara-Belgadi and Orion Li and David Yu Zhang and Ashwin Gopinath},
  title = {BIOFORMERS: A SCALABLE FRAMEWORK FOR EXPLORING BIOSTATES USING TRANSFORMERS},
  year = {2023},
  doi = {10.1101/2023.11.29.569320},
  publisher = {Cold Spring Harbor Laboratory},
  url = {https://www.biorxiv.org/content/early/2023/12/01/2023.11.29.569320},
  eprint = {https://www.biorxiv.org/content/early/2023/12/01/2023.11.29.569320.full.pdf},
  journal = {bioRxiv}
}
```