Genomic Annotations is a project for annotating genomic data with information from various databases. The primary objective is annotation fast enough to keep up with AI model training.
To set up and run the project, follow these steps:
- Clone the repository:
git clone <repo_url>
- Install the required packages. Make sure you have Python 3.6 installed. Use the following command to install the dependencies:
pip install -r requirements.txt
- Download the necessary annotation files. The following files are required for each type of annotation:
- Build the databases. Run the following commands to start the database building process:
python3 build_cell_type_regulation_db.py <inputpath> <outputpath> <hg>
python3 build_classifications_db.py <inputpath> <outputpath> <hg>
python3 build_regulatory_regions_db.py <inputpath> <outputpath> <hg>
python3 build_methylation_db.py <inputpath> <outputpath> <hg>
Parameters in all cases:
- inputpath: The local path to the input annotation file for that script (for example, cell_type_regulation.bed.gz for build_cell_type_regulation_db.py).
- outputpath: The local path to save the DB.
- hg: {37, 38}. The desired reference genome. Use 37 for hg37 and 38 for hg38.
These commands process the files and create the databases needed for annotation at the output path you provided.
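If you want to build several databases in one go, the build step can be scripted. The following is a minimal sketch, not part of the repository: it assumes the scripts are run from the repository root and take three positional arguments in the order shown above. The helper names `build_commands` and `run_all` and the `jobs` mapping are hypothetical.

```python
import subprocess
import sys

# Script names are taken from the commands listed above.
BUILD_SCRIPTS = [
    "build_cell_type_regulation_db.py",
    "build_classifications_db.py",
    "build_regulatory_regions_db.py",
    "build_methylation_db.py",
]


def build_commands(jobs, hg):
    """Return one argv list per build script that has a job entry.

    `jobs` (hypothetical) maps a script name to an (inputpath, outputpath)
    pair; `hg` must be 37 or 38.
    """
    if hg not in (37, 38):
        raise ValueError("hg must be 37 or 38")
    commands = []
    for script in BUILD_SCRIPTS:
        if script not in jobs:
            continue  # no input supplied for this annotation type
        inputpath, outputpath = jobs[script]
        commands.append([sys.executable, script, inputpath, outputpath, str(hg)])
    return commands


def run_all(jobs, hg):
    """Run each build in turn; check=True stops at the first failing script."""
    for cmd in build_commands(jobs, hg):
        subprocess.run(cmd, check=True)
```

Calling `run_all({"build_methylation_db.py": ("<inputpath>", "<outputpath>")}, 38)` would build only the methylation DB for hg38; add one entry per annotation type you need.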
To test the annotation runtime against a built database, run the matching command:
python3 runtime_test_cell_type_annotation.py <path> <hg> <numberofsamples=1> <outputformat=flat> <sample>
python3 runtime_test_classifications_annotation.py <path> <hg> <numberofsamples=1> <outputformat=flat> <sample>
python3 runtime_test_regulatory_regions_annotation.py <path> <hg> <numberofsamples=1> <outputformat=flat> <sample>
python3 runtime_test_methylation_annotation.py <path> <hg> <numberofsamples=1> <outputformat=flat> <sample>
Parameters in all cases:
- path: The local path to the DB for that annotation type (for example, the cell type regulation DB for runtime_test_cell_type_annotation.py).
- hg: {37, 38}. The desired reference genome. Use 37 for hg37 and 38 for hg38.
- numberofsamples: The number of randomly generated samples on which to test the runtime. The default value is 1.
- outputformat: {'flat', 'matrix'}. The output format of the annotation. Use 'flat' for a one-dimensional feature vector and 'matrix' for a matrix with features for each nucleotide. The default value is 'flat'.
- sample: -s chromosome start_pos end_pos flag. (Optional) A specific sample to annotate.
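The parameters above can be checked before a runtime test is launched. The sketch below is illustrative only: it assumes the arguments are positional in the order shown and that the optional sample is appended as `-s` followed by four values. The function `runtime_test_command` is a hypothetical helper, not part of the project.

```python
import sys

VALID_HG = (37, 38)
VALID_FORMATS = ("flat", "matrix")


def runtime_test_command(script, db_path, hg, number_of_samples=1,
                         output_format="flat", sample=None):
    """Assemble the argv list for one runtime test script.

    `sample` is an optional (chromosome, start_pos, end_pos, flag) tuple.
    Defaults mirror the documented ones: 1 sample, 'flat' output.
    """
    if hg not in VALID_HG:
        raise ValueError("hg must be 37 or 38")
    if output_format not in VALID_FORMATS:
        raise ValueError("output format must be 'flat' or 'matrix'")
    if number_of_samples < 1:
        raise ValueError("number of samples must be at least 1")

    cmd = [sys.executable, script, db_path, str(hg),
           str(number_of_samples), output_format]
    if sample is not None:
        if len(sample) != 4:
            raise ValueError("sample needs chromosome, start_pos, end_pos, flag")
        cmd.append("-s")
        cmd.extend(str(part) for part in sample)
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Validating up front means a mistyped genome build or output format fails immediately rather than after the database has been loaded.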