KerenRozen/genomic_annotations


Genomic Annotations

Genomic Annotations is a project for annotating genomic data with information drawn from various databases. Its primary objective is annotation that is fast enough to keep up with AI model training speed.

Setup and Installation

To set up and run the project, follow these steps:

  1. Clone the repository:
git clone <repo_url>
  2. Install the required packages. Make sure you have Python 3.6 installed, then install the dependencies with:
pip install -r requirements.txt
  3. Download the necessary annotation files. The following files are required for each type of annotation:
    1. 164 cell type regulation annotations:
      • hg37
      • hg38 was created manually from hg37.
    2. Classifications:
    3. Regulation regions:
    4. Methylation:

Building the annotations databases

Run the following commands to build the annotation databases. Each command builds one database.

To build the cell type regulation DB:

python3 build_cell_type_regulation_db.py <inputpath> <outputpath> <hg>

To build the classifications DB:

python3 build_classifications_db.py <inputpath> <outputpath> <hg> 

To build the regulatory regions DB:

python3 build_regulatory_regions_db.py <inputpath> <outputpath> <hg> 

To build the methylation DB:

python3 build_methylation_db.py <inputpath> <outputpath> <hg> 

Parameters in all cases:

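  • inputpath: The local path to the downloaded annotation file for the DB being built (for example, cell_type_regulation.bed.gz for the cell type regulation DB).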

  • outputpath: The local path to save the DB.

  • hg: {37, 38}. The desired reference genome. Use 37 for hg37 and 38 for hg38.

These commands process the input files and create the annotation databases at the output path you provided.
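The four build commands share the same argument order, so they can be driven from one small wrapper. The following is a minimal sketch, not part of the repository: the script names come from the commands above, while `BUILD_SCRIPTS`, `build_command`, `build_all`, and the example file names are hypothetical. It assumes the scripts are run from the repository root and that `python3` is on the PATH.

```python
import subprocess

# Script names are taken from the build commands in this README.
# The dictionary keys are hypothetical labels used only by this sketch.
BUILD_SCRIPTS = {
    "cell_type_regulation": "build_cell_type_regulation_db.py",
    "classifications": "build_classifications_db.py",
    "regulatory_regions": "build_regulatory_regions_db.py",
    "methylation": "build_methylation_db.py",
}


def build_command(kind, input_path, output_path, hg):
    """Return the argument list for one build script: <inputpath> <outputpath> <hg>."""
    if kind not in BUILD_SCRIPTS:
        raise ValueError(f"unknown annotation type: {kind!r}")
    if hg not in (37, 38):
        raise ValueError("hg must be 37 or 38")
    return ["python3", BUILD_SCRIPTS[kind], str(input_path), str(output_path), str(hg)]


def build_all(jobs, hg, dry_run=True):
    """Build every DB listed in jobs, a dict of kind -> (input_path, output_path).

    With dry_run=True the commands are only printed, not executed.
    """
    commands = [build_command(kind, inp, out, hg) for kind, (inp, out) in jobs.items()]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first script that exits with an error.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical paths; replace them with the files you downloaded.
    build_all(
        {
            "cell_type_regulation": ("cell_type_regulation.bed.gz", "db/cell_type_regulation"),
            "methylation": ("methylation_input_file", "db/methylation"),
        },
        hg=38,
    )
```

Calling `build_all(..., dry_run=False)` runs the scripts one after another; the dry run is the default so that the commands can be inspected first.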

Usage

To test the 164 cell type regulation annotation speed, run:

python3 runtime_test_cell_type_annotation.py <path> <hg> <numberofsamples=1> <outputformat=flat> <sample>

To test the classifications annotation speed, run:

python3 runtime_test_classifications_annotation.py <path> <hg> <numberofsamples=1> <outputformat=flat> <sample>

To test the regulatory regions annotation speed, run:

python3 runtime_test_regulatory_regions_annotation.py <path> <hg> <numberofsamples=1> <outputformat=flat> <sample>

To test the methylation annotation speed, run:

python3 runtime_test_methylation_annotation.py <path> <hg> <numberofsamples=1> <outputformat=flat> <sample>

Parameters in all cases:

  • path: The local path to the DB being tested (for example, the cell type regulation DB).

  • hg: {37, 38}. The desired reference genome. Use 37 for hg37 and 38 for hg38.

  • numberofsamples: The number of randomly generated samples on which to test the runtime. The default value is 1.

  • outputformat: {'flat', 'matrix'}. The desired output format of the annotation. Use 'flat' for a one-dimensional feature vector and 'matrix' for a matrix with features for each nucleotide. The default value is 'flat'.

  • sample: -s chromosome start_pos end_pos flag. (Optional) Specify a specific sample to annotate.
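The runtime test commands also share one argument layout, with the optional sample appended as `-s chromosome start_pos end_pos flag`. The sketch below only assembles the argument list in that order; it is an illustration, not code from the repository. `TEST_SCRIPTS`, `runtime_test_command`, and the example values are hypothetical, and because this README does not define the meaning of `flag` or the chromosome naming format, both are passed through unchanged.

```python
import subprocess

# Script names are taken from the usage commands in this README.
# The dictionary keys are hypothetical labels used only by this sketch.
TEST_SCRIPTS = {
    "cell_type": "runtime_test_cell_type_annotation.py",
    "classifications": "runtime_test_classifications_annotation.py",
    "regulatory_regions": "runtime_test_regulatory_regions_annotation.py",
    "methylation": "runtime_test_methylation_annotation.py",
}


def runtime_test_command(kind, db_path, hg, number_of_samples=1,
                         output_format="flat", sample=None):
    """Return the argument list for one runtime test script.

    sample, if given, is a (chromosome, start_pos, end_pos, flag) tuple that is
    appended after "-s" exactly as supplied.
    """
    if kind not in TEST_SCRIPTS:
        raise ValueError(f"unknown annotation type: {kind!r}")
    if hg not in (37, 38):
        raise ValueError("hg must be 37 or 38")
    if output_format not in ("flat", "matrix"):
        raise ValueError("output format must be 'flat' or 'matrix'")
    cmd = ["python3", TEST_SCRIPTS[kind], str(db_path), str(hg),
           str(number_of_samples), output_format]
    if sample is not None:
        chromosome, start_pos, end_pos, flag = sample
        cmd += ["-s", str(chromosome), str(start_pos), str(end_pos), str(flag)]
    return cmd


def run_runtime_test(*args, **kwargs):
    """Run the test script and raise if it exits with an error."""
    subprocess.run(runtime_test_command(*args, **kwargs), check=True)


if __name__ == "__main__":
    # Hypothetical DB path and sample values, printed rather than executed.
    print(" ".join(runtime_test_command(
        "methylation", "db/methylation", 38,
        number_of_samples=100, output_format="matrix")))
    print(" ".join(runtime_test_command(
        "cell_type", "db/cell_type_regulation", 37,
        sample=("1", 10000, 10500, 0))))
```

Keeping command construction separate from execution makes it easy to check the argument order before timing a large number of samples.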
