Skip to content

arman-yekkehkhani/data_analysis

Repository files navigation

Data Analysis

This project contains Python scripts and Jupyter notebooks for analysis of publicly available target-disease associations datasets provided by the Open Targets team.

Project Structure

The overall structure of the project is as follows:


data_analysis
    │   README.md
    │   main.py
    │   tests.py
    │   download_datasets.py
    │   Journey.ipynb
    │   disease_target.json
    │   evidence_stats.json
    │
    └───datasets
            └───evidences
            └───diseases
            └───targets

The two Python scripts main.py and download_datasets.py are main scripts for downloading datasets and preforming analysis. tests.py include several test cases for sake of sanity check. Final results of the analysis is saved in evidence_stats.json and disease_target.json. The former contains Json objects of evidence statistics, while the later include disease-target pair stats. File Journey.ipynb contains a comprehensive description of my approach for this project. I tried to compare different options and justify my decisions.

How to run the script

In order to run the main.py script, you should follow these steps:

  1. Clone this repository and change your current working directory as follows
git clone https://github.com/arman-yekkehkhani/data_analysis
cd data_analysis
  1. Create a new Python virtual environment and install the dependencies from requirements.text. Here, I use Python >= 3.7 and pip as a package manager.
python3 venv -m env
source env/bin/activate
pip install -r requirements.txt
  1. run main.py with desired args.
python main.py

# removes directories containing the datasets in the current dir
# fetch datasets
python main.py --over-write true

Expected Output

The final output of each files is as follows:

1.evidence_stats.json

{
  "diseaseId": "EFO_0000095",
  "targetId": "ENSG00000082898",
  "median": 0.7,
  "top3": [
    0.7,
    0.7,
    0.7
  ]
}

2.disease_target.json

{
  "diseaseId": "EFO_0003847",
  "targetId": "ENSG00000284299",
  "median": 0.0
}

Another important result is the number of target-target pairs sharing a connection to at least two diseases, which is printed when main.py is finished.

Number of target-target pairs share a connection to at least two diseases : 364624, done in 3.24s

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages