variant-spark is a scalable toolkit for genome-wide association studies, optimized for GWAS-like datasets.
Machine learning methods and, in particular, random forests (RFs) are a promising alternative to standard single-SNP analyses in genome-wide association studies (GWAS). RFs provide variable importance measures to rank SNPs according to their predictive power. Although a number of random forest implementations exist, some of them parallel or distributed (such as Random Jungle, ranger, or SparkML), most are not optimized for GWAS datasets, which usually come with thousands of samples and millions of variables.
variant-spark currently provides the basic functionality of building a random forest model and estimating variable importance with the mean decrease in Gini method, and it can operate on VCF and CSV files. Future extensions will include support for other importance measures, variable selection methods, and data formats.
variant-spark uses a novel approach of building random forests from data in a transposed representation, which allows it to deal efficiently with even extremely wide GWAS datasets. Moreover, since VCF, the most common format for genomic variant calls, uses the transposed representation, variant-spark can work directly with VCF data, without the costly pre-processing required by other tools.
variant-spark is built on top of Apache Spark, a modern distributed framework for big data processing, which gives variant-spark the ability to scale horizontally on both bespoke clusters and public clouds.
The potential users include:
- Medical researchers seeking to perform GWAS-like analysis on large-cohort genome-wide sequencing data or imputed SNP array data.
- Medical researchers or clinicians seeking to perform clustering on genomic profiles to stratify large-cohort genomic data.
- General researchers with classification or clustering needs for datasets with millions of features.
Please feel free to add issues and/or upvote issues you care about. Also join the Gitter chat. We have also started documentation on ReadTheDocs, and this repo's issues page is always available for your requests. Thanks for your support.
To learn more, watch this video from YOW! Brisbane 2017.
Bicep is free, is supported by Microsoft support, and is a fun, easy, and productive way to build and deploy complex infrastructure on Azure. If you are currently using ARM, you will love Bicep's simple syntax. Bicep also supports declaring existing resources. More resources are available at this link.
- Managed Identity needs to be enabled as a resource provider inside Azure.
- For the bash script, jq must be installed.
To clone and run this repo, you'll need Git, Bicep, and azure-cli installed on your computer. We strongly recommend editing the file in VS Code with the Bicep extension installed (instructions) for IntelliSense and other completions.
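Before going further, it can help to confirm that the prerequisites are on your PATH. The following is a minimal sketch, not part of this repo: the helper name `missing_tools` is hypothetical, and it assumes the azure-cli is installed as `az`. It only checks that the commands exist, not their versions.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print each of the given commands that is NOT on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
    fi
  done
}

# Tools named in the prerequisites above (az is the azure-cli executable).
missing="$(missing_tools git jq az)"
if [ -n "$missing" ]; then
  # Intentionally unquoted so the names print on one line.
  echo "Missing prerequisites:" $missing >&2
else
  echo "All prerequisites found."
fi
```

If Bicep was installed through the azure-cli rather than as a standalone binary, `az bicep version` is one way to confirm it is available.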
Click on the above link to deploy the template. This will take you to your Azure subscription and ask you to fill out certain parameters. Once completed, the entire infrastructure will be created along with a Databricks workspace, which can be used to run the default notebook.
If you need to customize the template, you can use the following commands:
# Clone this repository
$ git clone https://github.com/aehrc/VariantSpark-Azure-deployment.git
# Go into the repository
$ cd VariantSpark-Azure-deployment
# Update main.bicep file with variables as required. Default is for southeastasia region.
# Refer to Azure Databricks UDR section under References for region specific parameters.
$ code main.bicep
# Run the build shell script to create the resources
$ ./build.sh
Note: The build script assumes a Linux environment. If you're using Windows, see this guide on running Linux.
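For orientation, deploying a Bicep template with the azure-cli by hand typically looks like the sketch below. This is not the contents of build.sh: it assumes main.bicep can be deployed at resource-group scope, and the resource group name `variantspark-rg` and the helper names `run` and `deploy_sketch` are hypothetical. By default it only prints the commands; set `DRY_RUN=0` to actually run them after `az login`.

```bash
#!/usr/bin/env bash
# Sketch only: NOT the contents of build.sh.
# Assumption: main.bicep is deployable at resource-group scope.

# Print the command when DRY_RUN=1 (the default here); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Usage: deploy_sketch <resource-group> <location> <template-file>
deploy_sketch() {
  local rg="$1" location="$2" template="$3"
  # Create (or reuse) the resource group in the chosen region.
  run az group create --name "$rg" --location "$location"
  # Deploy the Bicep template into that resource group.
  run az deployment group create --resource-group "$rg" --template-file "$template"
}

# "variantspark-rg" is a hypothetical name; southeastasia is the template default.
deploy_sketch variantspark-rg southeastasia main.bicep
```

Keeping the dry-run default makes it easy to review the exact commands before anything is created in your subscription.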
This template is based on ARM templates from the below repo: