Workflows for germline short variant discovery with GATK4 optimized by Intel for on-premises infrastructure

gatk-workflows/intel-gatk4-germline-snps-indels

Intel Optimized GATK4 Germline SNPs and Indels Variant Calling Workflow.

WORKFLOWS AND JSONS

This repository contains several WDL and JSON files, each tuned for a specific requirement (WES or WGS, throughput or latency).

├── Exome_2T_PairedSingleSampleWf_optimized.inputs.json WES Throughput JSON file
├── Exome_56T_PairedSingleSampleWf_optimized.inputs.json WES Latency JSON file
├── Exome_PairedSingleSampleWf_noqc_nocram_optimized.wdl WES WDL optimized for on-prem
├── Latency_PairedSingleSampleWf_HT_384GB.json WGS Latency JSON file with HT on
├── Latency_PairedSingleSampleWf_NO_HT_384GB.json WGS Latency JSON file with HT off
├── Throughput_PairedSingleSampleWf_HT_384GB.json WGS Throughput JSON file with HT on
├── Throughput_PairedSingleSampleWf_NO_HT_384GB.json WGS Throughput JSON file with HT off
├── PairedSingleSampleWf_noqc_nocram_optimized.wdl WGS WDL optimized for on-prem
├── PairedSingleSampleWf_noqc_nocram_withcleanup_optimized.wdl WGS WDL optimized for on-prem benchmarking

Modify the following lines in the WDL files to reflect the paths where datasets reside in your cluster:

In the JSON files, modify the paths to the datasets and tools where they reside in your cluster.
Example: modify Latency_PairedSingleSampleWf_optimized.inputs.json for tools directory.
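
As an illustration of the path edits described above, the following is a minimal Python sketch that swaps one directory prefix for another in every string value of an inputs JSON. It assumes the paths in the shipped JSON files share a common prefix; the function names and any prefixes passed to them are hypothetical and not part of this repository.

```python
import json
from pathlib import Path


def rewrite_prefix(value, old_prefix, new_prefix):
    """Recursively replace old_prefix at the start of every string value."""
    if isinstance(value, str):
        if value.startswith(old_prefix):
            return new_prefix + value[len(old_prefix):]
        return value
    if isinstance(value, list):
        return [rewrite_prefix(v, old_prefix, new_prefix) for v in value]
    if isinstance(value, dict):
        return {k: rewrite_prefix(v, old_prefix, new_prefix) for k, v in value.items()}
    return value  # numbers, booleans and null are left untouched


def rewrite_inputs_file(src, dst, old_prefix, new_prefix):
    """Load an inputs JSON, rewrite path prefixes, and write the result to dst."""
    data = json.loads(Path(src).read_text())
    updated = rewrite_prefix(data, old_prefix, new_prefix)
    Path(dst).write_text(json.dumps(updated, indent=2) + "\n")
```

Writing the result to a new file rather than overwriting the original keeps the shipped JSON available for comparison. Only values that start with the old prefix are changed, so unrelated strings and numeric settings pass through unmodified.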

For improved throughput performance of WGS processing, it is recommended to uncomment the "backend" configuration and set up 4 Cromwell queues. The 4-queue approach supports CPU- and memory-level allocation:

Local: Run the first 3 basic tasks locally and serialize the workflows.
BWA: Run BWA at low priority on all nodes (let BWA run on 1/2 of the nodes until its work is done).
All: 1/2 of the nodes for everything else, at high priority.
Haplo: 1/2 of all nodes at mid priority for Haplotype.

DATASETS

The datasets used for tuning the WGS workflow can be obtained from: https://console.cloud.google.com/storage/browser/broad-public-datasets/NA12878/unmapped/.

Contact Broad/Intel for access to the WES data needed for this workflow.

The other reference files and resource files can be downloaded from:

Datasets Recommended for Setup and Testing this workflow

| Data Type | Filename | File Path |
|---|---|---|
| Reference Genome | | |
| ref_dict | Homo_sapiens_assembly38.dict | https://console.cloud.google.com/storage/browser/broad-references/hg38/v0 |
| ref_fasta | Homo_sapiens_assembly38.fasta | |
| ref_fasta_index | Homo_sapiens_assembly38.fasta.fai | |
| ref_alt | Homo_sapiens_assembly38.fasta.64.alt | |
| ref_sa | Homo_sapiens_assembly38.fasta.64.sa | |
| ref_amb | Homo_sapiens_assembly38.fasta.64.amb | |
| ref_bwt | Homo_sapiens_assembly38.fasta.64.bwt | |
| ref_ann | Homo_sapiens_assembly38.fasta.64.ann | |
| ref_pac | Homo_sapiens_assembly38.fasta.64.pac | |
| contamination_sites_ud | Homo_sapiens_assembly38.contam.UD | |
| contamination_sites_bed | Homo_sapiens_assembly38.contam.bed | |
| contamination_sites_mu | Homo_sapiens_assembly38.contam.mu | |
| Resource Files | | |
| dbSNP_vcf | Homo_sapiens_assembly38.dbsnp138.vcf | |
| dbSNP_vcf_index | Homo_sapiens_assembly38.dbsnp138.vcf.idx | |
| known_snps_sites_vcf | Mills_and_1000G_gold_standard.indels.hg38.vcf.gz | |
| known_snps_sites_vcf_index | Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi | |
| known_indels_sites_VCFs | Mills_and_1000G_gold_standard.indels.hg38.vcf.gz | |
| | Homo_sapiens_assembly38.known_indels.vcf.gz | |
| known_indels_sites_indices | Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi | |
| | Homo_sapiens_assembly38.known_indels.vcf.gz.tbi | |
| Interval Files | | |
| wgs_calling_interval_list | wgs_calling_regions.hg38.interval_list (*SEE NOTE BELOW) | |
| wgs_coverage_interval_list | wgs_coverage_regions.hg38.interval_list | |
| wgs_evaluation_interval_list | wgs_evaluation_regions.hg38.interval_list | |
| Small Test Input Datasets | | |
| flowcell_unmapped_bams | H06HDADXX130110.1.ATCACGAT.20k_reads.bam | https://console.cloud.google.com/storage/browser/genomics-public-data/test-data/dna/wgs/hiseq2500/NA12878/ |
| | H06HDADXX130110.2.ATCACGAT.20k_reads.bam | |
| | H06JUADXX130110.1.ATCACGAT.20k_reads.bam | |

NOTE: The Exome Interval file whole_exome_illumina_coding_v1.Homo_sapiens_assembly38.targets.interval_list is hosted at https://console.cloud.google.com/storage/browser/gatk-test-data/intervals/.
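
The dataset locations in this section are Google Cloud console browser URLs, while command-line download tools generally expect a `gs://` URI. The sketch below shows one way to convert between the two in Python, assuming the usual console layout of `/storage/browser/<bucket>/<path>`. The function name is hypothetical, and the buckets themselves may have been renamed or moved since this README was written.

```python
from urllib.parse import urlparse

# Path prefix used by the Cloud console storage browser.
BROWSER_PREFIX = "/storage/browser/"


def console_url_to_gs_uri(url):
    """Convert a Cloud console storage-browser URL to a gs:// URI."""
    parsed = urlparse(url)
    if parsed.netloc != "console.cloud.google.com":
        raise ValueError(f"not a Cloud console URL: {url}")
    if not parsed.path.startswith(BROWSER_PREFIX):
        raise ValueError(f"not a storage browser URL: {url}")
    # Everything after the prefix is "<bucket>/<object path>".
    return "gs://" + parsed.path[len(BROWSER_PREFIX):]


if __name__ == "__main__":
    print(console_url_to_gs_uri(
        "https://console.cloud.google.com/storage/browser/broad-references/hg38/v0"
    ))
```

The resulting URI can then be passed to a copy tool such as `gsutil cp` to fetch the files onto the cluster.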

TOOLS

For on-prem, the workflow uses non-dockerized tools:

GATK can be downloaded from: https://github.com/broadinstitute/gatk/releases
SAMtools can be downloaded from: http://www.htslib.org/download/
Picard can be downloaded from: https://broadinstitute.github.io/picard/
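
Because the tools are not dockerized, it can help to confirm that they are installed before launching a workflow. The following Python sketch is one possible pre-flight check: it looks up executables on `PATH` and verifies that files such as JAR archives exist. The function name, executable names and file paths are assumptions for illustration and should be replaced with the locations configured in your JSON files.

```python
import shutil
from pathlib import Path


def find_missing(executables, files):
    """Return the executables not found on PATH and the files that do not exist."""
    missing = [name for name in executables if shutil.which(name) is None]
    missing += [str(path) for path in files if not Path(path).is_file()]
    return missing


if __name__ == "__main__":
    # Hypothetical names and paths; adjust to your tools directory.
    problems = find_missing(
        executables=["java", "samtools"],
        files=["/path/to/tools/picard.jar"],
    )
    for item in problems:
        print(f"missing: {item}")
```

An empty result means every listed executable and file was found. It does not verify tool versions.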
