dv-trio

dv-trio provides a pipeline to call variants for a trio (father-mother-child) using DeepVariant [1]. The genomic VCF files (gVCFs) created by DeepVariant are then co-called together using GATK [2]. The resultant trio VCF is then post-processed with FamSeq [3] to eliminate Mendelian errors. The final output is a VCF in which each sample's GT value represents the genotype called by FamSeq.
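
To illustrate what a Mendelian error is, here is a minimal sketch that checks whether a child's genotype could have been inherited from the parents' genotypes. It is not FamSeq's method (FamSeq uses pedigree-aware likelihoods rather than a hard rule). The function names are hypothetical, and the check assumes diploid, autosomal, biallelic-style GT strings such as `0/1` or `0|1`.

```shell
# has_allele GENOTYPE ALLELE: succeeds if ALLELE is one of the two alleles in GENOTYPE.
has_allele() {
  [ "${1%%[/|]*}" = "$2" ] || [ "${1##*[/|]}" = "$2" ]
}

# is_mendelian_consistent CHILD_GT FATHER_GT MOTHER_GT
# Succeeds if one child allele can come from the father and the other from the mother.
is_mendelian_consistent() {
  local c1 c2
  c1=${1%%[/|]*}   # first child allele
  c2=${1##*[/|]}   # second child allele
  { has_allele "$2" "$c1" && has_allele "$3" "$c2"; } ||
  { has_allele "$2" "$c2" && has_allele "$3" "$c1"; }
}

# Child 0/1 with parents 0/0 and 0/1 is consistent.
is_mendelian_consistent "0/1" "0/0" "0/1" && echo "consistent"
# Child 1/1 with a 0/0 father cannot be inherited: a Mendelian error.
is_mendelian_consistent "1/1" "0/0" "0/1" || echo "Mendelian error"
```

Sites that fail a check like this in the co-called trio VCF are the kind of calls the FamSeq post-processing step is intended to correct.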

Installation

Clone this repository onto your cloud instance and run the install_dependencies.sh script with bash. This installs all dependencies and makes them available on your instance's PATH.

Usage

Usage:
       dv-trio.sh -i <input parameter file> -r <reference> -d <dbSNP VCF> [ -o <output directory name> ] [ -t <threshold> ] [ -b <bucket> ]

Post-processes trio calls made by DeepVariant to correct for Mendelian errors.

Required arguments:

  -i <input parameter file>   path to the input file containing trio details.
                              See the Input Parameter File section below for details
  -r <reference>              path to the reference file.
                              The directory holding the reference file needs to contain the fa, fai and dict files
  -d <dbSNP VCF>              path to the dbSNP VCF file.

Options:
  -o <output>     path to desired output directory (defaults to current directory)
  -t <threshold>  likelihood ratio cutoff threshold for Mendelian error correction (float between 0 [use single individual based method] and 1 [use pedigree information], default is 1.0)
  -b <bucket>     S3 bucket path to write output to
  -h              this help message
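
As an illustration of the usage above, the following sketch validates the threshold and then prints the dv-trio.sh command it would run. The reference, dbSNP and output paths are hypothetical placeholders, and the `valid_threshold` helper is not part of dv-trio; only the flags come from the help text.

```shell
# Hypothetical paths; replace them with your own files.
PARAMS=GIAB_trio_file.txt
REFERENCE=/home/ubuntu/reference/GRCh38.fa   # its directory should also hold the fai and dict files
DBSNP=/home/ubuntu/reference/dbsnp.vcf
OUTDIR=dv-trio-output
THRESHOLD=1.0

# valid_threshold VALUE: succeeds if VALUE is a plain decimal number between 0 and 1.
valid_threshold() {
  awk -v t="$1" 'BEGIN { exit !(t ~ /^[0-9]*\.?[0-9]+$/ && t >= 0 && t <= 1) }'
}

if valid_threshold "$THRESHOLD"; then
  # Dry run: remove the leading "echo" to actually launch the pipeline.
  echo dv-trio.sh -i "$PARAMS" -r "$REFERENCE" -d "$DBSNP" -o "$OUTDIR" -t "$THRESHOLD"
else
  echo "threshold must be a number between 0 and 1" >&2
fi
```

Printing the command first makes it easy to confirm the paths before committing to a long run on a cloud instance.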

Input Parameter File

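A tab-delimited text file containing the details of the trio samples, with one row per family member. As in the example below, each row gives:

  • Sample role (CHILD, FATHER or MOTHER)
  • Sample ID
  • Sample BAM location
  • Sample gender (1 - male, 2 - female)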

#Sample Sample_ID Sample_bam_location Sample_gender
CHILD    HG002  /home/ubuntu/GIAB_bams/HG002.GRCh38.60x.1.RG.bam  1
FATHER  HG003  /home/ubuntu/GIAB_bams/HG003.GRCh38.60x.1.RG.bam  1
MOTHER HG004  /home/ubuntu/GIAB_bams/HG004.GRCh38.60x.1.RG.bam  2

See template input file GIAB_trio_file.txt
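
The file can also be written and sanity-checked from the shell. This is a minimal sketch: the file name `my_trio_file.txt` and the `check_trio_file` helper are hypothetical, and the rules it enforces (four tab-separated fields, gender 1 or 2, exactly one CHILD, FATHER and MOTHER row) are inferred from the example above rather than taken from dv-trio itself.

```shell
TRIO_FILE=my_trio_file.txt   # hypothetical file name

# Write the header and one tab-separated row per family member.
{
  printf '#Sample\tSample_ID\tSample_bam_location\tSample_gender\n'
  printf 'CHILD\tHG002\t/home/ubuntu/GIAB_bams/HG002.GRCh38.60x.1.RG.bam\t1\n'
  printf 'FATHER\tHG003\t/home/ubuntu/GIAB_bams/HG003.GRCh38.60x.1.RG.bam\t1\n'
  printf 'MOTHER\tHG004\t/home/ubuntu/GIAB_bams/HG004.GRCh38.60x.1.RG.bam\t2\n'
} > "$TRIO_FILE"

# check_trio_file FILE: reports layout problems and fails if any are found.
check_trio_file() {
  awk -F '\t' '
    /^#/ { next }                                   # skip the header line
    NF != 4 { printf "line %d: expected 4 fields, found %d\n", NR, NF; bad = 1 }
    $4 != 1 && $4 != 2 { printf "line %d: gender must be 1 or 2\n", NR; bad = 1 }
    { seen[$1]++ }                                  # count rows per role
    END {
      if (seen["CHILD"] != 1 || seen["FATHER"] != 1 || seen["MOTHER"] != 1) {
        print "expected exactly one CHILD, FATHER and MOTHER row"; bad = 1
      }
      exit bad + 0
    }' "$1"
}

check_trio_file "$TRIO_FILE" && echo "$TRIO_FILE looks well-formed"
```

Using `printf` with explicit `\t` avoids the common problem of an editor silently replacing tabs with spaces.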

Cloud instance recommendation

We successfully ran dv-trio on a WGS trio with the following machine configuration:

Samples : Genome in a Bottle Consortium's AshkenazimTrio - HG002/HG003/HG004
Virtual Machine : AWS - Ubuntu Server 18.04 LTS (HVM), SSD Volume Type - 64-bit (x86)
Instance Type : Compute Optimized - C5.9xlarge - 36 vCPUs, 72GB Memory
Instance Storage : 1000GB (at least twice the total size of the BAM files)
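
The storage guideline can be estimated before provisioning an instance. This sketch assumes the rule of thumb above (twice the combined BAM size) and treats 1 GB as 10^9 bytes; the `required_storage_gb` function is a hypothetical helper, not part of dv-trio.

```shell
# required_storage_gb FILE...: prints twice the combined size of the given
# files in GB, rounded up to the next whole GB.
required_storage_gb() {
  local total size f
  total=0
  for f in "$@"; do
    size=$(wc -c < "$f")          # file size in bytes
    total=$((total + size))
  done
  echo $(( (2 * total + 999999999) / 1000000000 ))
}
```

For example, `required_storage_gb /home/ubuntu/GIAB_bams/*.bam` would print the minimum recommended volume size for the BAM files in that directory.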

Application Note Details

For details on how to replicate the results shown in the application note, see testing-README.md.

Citation

Eddie K K Ip, Clinton Hadinata, Joshua W K Ho, Eleni Giannoulatou
dv-trio: a family-based variant calling pipeline using DeepVariant
Bioinformatics, Volume 36, Issue 11, June 2020, Pages 3549–3551, https://doi.org/10.1093/bioinformatics/btaa116

References

  1. R. Poplin, P.-C. Chang, D. Alexander, S. Schwartz, T. Colthurst, A. Ku, D. Newburger, J. Dijamco, N. Nguyen, P. T. Afshar, et al. A universal SNP and small-indel variant caller using deep neural networks. Nature Biotechnology, 2018.

  2. M. A. DePristo, E. Banks, R. Poplin, K. V. Garimella, J. R. Maguire, C. Hartl, A. A. Philippakis, G. Del Angel, M. A. Rivas, M. Hanna, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics, 43(5):491–498, 2011.

  3. G. Peng, Y. Fan, and W. Wang. FamSeq: a variant calling program for family-based sequencing data using graphics processing units. PLoS Computational Biology, 10(10):e1003880, 2014.