The KFDRC PacBio HiFi WGS Variant Workflow performs read alignment, variant calling, and phasing. This CWL is a conversion of PacBio's HiFi-human-WGS-WDL sample_analysis.wdl.
- bcftools: 1.14
- DeepVariant: 1.5.0
- HiFiCNV: 0.1.7
- HiPhase: 1.0.0
- mosdepth: 0.2.9
- paraphase: 2.2.3
- pb-CpG-tools: 2.3.2
- pbmm2: 1.10.0
- pbsv: 2.9.0
- trgt: 0.5.0
- Universal Recommended:
  - bam: Unaligned sample BAM.
  - reference_fasta: Reference genome and index.
  - sample_id: Used to name outputs.
- HiFiCNV
  - exclude_bed: Compressed BED and index of regions to exclude from calling by HiFiCNV (recommended: cnv.excluded_regions.common_50.hg38.bed.gz).
  - expected_bed_female: BED of expected copy number for female karyotype for HiFiCNV (recommended: expected_cn.hg38.XX.bed).
  - expected_bed_male: BED of expected copy number for male karyotype for HiFiCNV (recommended: expected_cn.hg38.XY.bed).
- Tandem Repeat
  - Recommended:
    - reference_tandem_repeat_bed: Tandem repeat locations used by pbsv to normalize SV representation (recommended: human_GRCh38_no_alt_analysis_set.trf.bed).
    - trgt_tandem_repeat_bed: Tandem repeat sites to be genotyped by TRGT (recommended: human_GRCh38_no_alt_analysis_set.trgt.v0.3.4.bed).
  - Optional:
    - sex: ["MALE", "FEMALE", null]. Used to set the expected sex chromosome karyotype for TRGT and HiFiCNV. If the sex field is missing or null, sex is treated as unknown and the expected karyotype defaults to XX.
- DeepVariant
  - Recommended:
    - model: TensorFlow model checkpoint to use to evaluate candidate variant calls. Default is set to PACBIO for PacBio data.
  - Optional:
    - custom_model: Alternatively, a custom TensorFlow model checkpoint may be used to evaluate candidate variant calls. If not provided, the model trained by the DeepVariant team will be used.
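The sex input described above decides which expected copy-number BED applies to HiFiCNV. A minimal shell sketch of that selection follows, assuming the fallback to XX stated in the input description. The function name `expected_bed_for_sex` is hypothetical and not part of the workflow.

```shell
# Hypothetical helper: choose the expected copy-number BED from the sex input.
#   $1 = sex (MALE, FEMALE, or empty for missing/null)
#   $2 = path to the female (XX) BED
#   $3 = path to the male (XY) BED
# Assumption: anything other than MALE falls back to the XX BED,
# mirroring the "defaults to karyotype XX" behaviour described above.
expected_bed_for_sex() {
  case "$1" in
    MALE) printf '%s\n' "$3" ;;
    *)    printf '%s\n' "$2" ;;
  esac
}
```

For example, `expected_bed_for_sex MALE expected_cn.hg38.XX.bed expected_cn.hg38.XY.bed` prints the XY BED, while an empty first argument prints the XX BED.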
A reference data bundle for this pipeline is available on Zenodo.

    # download the reference data bundle (quote the URL and set the output name explicitly)
    wget -O wdl-humanwgs.v1.0.2.resource.tgz "https://zenodo.org/records/8415406/files/wdl-humanwgs.v1.0.2.resource.tgz?download=1"
    # extract the reference data bundle and rename it to PacBio_reference_bundle
    tar -xzf wdl-humanwgs.v1.0.2.resource.tgz && mv static_resources PacBio_reference_bundle
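Once the bundle is extracted, the recommended files named in the input descriptions can be located and written into an inputs file. The sketch below makes several assumptions: the directory layout inside the bundle is not known here, so files are searched by name; the workflow's input IDs are assumed to match the names listed above; and both function names are hypothetical.

```shell
# Hypothetical helper: find a file by exact name anywhere under a directory.
#   $1 = bundle directory, $2 = file name; prints the first match, if any.
find_in_bundle() {
  find "$1" -type f -name "$2" | head -n 1
}

# Hypothetical helper: print CWL-style File entries for the HiFiCNV and
# tandem repeat inputs, using the recommended file names from this README.
# Assumption: index files, where required, sit next to the files found here.
write_bundle_inputs() {
  bundle="$1"
  for pair in \
    "exclude_bed=cnv.excluded_regions.common_50.hg38.bed.gz" \
    "expected_bed_female=expected_cn.hg38.XX.bed" \
    "expected_bed_male=expected_cn.hg38.XY.bed" \
    "reference_tandem_repeat_bed=human_GRCh38_no_alt_analysis_set.trf.bed" \
    "trgt_tandem_repeat_bed=human_GRCh38_no_alt_analysis_set.trgt.v0.3.4.bed"
  do
    input_name="${pair%%=*}"   # text before the first "="
    file_name="${pair#*=}"     # text after the first "="
    path="$(find_in_bundle "$bundle" "$file_name")"
    if [ -z "$path" ]; then
      echo "missing in bundle: $file_name" >&2
      return 1
    fi
    printf '%s:\n  class: File\n  path: %s\n' "$input_name" "$path"
  done
}
```

A possible use is `write_bundle_inputs PacBio_reference_bundle > inputs.yaml`, after which bam, reference_fasta, and sample_id still need to be added by hand.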
- BAM stats and alignment
  - bam_stats: TSV of length and quality for each read.
  - read_length_summary: Read length distribution.
  - read_quality_summary: Read quality distribution.
  - aligned_bam: Aligned BAM.
  - svsig: Structural variant signatures.
- Small variants
  - deepvariant_vcf: Small variants (SNPs and INDELs < 50bp) VCF called by DeepVariant (with index).
  - deepvariant_gvcf: Small variants (SNPs and INDELs < 50bp) gVCF called by DeepVariant (with index).
  - deepvariant_vcf_stats: bcftools stats summary statistics for small variants.
  - deepvariant_roh_out: Output of bcftools roh using --AF-dflt 0.4.
  - deepvariant_roh_bed: Regions of homozygosity determined by bcftools roh using --AF-dflt 0.4.
- Structural variants
  - pbsv_call_vcf: Structural variants called by pbsv (with index).
- Phased variant calls and haplotagged alignments
  - phased_deepvariant_vcf: Small variants called by DeepVariant and phased by HiPhase (with index).
  - phased_pbsv_vcf: Structural variants called by pbsv and phased by HiPhase (with index).
  - phased_summary: Phasing summary TSV file.
  - hiphase_stats: Phase block summary statistics written by HiPhase.
  - hiphase_blocks: Phase block list written by HiPhase.
  - hiphase_haplotags: Per-read haplotag information, written by HiPhase.
  - hiphase_bam: Aligned (by pbmm2), haplotagged (by HiPhase) reads (with index).
  - haplotagged_bam_mosdepth_summary: mosdepth summary of median depths per chromosome.
  - haplotagged_bam_mosdepth_region_bed: mosdepth BED of median coverage depth per 500 bp window.
  - paraphase_output_json: Paraphase summary file.
  - paraphase_realigned_bam: Realigned BAM for selected medically relevant genes in segmental duplications (with index).
  - paraphase_vcfs: Phased variant calls for selected medically relevant genes in segmental duplications.
- Tandem repeat information
  - trgt_spanning_reads: Fragments of HiFi reads spanning loci genotyped by TRGT (with index).
  - trgt_repeat_vcf: Tandem repeat genotypes from TRGT (with index).
- Methylation
  - cpg_pileup_beds: 5mCpG site methylation probability pileups.
  - cpg_pileup_bigwigs: 5mCpG site methylation probability pileups.
- CNVs
  - hificnv_vcf: VCF output containing copy number variant calls for the sample from HiFiCNV.
  - hificnv_copynum_bedgraph: Copy number values calculated for each region.
  - hificnv_depth_bw: Bigwig file containing the depth measurements from HiFiCNV.
  - hificnv_maf_bw: Bigwig file containing the minor allele frequency measurements from DeepVariant, generated by HiFiCNV.
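The deepvariant_roh_out and deepvariant_roh_bed outputs listed under small variants both come from bcftools roh with --AF-dflt 0.4. The sketch below shows one way such a call and a conversion to BED-like rows could look. It is an illustration only and may not match the workflow's own steps: both function names are hypothetical, and the column layout of the RG records is an assumption stated in the comments.

```shell
# Hypothetical helper: run bcftools roh with the default allele frequency named above.
#   $1 = input VCF, $2 = output text file
run_roh() {
  bcftools roh --AF-dflt 0.4 -o "$2" "$1"
}

# Hypothetical helper: keep only RG (region) records from bcftools roh output
# read on stdin and print chromosome, start, end, and quality.
# Assumption: RG records are tab-separated as
#   RG, sample, chromosome, start, end, length, number of markers, quality
# Coordinates are copied as reported, with no shift to 0-based starts.
roh_to_bed() {
  awk -F '\t' -v OFS='\t' '$1 == "RG" { print $3, $4, $5, $8 }'
}
```

A possible use is `run_roh sample.vcf.gz sample.roh.txt && roh_to_bed < sample.roh.txt > sample.roh.bed`, where the file names are placeholders.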
We processed a 26.5 GB BAM file using the KFDRC PacBio HiFi WGS Variant Workflow with default settings on CAVATICA. Here are the details of the run:
- Run Time: 12 hours, 49 minutes
- Cost: $10.22
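For rough comparison with other runs, the figures above can be turned into per-unit rates. The sketch below only does the arithmetic on this single run; the function name `run_metrics` is hypothetical, and the resulting rates should not be assumed to scale linearly with input size.

```shell
# Hypothetical helper: derive simple rates from one benchmark run.
#   $1 = input size in GB, $2 = hours, $3 = minutes, $4 = cost in USD
run_metrics() {
  awk -v gb="$1" -v h="$2" -v m="$3" -v cost="$4" 'BEGIN {
    hours = h + m / 60
    printf "cost_per_gb=%.2f\n", cost / gb
    printf "gb_per_hour=%.2f\n", gb / hours
    printf "cost_per_hour=%.2f\n", cost / hours
  }'
}

# Figures reported above: 26.5 GB, 12 hours 49 minutes, $10.22
run_metrics 26.5 12 49 10.22
```

With the reported figures this works out to about $0.39 per GB, 2.07 GB per hour, and $0.80 per hour.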