
Merge pull request #2 from kids-first/jw-init
🎉   Initial commit
wongjessica93 authored Aug 22, 2024
2 parents ff3aa33 + 63a814f commit de37049
Showing 19 changed files with 1,282 additions and 16 deletions.
33 changes: 17 additions & 16 deletions README.md
@@ -1,22 +1,23 @@
-# KF Bixu Repository Template
+# KFDRC PacBio HiFi WGS Variant Workflow
 
-Use this template to bootstrap a new KF bixu repository
+<p align="center">
+  <img src="https://github.com/d3b-center/d3b-research-workflows/raw/master/doc/kfdrc-logo-sm.png">
+</p>
 
-### Badges
+This repository contains the PacBio HiFi WGS Variant Pipeline used by the Kids First Data Resource Center (DRC).
 
-Update the LICENSE badge to point to the new repo location on GitHub.
-Note that the LICENSE badge will fail to render correctly unless the repo has
-been set to **public**.
+## Workflow
 
-Add additional badges for CI, docs, and other integrations as needed within the
-`<p>` tag next to the LICENSE.
+The HiFi Human WGS Variant Workflow is designed to process PacBio HiFi sequencing data for WGS applications, including read alignment, variant calling, and phasing. This workflow has been converted to CWL from PacBio's [HiFi-human-WGS-WDL](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL) `sample_analysis` workflow.
 
-### Repo Description
+Workflow steps include:
+- Read alignment
+- Small variant calling
+- Structural variant calling
+- Phasing
+- Coverage analysis
+- CNV calling
 
-Update the repository's description with a short summary of the repository's
-intent.
-Include an appropriate emoji at the start of the summary.
-
-Add a handful of tags that summarize topics relating to the repository.
-If the repo has a documentation site or webpage, add it next to the repository
-description.
+See our documentation for more details:
+- [Documentation](./docs/SAMPLE_ANALYSIS_README.md)
+- [CWL Workflow](./workflows/sample_analysis.cwl)
118 changes: 118 additions & 0 deletions docs/SAMPLE_ANALYSIS_README.md
@@ -0,0 +1,118 @@
# KFDRC PacBio HiFi WGS Variant Workflow

<p align="center">
<img src="https://github.com/d3b-center/d3b-research-workflows/raw/master/doc/kfdrc-logo-sm.png">
</p>

The KFDRC PacBio HiFi WGS Variant Workflow performs read alignment, variant calling, and phasing. This CWL workflow is a conversion of PacBio's [HiFi-human-WGS-WDL](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL) `sample_analysis.wdl`.

## Relevant Software and Versions

- [bcftools](https://github.com/samtools/bcftools): `1.14`
- [DeepVariant](https://github.com/google/deepvariant): `1.5.0`
- [HiFiCNV](https://github.com/PacificBiosciences/HiFiCNV): `0.1.7`
- [HiPhase](https://github.com/PacificBiosciences/HiPhase): `1.0.0`
- [mosdepth](https://github.com/brentp/mosdepth): `0.2.9`
- [paraphase](https://github.com/PacificBiosciences/paraphase): `2.2.3`
- [pb-CpG-tools](https://github.com/PacificBiosciences/pb-CpG-tools): `2.3.2`
- [pbmm2](https://github.com/PacificBiosciences/pbmm2): `1.10.0`
- [pbsv](https://github.com/PacificBiosciences/pbsv): `2.9.0`
- [trgt](https://github.com/PacificBiosciences/trgt): `0.5.0`

## Inputs

- Universal
  - Recommended:
    - `bam`: Unaligned sample BAM.
    - `reference_fasta`: Reference genome and index.
    - `sample_id`: Used to name outputs.

- HiFiCNV
  - `exclude_bed`: Compressed BED and index of regions to exclude from calling by HiFiCNV (recommended: [`cnv.excluded_regions.common_50.hg38.bed.gz`](https://github.com/PacificBiosciences/HiFiCNV/blob/main/docs/aux_data.md)).
  - `expected_bed_female`: BED of expected copy number for a female karyotype for HiFiCNV (recommended: `expected_cn.hg38.XX.bed`).
  - `expected_bed_male`: BED of expected copy number for a male karyotype for HiFiCNV (recommended: `expected_cn.hg38.XY.bed`).

- Tandem Repeat
  - Recommended:
    - `reference_tandem_repeat_bed`: Tandem repeat locations used by pbsv to normalize SV representation (recommended: `human_GRCh38_no_alt_analysis_set.trf.bed`).
    - `trgt_tandem_repeat_bed`: Tandem repeat sites to be genotyped by TRGT (recommended: `human_GRCh38_no_alt_analysis_set.trgt.v0.3.4.bed`).
  - Optional:
    - `sex`: One of `["MALE", "FEMALE", null]`. Sets the expected sex chromosome karyotype for TRGT and HiFiCNV. If missing or null, sex is treated as unknown and the karyotype defaults to XX.

- DeepVariant
  - Recommended:
    - `model`: TensorFlow model checkpoint to use to evaluate candidate variant calls. Defaults to `PACBIO` for PacBio data.
  - Optional:
    - `custom_model`: A custom TensorFlow model checkpoint may be used to evaluate candidate variant calls. If not provided, the `model` trained by the DeepVariant team will be used.
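
For orientation, here is a minimal, hypothetical way to assemble these inputs and launch the workflow with a generic CWL runner. The paths and the `sample_01` ID are placeholders, and the exact input ids should be confirmed against `workflows/sample_analysis.cwl`.

```bash
# Sketch only: write a minimal inputs file, then run with cwltool.
cat > inputs.yml << 'EOF'
bam: {class: File, path: /data/sample_01.hifi_reads.bam}   # unaligned HiFi reads
reference_fasta: {class: File, path: /data/GRCh38.fasta}   # index discovered as a secondary file
sample_id: sample_01
sex: MALE                                                  # optional; defaults to unknown (XX karyotype)
EOF

cwltool workflows/sample_analysis.cwl inputs.yml
```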


A reference data bundle for this pipeline can be found [here](https://zenodo.org/records/8415406).
```bash
# download the reference data bundle (-O keeps the ?download=1 query string out of the local filename)
wget -O wdl-humanwgs.v1.0.2.resource.tgz "https://zenodo.org/records/8415406/files/wdl-humanwgs.v1.0.2.resource.tgz?download=1"

# extract the reference data bundle and rename as dataset
tar -xzf wdl-humanwgs.v1.0.2.resource.tgz && mv static_resources PacBio_reference_bundle
```

## Outputs

- BAM stats and alignment
  - `bam_stats`: TSV of length and quality for each read.
  - `read_length_summary`: Read length distribution.
  - `read_quality_summary`: Read quality distribution.
  - `aligned_bam`: Aligned BAM.
  - `svsig`: Structural variant signatures.

- Small variants
  - `deepvariant_vcf`: Small variant (SNPs and INDELs < 50 bp) VCF called by DeepVariant (with index).
  - `deepvariant_gvcf`: Small variant (SNPs and INDELs < 50 bp) gVCF called by DeepVariant (with index).
  - `deepvariant_vcf_stats`: `bcftools stats` summary statistics for small variants.
  - `deepvariant_roh_out`: Output of `bcftools roh` using `--AF-dflt 0.4`.
  - `deepvariant_roh_bed`: Regions of homozygosity determined by `bcftools roh` using `--AF-dflt 0.4`.

- Structural variants
  - `pbsv_call_vcf`: Structural variants called by pbsv (with index).

- Phased variant calls and haplotagged alignments
  - `phased_deepvariant_vcf`: Small variants called by DeepVariant and phased by HiPhase (with index).
  - `phased_pbsv_vcf`: Structural variants called by pbsv and phased by HiPhase (with index).
  - `phased_summary`: Phasing summary TSV file.
  - `hiphase_stats`: Phase block summary statistics written by [HiPhase](https://github.com/PacificBiosciences/HiPhase/blob/main/docs/user_guide.md#chromosome-summary-file---summary-file).
  - `hiphase_blocks`: Phase block list written by [HiPhase](https://github.com/PacificBiosciences/HiPhase/blob/main/docs/user_guide.md#phase-block-file---blocks-file).
  - `hiphase_haplotags`: Per-read haplotag information written by [HiPhase](https://github.com/PacificBiosciences/HiPhase/blob/main/docs/user_guide.md#haplotag-file---haplotag-file).
  - `hiphase_bam`: Reads aligned by pbmm2 and haplotagged by HiPhase (with index).
  - `haplotagged_bam_mosdepth_summary`: mosdepth summary of median depths per chromosome.
  - `haplotagged_bam_mosdepth_region_bed`: mosdepth BED of median coverage depth per 500 bp window.
  - `paraphase_output_json`: Paraphase summary file.
  - `paraphase_realigned_bam`: Realigned BAM for selected medically relevant genes in segmental duplications (with index).
  - `paraphase_vcfs`: Phased variant calls for selected medically relevant genes in segmental duplications.

- Tandem repeat information
  - `trgt_spanning_reads`: Fragments of HiFi reads spanning loci genotyped by TRGT (with index).
  - `trgt_repeat_vcf`: Tandem repeat genotypes from TRGT (with index).

- Methylation
  - `cpg_pileup_beds`: 5mCpG site methylation probability pileups in BED format.
  - `cpg_pileup_bigwigs`: 5mCpG site methylation probability pileups in bigWig format.

- CNVs
  - `hificnv_vcf`: VCF of copy number variant calls for the sample from HiFiCNV.
  - `hificnv_copynum_bedgraph`: Copy number values calculated for each region.
  - `hificnv_depth_bw`: Bigwig file containing the depth measurements from HiFiCNV.
  - `hificnv_maf_bw`: Bigwig file containing the minor allele frequency measurements from DeepVariant, generated by HiFiCNV.
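
As a quick post-run sanity check, the sketch below inspects a few of these outputs. The VCF file names are illustrative placeholders (only `sample_01.bcftools_roh.bed` follows the pipeline's fixed `<sample_id>.bcftools_roh.bed` convention); substitute the names your run actually produced.

```bash
# Hypothetical spot checks on a completed run for sample_id=sample_01.
bcftools stats sample_01.deepvariant.phased.vcf.gz | head -20   # small-variant summary
zcat sample_01.pbsv.phased.vcf.gz | grep -vc '^#'               # count structural variant records
head sample_01.bcftools_roh.bed                                 # first regions of homozygosity
```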


## Estimated Run Times

We processed a 26.5 GB BAM file using the KFDRC PacBio HiFi WGS Variant Workflow with default settings on CAVATICA. Here are the details of the run:
- Run Time: 12 hours, 49 minutes
- Cost: $10.22


## Other Resources

- [HiFi-human-WGS-WDL](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL)
- [sample_analysis.wdl](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/blob/main/workflows/sample_analysis/sample_analysis.wdl)
- [Dockerfiles](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/tree/main?tab=readme-ov-file#tool-versions-and-docker-images)
- [GRCh38 reference data bundle](https://zenodo.org/records/8415406)
78 changes: 78 additions & 0 deletions subworkflows/deepvariant.cwl
@@ -0,0 +1,78 @@
cwlVersion: v1.2
class: Workflow
id: deepvariant
doc: |
  Call variants using DeepVariant.
  https://github.com/PacificBiosciences/wdl-common/blob/fef058b879d04c15c3da2626b320afdd8ace6c2e/wdl/workflows/deepvariant/deepvariant.wdl
requirements:
- class: MultipleInputFeatureRequirement
- class: SubworkflowFeatureRequirement
- class: InlineJavascriptRequirement
- class: StepInputExpressionRequirement

inputs:
  reference: { type: 'File', secondaryFiles: [{pattern: ".fai", required: false}, {pattern: ".gzi", required: false}], doc: "Genome reference FASTA to use. Must have an associated FAI index as well. Supports text or gzipped references. Should match the reference used to align the BAM file provided to the 'reads' input." }
  reads: { type: 'File', secondaryFiles: [{pattern: "^.bai", required: false}, {pattern: ".bai", required: false}, {pattern: "^.crai", required: false}, {pattern: ".crai", required: false}], doc: "Aligned, sorted, indexed BAM/CRAM file containing the reads we want to call. Should be aligned to a reference genome compatible with the FASTA provided on the 'reference' input." }
  num_shards: { type: 'int?', default: 32, doc: "Number of shards to use for DeepVariant parallelization." }
  sample_name: { type: 'string' }
  # make_examples
  make_examples_cpu: { type: 'int?', default: 36, doc: "CPUs to allocate to this task." }
  make_examples_ram: { type: 'int?', default: 40, doc: "GB of RAM to allocate to this task." }
  # call_variants
  custom_model: { type: 'File?', secondaryFiles: [{pattern: "^.index", required: true}, {pattern: "^.meta", required: true}], doc: "Custom TensorFlow model checkpoint to use to evaluate candidate variant calls. If not provided, the model trained by the DeepVariant team will be used." }
  model:
    type:
    - type: enum
      symbols: ["WGS", "WES", "PACBIO", "HYBRID_PACBIO_ILLUMINA", "ONT_R104"]
    doc: "TensorFlow model checkpoint to use to evaluate candidate variant calls."
    default: "PACBIO"
  call_variants_cpu: { type: 'int?', default: 36, doc: "CPUs to allocate to this task." }
  call_variants_ram: { type: 'int?', default: 60, doc: "GB of RAM to allocate to this task." }
  # postprocess_variants
  qual_filter: { type: 'float?', doc: "Any variant with QUAL < qual_filter will be filtered in the VCF file." }
  cnn_homref_call_min_gq: { type: 'float?', doc: "All CNN RefCalls whose GQ is less than this value will have ./. genotype instead of 0/0." }
  multi_allelic_qual_filter: { type: 'float?', doc: "The QUAL value below which to filter multi-allelic variants." }
  postprocess_variants_cpu: { type: 'int?', default: 36, doc: "CPUs to allocate to this task." }
  postprocess_variants_ram: { type: 'int?', default: 40, doc: "GB of RAM to allocate to this task." }

outputs:
  vcf: { type: 'File?', outputSource: postprocess_variants/output_vcf }
  gvcf: { type: 'File?', outputSource: postprocess_variants/output_gvcf }

steps:
  make_examples:
    run: ../tools/deepvariant_make_examples.cwl
    in:
      reference: reference
      reads: reads
      cpu: make_examples_cpu
      n_shards: num_shards
      ram: make_examples_ram
    out: [example_tfrecord_tar, nonvariant_site_tfrecord_tar]

  call_variants:
    run: ../tools/deepvariant_call_variants.cwl
    in:
      example_tfrecord_tar: make_examples/example_tfrecord_tar
      custom_model: custom_model
      model: model
      cpu: call_variants_cpu
      n_shards: num_shards
      ram: call_variants_ram
    out: [variants]

  postprocess_variants:
    run: ../tools/deepvariant_postprocess_variants.cwl
    in:
      variants: call_variants/variants
      sample_name: sample_name
      reference: reference
      qual_filter: qual_filter
      cnn_homref_call_min_gq: cnn_homref_call_min_gq
      multi_allelic_qual_filter: multi_allelic_qual_filter
      nonvariant_site_tfrecord: make_examples/nonvariant_site_tfrecord_tar
      cpu: postprocess_variants_cpu
      n_shards: num_shards
      ram: postprocess_variants_ram
    out: [output_vcf, output_gvcf]
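
For testing, the subworkflow can also be run on its own; below is a hedged sketch using `cwltool`, which maps workflow input ids to command-line flags. Paths are placeholders, and the `PACBIO` `model` default applies since neither `model` nor `custom_model` is overridden.

```bash
# Illustrative standalone run of the DeepVariant subworkflow.
cwltool subworkflows/deepvariant.cwl \
  --reference /data/GRCh38.fasta \
  --reads /data/sample_01.aligned.bam \
  --sample_name sample_01 \
  --num_shards 32
```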

51 changes: 51 additions & 0 deletions tools/bcftools.cwl
@@ -0,0 +1,51 @@
class: CommandLineTool
cwlVersion: v1.2
id: bcftools
doc: |
  BCFtools is a set of utilities that manipulate variant calls in the Variant Call Format (VCF) and
  its binary counterpart, BCF. All commands work transparently with both VCFs and BCFs, whether
  uncompressed or BGZF-compressed.

  Original PacBio WDL: https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/blob/main/workflows/sample_analysis/sample_analysis.wdl
requirements:
- class: ShellCommandRequirement
- class: InlineJavascriptRequirement
- class: ResourceRequirement
  coresMin: $(inputs.threads)
  ramMin: $(inputs.ram*1000)
- class: DockerRequirement
  dockerPull: quay.io/pacbio/bcftools@sha256:46720a7ab5feba5be06d5269454a6282deec13060e296f0bc441749f6f26fdec
baseCommand: []
arguments:
- position: 0
  shellQuote: false
  valueFrom: |
    bcftools stats \
      --threads ${ return inputs.threads - 1 } \
      --apply-filters PASS --samples $(inputs.sample_id) \
      --fasta-ref $(inputs.reference.path) \
      $(inputs.vcf.path) \
      > $(inputs.sample_id).deepvariant.stats.txt

    bcftools roh \
      --threads ${ return inputs.threads - 1 } \
      --AF-dflt 0.4 \
      $(inputs.vcf.path) \
      > $(inputs.sample_id).bcftools_roh.out

    printf '#chr\tstart\tend\tqual\n' > $(inputs.sample_id).bcftools_roh.bed
    awk -v OFS='\t' '$1=="RG" { print $3, $4, $5, $8 }' \
      $(inputs.sample_id).bcftools_roh.out \
      >> $(inputs.sample_id).bcftools_roh.bed

inputs:
  vcf: { type: 'File' }
  reference: { type: 'File' }
  sample_id: { type: 'string' }
  threads: { type: 'int?', default: 2, doc: "Number of threads to allocate to this task." }
  ram: { type: 'int?', default: 4, doc: "GB of RAM to allocate to this task." }

outputs:
  vcf_stats: { type: 'File', outputBinding: { glob: '*.txt' } }
  roh_out: { type: 'File', outputBinding: { glob: '*.out' } }
  roh_bed: { type: 'File', outputBinding: { glob: '*.bed' } }
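
To make the CWL parameter references concrete: with the defaults (`threads: 2`) and `sample_id: sample_01`, the `valueFrom` block above expands to roughly the following shell script (a sketch; the VCF and reference paths are placeholders).

```bash
bcftools stats \
  --threads 1 \
  --apply-filters PASS --samples sample_01 \
  --fasta-ref GRCh38.fasta \
  sample_01.vcf.gz \
  > sample_01.deepvariant.stats.txt

bcftools roh \
  --threads 1 \
  --AF-dflt 0.4 \
  sample_01.vcf.gz \
  > sample_01.bcftools_roh.out

printf '#chr\tstart\tend\tqual\n' > sample_01.bcftools_roh.bed
awk -v OFS='\t' '$1=="RG" { print $3, $4, $5, $8 }' \
  sample_01.bcftools_roh.out \
  >> sample_01.bcftools_roh.bed
```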
32 changes: 32 additions & 0 deletions tools/coverage_dropouts.cwl
@@ -0,0 +1,32 @@
class: CommandLineTool
cwlVersion: v1.2
id: coverage_dropouts
doc: |
  Original PacBio WDL: https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/blob/main/workflows/sample_analysis/sample_analysis.wdl
requirements:
- class: ShellCommandRequirement
- class: InlineJavascriptRequirement
- class: ResourceRequirement
  coresMin: $(inputs.threads)
  ramMin: $(inputs.ram*1000)
- class: DockerRequirement
  dockerPull: quay.io/pacbio/trgt@sha256:8c9f236eb3422e79d7843ffd59e1cbd9b76774525f20d88cd68ca64eb63054eb
baseCommand: []
arguments:
- position: 0
  shellQuote: false
  valueFrom: |
    check_trgt_coverage.py \
      $(inputs.tandem_repeat_bed.path) \
      $(inputs.bam.path) \
      > $(inputs.sample_id).trgt.dropouts.txt

inputs:
  bam: { type: 'File', secondaryFiles: [{pattern: ".bai", required: true}] }
  tandem_repeat_bed: { type: 'File' }
  sample_id: { type: 'string?' }
  threads: { type: 'int?', default: 2, doc: "Number of threads to allocate to this task." }
  ram: { type: 'int?', default: 4, doc: "GB of RAM to allocate to this task." }

outputs:
  trgt_dropouts: { type: 'File', outputBinding: { glob: '*.txt' } }
36 changes: 36 additions & 0 deletions tools/cpg_pileup.cwl
@@ -0,0 +1,36 @@
class: CommandLineTool
cwlVersion: v1.2
id: cpg_pileup
doc: |
  Original PacBio WDL: https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/blob/main/workflows/sample_analysis/sample_analysis.wdl
requirements:
- class: ShellCommandRequirement
- class: InlineJavascriptRequirement
- class: ResourceRequirement
  coresMin: $(inputs.threads)
  ramMin: $(inputs.threads*4000) # Uses ~4 GB memory / thread
- class: DockerRequirement
  dockerPull: quay.io/pacbio/pb-cpg-tools@sha256:b95ff1c53bb16e53b8c24f0feaf625a4663973d80862518578437f44385f509b
baseCommand: []
arguments:
- position: 0
  shellQuote: false
  valueFrom: |
    aligned_bam_to_cpg_scores \
      --threads $(inputs.threads) \
      --bam $(inputs.bam.path) \
      --ref $(inputs.reference.path) \
      --output-prefix $(inputs.sample_id) \
      --min-mapq 1 \
      --min-coverage 10 \
      --model "$PILEUP_MODEL_DIR"/pileup_calling_model.v1.tflite

inputs:
  bam: { type: 'File', secondaryFiles: [{pattern: ".bai", required: true}] }
  reference: { type: 'File', secondaryFiles: [{pattern: ".fai", required: true}] }
  sample_id: { type: 'string' }
  threads: { type: 'int?', default: 12, doc: "Number of threads to allocate to this task." }

outputs:
  pileup_beds: { type: 'File[]', outputBinding: { glob: '*.bed' } }
  pileup_bigwigs: { type: 'File[]', outputBinding: { glob: '*.bw' } }