
Merge pull request #2 from kids-first/jw-init
🎉   Initial commit
wongjessica93 authored Aug 22, 2024
2 parents ff3aa33 + 63a814f commit de37049
Showing 19 changed files with 1,282 additions and 16 deletions.
33 changes: 17 additions & 16 deletions README.md
@@ -1,22 +1,23 @@
-# KF Bixu Repository Template
+# KFDRC PacBio HiFi WGS Variant Workflow
 
-Use this template to bootstrap a new KF bixu repository
+<p align="center">
+  <img src="https://github.com/d3b-center/d3b-research-workflows/raw/master/doc/kfdrc-logo-sm.png">
+</p>
 
-### Badges
+This repository contains the PacBio HiFi WGS Variant Pipeline used by the Kids First Data Resource Center (DRC).
 
-Update the LICENSE badge to point to the new repo location on GitHub.
-Note that the LICENSE badge will fail to render correctly unless the repo has
-been set to **public**.
+## Workflow
 
-Add additional badges for CI, docs, and other integrations as needed within the
-`<p>` tag next to the LICENSE.
+The HiFi Human WGS Variant Workflow is designed to process PacBio HiFi sequencing data for WGS applications, including read alignment, variant calling, and phasing. This workflow has been converted to CWL from PacBio's [HiFi-human-WGS-WDL](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL) `sample_analysis` workflow.
 
-### Repo Description
+Workflow steps include:
+- Read alignment
+- Small variant calling
+- Structural variant calling
+- Phasing
+- Coverage analysis
+- CNV calling
 
-Update the repository's description with a short summary of the repository's
-intent.
-Include an appropriate emoji at the start of the summary.
-
-Add a handful of tags that summarize topics relating to the repository.
-If the repo has a documentation site or webpage, add it next to the repository
-description.
+See our documentation for more details:
+- [Documentation](./docs/SAMPLE_ANALYSIS_README.md)
+- [CWL Workflow](./workflows/sample_analysis.cwl)
118 changes: 118 additions & 0 deletions docs/SAMPLE_ANALYSIS_README.md
@@ -0,0 +1,118 @@
# KFDRC PacBio HiFi WGS Variant Workflow

<p align="center">
<img src="https://github.com/d3b-center/d3b-research-workflows/raw/master/doc/kfdrc-logo-sm.png">
</p>

The KFDRC PacBio HiFi WGS Variant Workflow performs read alignment, variant calling, and phasing. This CWL workflow is a conversion of PacBio's [HiFi-human-WGS-WDL](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL) `sample_analysis.wdl`.

## Relevant Software and Versions

- [bcftools](https://github.com/samtools/bcftools): `1.14`
- [DeepVariant](https://github.com/google/deepvariant): `1.5.0`
- [HiFiCNV](https://github.com/PacificBiosciences/HiFiCNV): `0.1.7`
- [HiPhase](https://github.com/PacificBiosciences/HiPhase): `1.0.0`
- [mosdepth](https://github.com/brentp/mosdepth): `0.2.9`
- [paraphase](https://github.com/PacificBiosciences/paraphase): `2.2.3`
- [pb-CpG-tools](https://github.com/PacificBiosciences/pb-CpG-tools): `2.3.2`
- [pbmm2](https://github.com/PacificBiosciences/pbmm2): `1.10.0`
- [pbsv](https://github.com/PacificBiosciences/pbsv): `2.9.0`
- [trgt](https://github.com/PacificBiosciences/trgt): `0.5.0`

## Inputs

- Universal
  - Recommended:
    - `bam`: Unaligned sample BAM.
    - `reference_fasta`: Reference genome and index.
    - `sample_id`: Used to name outputs.

- HiFiCNV
  - `exclude_bed`: Compressed BED and index of regions to exclude from calling by HiFiCNV (recommended: [`cnv.excluded_regions.common_50.hg38.bed.gz`](https://github.com/PacificBiosciences/HiFiCNV/blob/main/docs/aux_data.md)).
  - `expected_bed_female`: BED of expected copy number for a female karyotype for HiFiCNV (recommended: `expected_cn.hg38.XX.bed`).
  - `expected_bed_male`: BED of expected copy number for a male karyotype for HiFiCNV (recommended: `expected_cn.hg38.XY.bed`).

- Tandem Repeat
  - Recommended:
    - `reference_tandem_repeat_bed`: Tandem repeat locations used by pbsv to normalize SV representation (recommended: `human_GRCh38_no_alt_analysis_set.trf.bed`).
    - `trgt_tandem_repeat_bed`: Tandem repeat sites to be genotyped by TRGT (recommended: `human_GRCh38_no_alt_analysis_set.trgt.v0.3.4.bed`).
  - Optional:
    - `sex`: One of `["MALE", "FEMALE", null]`. Sets the expected sex chromosome karyotype for TRGT and HiFiCNV. If missing or null, sex is treated as unknown and the karyotype defaults to XX.

- DeepVariant
  - Recommended:
    - `model`: TensorFlow model checkpoint to use to evaluate candidate variant calls. Defaults to `PACBIO` for PacBio data.
  - Optional:
    - `custom_model`: A custom TensorFlow model checkpoint may be used to evaluate candidate variant calls. If not provided, the `model` trained by the DeepVariant team will be used.
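
For orientation, here is a minimal, hypothetical way to assemble these inputs and launch the workflow with a generic CWL runner. The paths and the `sample_01` ID are placeholders, and the exact input ids should be confirmed against `workflows/sample_analysis.cwl`.

```bash
# Sketch only: write a minimal inputs file, then run with cwltool.
cat > inputs.yml << 'EOF'
bam: {class: File, path: /data/sample_01.hifi_reads.bam}   # unaligned HiFi reads
reference_fasta: {class: File, path: /data/GRCh38.fasta}   # index discovered as a secondary file
sample_id: sample_01
sex: MALE                                                  # optional; defaults to unknown (XX karyotype)
EOF

cwltool workflows/sample_analysis.cwl inputs.yml
```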


A reference data bundle for this pipeline can be found [here](https://zenodo.org/records/8415406).
```bash
# download the reference data bundle (-O keeps the ?download=1 query string out of the local filename)
wget -O wdl-humanwgs.v1.0.2.resource.tgz "https://zenodo.org/records/8415406/files/wdl-humanwgs.v1.0.2.resource.tgz?download=1"

# extract the reference data bundle and rename as dataset
tar -xzf wdl-humanwgs.v1.0.2.resource.tgz && mv static_resources PacBio_reference_bundle
```

## Outputs

- BAM stats and alignment
  - `bam_stats`: TSV of length and quality for each read.
  - `read_length_summary`: Read length distribution.
  - `read_quality_summary`: Read quality distribution.
  - `aligned_bam`: Aligned BAM.
  - `svsig`: Structural variant signatures.

- Small variants
  - `deepvariant_vcf`: Small variant (SNPs and INDELs < 50 bp) VCF called by DeepVariant (with index).
  - `deepvariant_gvcf`: Small variant (SNPs and INDELs < 50 bp) gVCF called by DeepVariant (with index).
  - `deepvariant_vcf_stats`: `bcftools stats` summary statistics for small variants.
  - `deepvariant_roh_out`: Output of `bcftools roh` using `--AF-dflt 0.4`.
  - `deepvariant_roh_bed`: Regions of homozygosity determined by `bcftools roh` using `--AF-dflt 0.4`.

- Structural variants
  - `pbsv_call_vcf`: Structural variants called by pbsv (with index).

- Phased variant calls and haplotagged alignments
  - `phased_deepvariant_vcf`: Small variants called by DeepVariant and phased by HiPhase (with index).
  - `phased_pbsv_vcf`: Structural variants called by pbsv and phased by HiPhase (with index).
  - `phased_summary`: Phasing summary TSV file.
  - `hiphase_stats`: Phase block summary statistics written by [HiPhase](https://github.com/PacificBiosciences/HiPhase/blob/main/docs/user_guide.md#chromosome-summary-file---summary-file).
  - `hiphase_blocks`: Phase block list written by [HiPhase](https://github.com/PacificBiosciences/HiPhase/blob/main/docs/user_guide.md#phase-block-file---blocks-file).
  - `hiphase_haplotags`: Per-read haplotag information written by [HiPhase](https://github.com/PacificBiosciences/HiPhase/blob/main/docs/user_guide.md#haplotag-file---haplotag-file).
  - `hiphase_bam`: Reads aligned by pbmm2 and haplotagged by HiPhase (with index).
  - `haplotagged_bam_mosdepth_summary`: mosdepth summary of median depths per chromosome.
  - `haplotagged_bam_mosdepth_region_bed`: mosdepth BED of median coverage depth per 500 bp window.
  - `paraphase_output_json`: Paraphase summary file.
  - `paraphase_realigned_bam`: Realigned BAM for selected medically relevant genes in segmental duplications (with index).
  - `paraphase_vcfs`: Phased variant calls for selected medically relevant genes in segmental duplications.

- Tandem repeat information
  - `trgt_spanning_reads`: Fragments of HiFi reads spanning loci genotyped by TRGT (with index).
  - `trgt_repeat_vcf`: Tandem repeat genotypes from TRGT (with index).

- Methylation
  - `cpg_pileup_beds`: 5mCpG site methylation probability pileups in BED format.
  - `cpg_pileup_bigwigs`: 5mCpG site methylation probability pileups in bigWig format.

- CNVs
  - `hificnv_vcf`: VCF of copy number variant calls for the sample from HiFiCNV.
  - `hificnv_copynum_bedgraph`: Copy number values calculated for each region.
  - `hificnv_depth_bw`: Bigwig file containing the depth measurements from HiFiCNV.
  - `hificnv_maf_bw`: Bigwig file containing the minor allele frequency measurements from DeepVariant, generated by HiFiCNV.
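
As a quick post-run sanity check, the sketch below inspects a few of these outputs. The VCF file names are illustrative placeholders (only `sample_01.bcftools_roh.bed` follows the pipeline's fixed `<sample_id>.bcftools_roh.bed` convention); substitute the names your run actually produced.

```bash
# Hypothetical spot checks on a completed run for sample_id=sample_01.
bcftools stats sample_01.deepvariant.phased.vcf.gz | head -20   # small-variant summary
zcat sample_01.pbsv.phased.vcf.gz | grep -vc '^#'               # count structural variant records
head sample_01.bcftools_roh.bed                                 # first regions of homozygosity
```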


## Estimated Run Times

We processed a 26.5 GB BAM file using the KFDRC PacBio HiFi WGS Variant Workflow with default settings on CAVATICA. Here are the details of the run:
- Run Time: 12 hours, 49 minutes
- Cost: $10.22


## Other Resources

- [HiFi-human-WGS-WDL](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL)
- [sample_analysis.wdl](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/blob/main/workflows/sample_analysis/sample_analysis.wdl)
- [Dockerfiles](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/tree/main?tab=readme-ov-file#tool-versions-and-docker-images)
- [GRCh38 reference data bundle](https://zenodo.org/records/8415406)
78 changes: 78 additions & 0 deletions subworkflows/deepvariant.cwl
@@ -0,0 +1,78 @@
cwlVersion: v1.2
class: Workflow
id: deepvariant
doc: |
  Call variants using DeepVariant.
  https://github.com/PacificBiosciences/wdl-common/blob/fef058b879d04c15c3da2626b320afdd8ace6c2e/wdl/workflows/deepvariant/deepvariant.wdl
requirements:
- class: MultipleInputFeatureRequirement
- class: SubworkflowFeatureRequirement
- class: InlineJavascriptRequirement
- class: StepInputExpressionRequirement

inputs:
  reference: { type: 'File', secondaryFiles: [{pattern: ".fai", required: false}, {pattern: ".gzi", required: false}], doc: "Genome reference FASTA to use. Must have an associated FAI index as well. Supports text or gzipped references. Should match the reference used to align the BAM file provided to the 'reads' input." }
  reads: { type: 'File', secondaryFiles: [{pattern: "^.bai", required: false}, {pattern: ".bai", required: false}, {pattern: "^.crai", required: false}, {pattern: ".crai", required: false}], doc: "Aligned, sorted, indexed BAM/CRAM file containing the reads we want to call. Should be aligned to a reference genome compatible with the FASTA provided on the 'reference' input." }
  num_shards: { type: 'int?', default: 32, doc: "Number of shards to use for DeepVariant parallelization." }
  sample_name: { type: 'string' }
  # make_examples
  make_examples_cpu: { type: 'int?', default: 36, doc: "CPUs to allocate to this task." }
  make_examples_ram: { type: 'int?', default: 40, doc: "GB of RAM to allocate to this task." }
  # call_variants
  custom_model: { type: 'File?', secondaryFiles: [{pattern: "^.index", required: true}, {pattern: "^.meta", required: true}], doc: "Custom TensorFlow model checkpoint to use to evaluate candidate variant calls. If not provided, the model trained by the DeepVariant team will be used." }
  model:
    type:
    - type: enum
      symbols: ["WGS", "WES", "PACBIO", "HYBRID_PACBIO_ILLUMINA", "ONT_R104"]
    doc: "TensorFlow model checkpoint to use to evaluate candidate variant calls."
    default: "PACBIO"
  call_variants_cpu: { type: 'int?', default: 36, doc: "CPUs to allocate to this task." }
  call_variants_ram: { type: 'int?', default: 60, doc: "GB of RAM to allocate to this task." }
  # postprocess_variants
  qual_filter: { type: 'float?', doc: "Any variant with QUAL < qual_filter will be filtered in the VCF file." }
  cnn_homref_call_min_gq: { type: 'float?', doc: "All CNN RefCalls whose GQ is less than this value will have ./. genotype instead of 0/0." }
  multi_allelic_qual_filter: { type: 'float?', doc: "The QUAL value below which to filter multi-allelic variants." }
  postprocess_variants_cpu: { type: 'int?', default: 36, doc: "CPUs to allocate to this task." }
  postprocess_variants_ram: { type: 'int?', default: 40, doc: "GB of RAM to allocate to this task." }

outputs:
  vcf: { type: 'File?', outputSource: postprocess_variants/output_vcf }
  gvcf: { type: 'File?', outputSource: postprocess_variants/output_gvcf }

steps:
  make_examples:
    run: ../tools/deepvariant_make_examples.cwl
    in:
      reference: reference
      reads: reads
      cpu: make_examples_cpu
      n_shards: num_shards
      ram: make_examples_ram
    out: [example_tfrecord_tar, nonvariant_site_tfrecord_tar]

  call_variants:
    run: ../tools/deepvariant_call_variants.cwl
    in:
      example_tfrecord_tar: make_examples/example_tfrecord_tar
      custom_model: custom_model
      model: model
      cpu: call_variants_cpu
      n_shards: num_shards
      ram: call_variants_ram
    out: [variants]

  postprocess_variants:
    run: ../tools/deepvariant_postprocess_variants.cwl
    in:
      variants: call_variants/variants
      sample_name: sample_name
      reference: reference
      qual_filter: qual_filter
      cnn_homref_call_min_gq: cnn_homref_call_min_gq
      multi_allelic_qual_filter: multi_allelic_qual_filter
      nonvariant_site_tfrecord: make_examples/nonvariant_site_tfrecord_tar
      cpu: postprocess_variants_cpu
      n_shards: num_shards
      ram: postprocess_variants_ram
    out: [output_vcf, output_gvcf]
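
For testing, the subworkflow can also be run on its own; below is a hedged sketch using `cwltool`, which maps workflow input ids to command-line flags. Paths are placeholders, and the `PACBIO` `model` default applies since neither `model` nor `custom_model` is overridden.

```bash
# Illustrative standalone run of the DeepVariant subworkflow.
cwltool subworkflows/deepvariant.cwl \
  --reference /data/GRCh38.fasta \
  --reads /data/sample_01.aligned.bam \
  --sample_name sample_01 \
  --num_shards 32
```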

51 changes: 51 additions & 0 deletions tools/bcftools.cwl
@@ -0,0 +1,51 @@
class: CommandLineTool
cwlVersion: v1.2
id: bcftools
doc: |
  BCFtools is a set of utilities that manipulate variant calls in the Variant Call Format (VCF) and
  its binary counterpart, BCF. All commands work transparently with both VCFs and BCFs, whether
  uncompressed or BGZF-compressed.

  Original PacBio WDL: https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/blob/main/workflows/sample_analysis/sample_analysis.wdl
requirements:
- class: ShellCommandRequirement
- class: InlineJavascriptRequirement
- class: ResourceRequirement
  coresMin: $(inputs.threads)
  ramMin: $(inputs.ram*1000)
- class: DockerRequirement
  dockerPull: quay.io/pacbio/bcftools@sha256:46720a7ab5feba5be06d5269454a6282deec13060e296f0bc441749f6f26fdec
baseCommand: []
arguments:
- position: 0
  shellQuote: false
  valueFrom: |
    bcftools stats \
      --threads ${ return inputs.threads - 1 } \
      --apply-filters PASS --samples $(inputs.sample_id) \
      --fasta-ref $(inputs.reference.path) \
      $(inputs.vcf.path) \
      > $(inputs.sample_id).deepvariant.stats.txt

    bcftools roh \
      --threads ${ return inputs.threads - 1 } \
      --AF-dflt 0.4 \
      $(inputs.vcf.path) \
      > $(inputs.sample_id).bcftools_roh.out

    printf '#chr\tstart\tend\tqual\n' > $(inputs.sample_id).bcftools_roh.bed
    awk -v OFS='\t' '$1=="RG" { print $3, $4, $5, $8 }' \
      $(inputs.sample_id).bcftools_roh.out \
      >> $(inputs.sample_id).bcftools_roh.bed

inputs:
  vcf: { type: 'File' }
  reference: { type: 'File' }
  sample_id: { type: 'string' }
  threads: { type: 'int?', default: 2, doc: "Number of threads to allocate to this task." }
  ram: { type: 'int?', default: 4, doc: "GB of RAM to allocate to this task." }

outputs:
  vcf_stats: { type: 'File', outputBinding: { glob: '*.txt' } }
  roh_out: { type: 'File', outputBinding: { glob: '*.out' } }
  roh_bed: { type: 'File', outputBinding: { glob: '*.bed' } }
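
To make the CWL parameter references concrete: with the defaults (`threads: 2`) and `sample_id: sample_01`, the `valueFrom` block above expands to roughly the following shell script (a sketch; the VCF and reference paths are placeholders).

```bash
bcftools stats \
  --threads 1 \
  --apply-filters PASS --samples sample_01 \
  --fasta-ref GRCh38.fasta \
  sample_01.vcf.gz \
  > sample_01.deepvariant.stats.txt

bcftools roh \
  --threads 1 \
  --AF-dflt 0.4 \
  sample_01.vcf.gz \
  > sample_01.bcftools_roh.out

printf '#chr\tstart\tend\tqual\n' > sample_01.bcftools_roh.bed
awk -v OFS='\t' '$1=="RG" { print $3, $4, $5, $8 }' \
  sample_01.bcftools_roh.out \
  >> sample_01.bcftools_roh.bed
```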
32 changes: 32 additions & 0 deletions tools/coverage_dropouts.cwl
@@ -0,0 +1,32 @@
class: CommandLineTool
cwlVersion: v1.2
id: coverage_dropouts
doc: |
  Original PacBio WDL: https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/blob/main/workflows/sample_analysis/sample_analysis.wdl
requirements:
- class: ShellCommandRequirement
- class: InlineJavascriptRequirement
- class: ResourceRequirement
  coresMin: $(inputs.threads)
  ramMin: $(inputs.ram*1000)
- class: DockerRequirement
  dockerPull: quay.io/pacbio/trgt@sha256:8c9f236eb3422e79d7843ffd59e1cbd9b76774525f20d88cd68ca64eb63054eb
baseCommand: []
arguments:
- position: 0
  shellQuote: false
  valueFrom: |
    check_trgt_coverage.py \
      $(inputs.tandem_repeat_bed.path) \
      $(inputs.bam.path) \
      > $(inputs.sample_id).trgt.dropouts.txt

inputs:
  bam: { type: 'File', secondaryFiles: [{pattern: ".bai", required: true}] }
  tandem_repeat_bed: { type: 'File' }
  sample_id: { type: 'string?' }
  threads: { type: 'int?', default: 2, doc: "Number of threads to allocate to this task." }
  ram: { type: 'int?', default: 4, doc: "GB of RAM to allocate to this task." }

outputs:
  trgt_dropouts: { type: 'File', outputBinding: { glob: '*.txt' } }
36 changes: 36 additions & 0 deletions tools/cpg_pileup.cwl
@@ -0,0 +1,36 @@
class: CommandLineTool
cwlVersion: v1.2
id: cpg_pileup
doc: |
  Original PacBio WDL: https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/blob/main/workflows/sample_analysis/sample_analysis.wdl
requirements:
- class: ShellCommandRequirement
- class: InlineJavascriptRequirement
- class: ResourceRequirement
  coresMin: $(inputs.threads)
  ramMin: $(inputs.threads*4000) # Uses ~4 GB memory / thread
- class: DockerRequirement
  dockerPull: quay.io/pacbio/pb-cpg-tools@sha256:b95ff1c53bb16e53b8c24f0feaf625a4663973d80862518578437f44385f509b
baseCommand: []
arguments:
- position: 0
  shellQuote: false
  valueFrom: |
    aligned_bam_to_cpg_scores \
      --threads $(inputs.threads) \
      --bam $(inputs.bam.path) \
      --ref $(inputs.reference.path) \
      --output-prefix $(inputs.sample_id) \
      --min-mapq 1 \
      --min-coverage 10 \
      --model "$PILEUP_MODEL_DIR"/pileup_calling_model.v1.tflite

inputs:
  bam: { type: 'File', secondaryFiles: [{pattern: ".bai", required: true}] }
  reference: { type: 'File', secondaryFiles: [{pattern: ".fai", required: true}] }
  sample_id: { type: 'string' }
  threads: { type: 'int?', default: 12, doc: "Number of threads to allocate to this task." }

outputs:
  pileup_beds: { type: 'File[]', outputBinding: { glob: '*.bed' } }
  pileup_bigwigs: { type: 'File[]', outputBinding: { glob: '*.bw' } }