pipeline-call-sSNV

call-sSNV

Overview

The call-sSNV nextflow pipeline performs somatic SNV calling given a pair of tumor/normal BAM files. Four somatic SNV callers are available: SomaticSniper, Strelka2, Mutect2 and MuSE. The user may request one or more callers, and each caller produces an independently generated filtered VCF file.

If two or more callers are requested, additional output includes both a VCF and an MAF file with the set of SNVs shared by two or more callers, and a Venn Diagram showing counts of shared and private SNVs.

SomaticSniper, Strelka2, and MuSE require there to be exactly one pair of input tumor/normal BAM files, but Mutect2 will take tumor-only input (no paired normal), as well as tumor/normal BAM pairs for multiple samples from the same individual.

Somatic SNV callers:

SomaticSniper is an older tool yielding high specificity single nucleotide somatic variants.
Strelka2 uses candidate indels from Manta and calls somatic short mutations (single nucleotide and small indel), filtered with a random forest model.
GATK Mutect2 calls somatic short mutations via local assembly of haplotypes.
MuSE accounts for tumor heterogeneity and calls single nucleotide somatic variants.

How To Run

Below is a summary of how to run the pipeline. See here for more information on running Nextflow pipelines.

Note: Because this pipeline uses an image stored in the GitHub Container Registry, you must follow the steps listed in the Docker Introduction on Confluence to set up a PAT for your GitHub account and log into the registry on the cluster before running this pipeline.

The recommended way of running the pipeline is to directly use the source code located here: /hot/software/pipeline/pipeline-call-sSNV/Nextflow/release/, rather than cloning a copy of the pipeline.
- The source code should never be modified when running our pipelines
Copy and edit the input config file

Make sure the reference .fa file in config file matches the reference genome in the input BAM files.

Copy and edit the input YAML
The pipeline can be executed locally using the command below:

nextflow run path/to/main.nf -config path/to/input.config -params-file input.yaml`

For example,

path/to/main.nf could be: /hot/software/pipeline/pipeline-call-sSNV/Nextflow/release/5.0.0/main.nf
path/to/input.config is the path to where you saved your project-specific copy of template.config
path/to/input.yaml is the path to where you saved your project-specific copy of template.yaml

To submit to UCLAHS-CDS's Azure cloud, use the submission script here with the command below:

python path/to/submit_nextflow_pipeline.py \
    --nextflow_script path/to/main.nf \
    --nextflow_config path/to/input.config\
    --nextflow_yaml path/to/input.yaml \
    --pipeline_run_name <run_name> \
    --partition_type F72 \
    --email jdoe@ucla.edu

Note: Although --partition_type F2 is an available option for small data sets, Mutect2 and Muse will fail due to lack of memory.

Flow Diagrams - Variant Calling

SomaticSniper

Strelka2

Mutect2

MuSE

Flow Diagrams - Variant Intersection

BCFtools intersection and VennDiagram visualization using R

Pipeline Steps - Variant Calling

SomaticSniper

1. `SomaticSniper` v1.0.5.0

Compare a pair of tumor and normal BAM files and output an unfiltered list of single nucleotide positions that are different between tumor and normal, in VCF format.

2. Filter out ambiguous positions.

This takes several steps, listed below, and starts with the same input files given to SomaticSniper. These are used to generate a list of high confidence indels to assist SNV filtering.

a. Get indel pileup summaries

Summarize counts of reads that support reference, alternate and other alleles for given sites. This is done for both of the input BAM files and the results are used in the next step.

b. Filter indel pileup outputs

Use samtools.pl varFilter to filter each pileup output (tumor and normal), then further filter each to keep only indels with QUAL > 20. samtools.pl is packaged with SomaticSniper.

c. Filter SomaticSniper VCF

Use snpfilter.pl (packaged with SomaticSniper): i. filter VCF using normal indel pileup (from step b). ii. filter VCF output from step i using tumor indel pileup (from step b).

d. Summarize alignment information for retained SNV positions

Extract positions from filtered VCF file and use with bam-readcount to generate a summary of read alignment metrics for each position.

e. Final filtering of SNVs using metrics summarized above

Use fpfilter.pl and highconfidence.pl (packaged with SomaticSniper), resulting in a final high confidence VCF file.

Strelka2

1. `Manta` v1.6.0

The input pair of tumor/normal BAM files are used by Manta to produce candidate small indels via the Manta somatic configuration protocol. Note, larger (structural) variants are also produced and can be retrieved from the intermediate files directory if save intermediate files is enabled.

2. `Strelka2` v2.9.10

The input pair of tumor/normal BAM files, along with the candidate small indel file produced by Manta are used by Strelka2 to create lists of somatic single nucleotide and small indel variants, both in VCF format. Lower quality variants that did not pass filtering are subsequently removed, yielding .SNV-pass.vcf.gz and .Indel-pass.vcf.gz files.

GATK Mutect2

1. Define intervals for scattering

The params.intersect_regions of the reference genome are split into x intervals for parallelization, where x is defined by params.scatter_count.

2. Call small somatic variants

Call somatic variants with Mutect2.

3. Merge

Merge scattered outputs (VCFs, statistics).

4. Learn read orientations

Create artifact prior table based on read orientations with GATK's LearnReadOrientationModel.

5. Filter

Filter variants with GATK's FilterMutectCalls, using read orientation prior table and contamination table as well as standard filters.

6. Split VCF

Split filtered VCF into separate files for each variant type: SNVs, MNVs and INDELs.

MuSE

1. `MuSE call`

Pre-filtering and calculating position-specific summary statistics using the Markov substitution model.

2. `MuSE sump`

Computes tier-based cutoffs from a sample-specific error model.

3.Filter VCF

MuSE output has SNVs labeled as PASS or one of Tier 1-5 for the lower confidence calls (Tier 5 is lowest). This step keeps only SNVs labeled PASS.

Pipeline Steps - Variant Intersection

If two or more algorithms were selected the Intersect workflow will run. Currently the resulting VCF and MAF files include any SNVs found by two or more algorithms.

BCFtools isec -n +1; VennDiagram

Determines presence/absence of each SNV within each algorithm's set of filtered SNVs. Results are listed in the output files: isec-1-or-more/README.txt and isec-1-or-more/sites.txt, and are summarized in a Venn Diagram plot (TIFF format).

BCFtools isec -n +2

Determines presence/absence of SNVs found in two or more of each algorithm's set of filtered SNVs, and outputs a consensus VCF for each algorithm containing SNVs found by that algorithm plus at least one other algorithm. Results are also listed in the output files: isec-2-or-more/README.txt and isec-2-or-more/sites.txt.

BCFtools concat

Concatenates the 2+ algorithm consensus SNVs into one VCF (SNV-concat.vcf.gz). The output header is a uniquified concatenation of all input VCF headers. The output fields INFO, FORMAT, NORMAL and TUMOR are from the first listed VCF that has the SNV. Input VCFs are sorted alphanumerically by the algorithm name.

vcf2maf

Converts SNV-concat.vcf.gz from step 3 into MAF format. Output includes allele counts and flanking basepairs, but most fields are blank. Details can be found here.

Inputs

To run the pipeline, one input.yaml and one input.config are needed, as follows.

input.yaml. (see template)

Input	Type	Description
patient_id	string	The name/ID of the patient
tumor_BAM	path	The path to the tumor .bam file (.bai file must exist in same directory)
normal_BAM	path	The path to the normal .bam file (.bai file must exist in same directory)
contamination_table	path	Optional, but only for tumor samples. The path of the `contamination.table`, which is generated from the GATK's `CalculateContamination` in `pipeline-call-gSNP`. The contamination.table path can be found under `pipeline-call-gSNP`'s output `QC` folder

input.yaml should follow the standardized structure:

patient_id: 'patient_id'
input:
  normal:
    - BAM: /path/to/normal.bam
  tumor:
    - BAM: /path/to/tumor.bam
      contamination_table: /path/to/contamination.table

Mutect2 can take other inputs: tumor-only sample and one patient's multiple samples. For tumor-only samples, remove the normal input in input.yaml, e.g. template_tumor_only.yaml. For multiple samples, put all the input BAMs in the input.yaml, e.g. template_multi_sample.yaml. Note, for these non-standard inputs, the configuration file must have 'mutect2' listed as the only algorithm.

input.config (see template)

Input	Required	Type	Description
`algorithm`	yes	list	List containing a combination of somaticsniper, strelka2, mutect2 and muse
`reference`	yes	string	The reference .fa file (.fai and .dict file must exist in same directory)
`intersect_regions`*	yes	string	A bed file listing the genomic regions for variant calling. Excluding `decoy` regions is HIGHLY recommended *
`output_dir`	yes	string	The location where outputs will be saved
`dataset_id`	yes	string	The name/ID of the dataset
`exome`	yes	boolean	The option will be used by `Strelka2` and `MuSE`. When `true`, it will add the `--exome` option to Manta and Strelka2, and `-E` option to MuSE
`save_intermediate_files`	yes	boolean	Whether to save intermediate files
`work_dir`	no	string	The path of working directory for Nextflow, storing intermediate files and logs. The default is `/scratch` with `ucla_cds` and should only be changed for testing/development. Changing this directory to `/hot` or `/tmp` can lead to high server latency and potential disk space limitations, respectively
`docker_container_registry`	no	string	Registry containing tool Docker images, optional. Default: `ghcr.io/uclahs-cds`
`base_resource_update`	optional	namespace	Namespace of parameters to update base resource allocations in the pipeline. Usage and structure are detailed in `template.config` and below.

*Providing intersect_regions is required and will limit the final output to just those regions. All regions of the reference genome could be provided as a bed file with all contigs, however it is HIGHLY recommended to remove decoy contigs from the human reference genome. Including these thousands of small contigs will require the user to increase available memory for Mutect2 and will cause a very long runtime for Strelka2. See Discussion here. For uclahs-cds users, a GRCh38 bed.gz file can be found here: /hot/ref/tool-specific-input/pipeline-call-sSNV-6.0.0/GRCh38-BI-20160721/Homo_sapiens_assembly38_no-decoy.bed.gz.

Base resource allocation updaters

To optionally update the base resource (cpus or memory) allocations for processes, use the following structure and add the necessary parts to the input.config file. The default allocations can be found in the node-specific config files

base_resource_update {
    memory = [
        [['process_name', 'process_name2'], <multiplier for resource>],
        [['process_name3', 'process_name4'], <different multiplier for resource>]
    ]
    cpus = [
        [['process_name', 'process_name2'], <multiplier for resource>],
        [['process_name3', 'process_name4'], <different multiplier for resource>]
    ]
}

Note Resource updates will be applied in the order they're provided so if a process is included twice in the memory list, it will be updated twice in the order it's given.

Examples:

To double memory of all processes:

base_resource_update {
    memory = [
        [[], 2]
    ]
}

To double memory for call_sSNV_Mutect2 and triple memory for run_validate_PipeVal and run_sump_MuSE:

base_resource_update {
    memory = [
        ['call_sSNV_Mutect2', 2],
        [['run_validate_PipeVal', 'run_sump_MuSE'], 3]
    ]
}

To double CPUs and memory for run_sump_MuSE and double memory for run_validate_PipeVal:

base_resource_update {
    cpus = [
        ['run_sump_MuSE', 2]
    ]
    memory = [
        [['run_sump_MuSE', 'run_validate_PipeVal'], 2]
    ]
}

Module Specific Configuration

Input	Required	Type	Description
bgzip_extra_args	no	string	The extra option used for compressing VCFs
tabix_extra_args	no	string	The extra option used for indexing VCFs

Mutect2 Specific Configuration

Input	Required	Type	Description
split_intervals_extra_args	no	string	Additional arguments for the SplitIntervals command
mutect2_extra_args	no	string	Additional arguments for the Mutect2 command
filter_mutect_calls_extra_args	no	string	Additional arguments for the FilterMutectCalls command
gatk_command_mem_diff	yes	nextflow.util.MemoryUnit	How much to subtract from the task's allocated memory where the remainder is the Java heap max. (should not be changed unless task fails for memory related reasons)
scatter_count	yes	int	Number of intervals to split the desired interval into. Mutect2 will call each interval seperately.
germline_resource_gnomad_vcf	no	path	A stripped down version of the gnomAD VCF stripped of all unneeded INFO fields, keeping only AF, currently available for GRCh38:`/hot/ref/tool-specific-input/GATK/GRCh38/af-only-gnomad.hg38.vcf.gz` and GRCh37: `/hot/ref/tool-specific-input/GATK/GRCh37/af-only-gnomad.raw.sites.vcf`.
panel_of_normals_vcf	no	path	VCF file of sites observed in normal. Currently available for GRCh38: `/hot/ref/tool-specific-input/GATK/GRCh38/1000g_pon.hg38.vcf.gz`. This could be useful for tumor only mode.

MuSE Specific Configuration

Input	Required	Type	Description
dbSNP	yes	path	The path to NCBI's dbSNP database of known SNPs in VCF format, e.g. `GCF_000001405.40.gz`

Variant Intersection Specific Configuration

Input	Required	Type	Description
ncbi_build	yes	string	vcf2maf requires the reference genome build ID, e.g. GRCh38
vcf2maf_extra_args	no	string	additional arguments for the vcf2maf command

Outputs

Tool Outputs	Type	Description
SomaticSniper-{version}_{sample_id}_SNV.vcf.gz	.vcf.gz	Filtered SNV VCF (somaticsniper)
Strelka2-{version}_{sample_id}_SNV.vcf.gz	.vcf.gz	Filtered SNV VCF(strelka2)
Strelka2-{version}_{sample_id}_Indel.vcf.gz	.vcf.gz	Filtered Indel VCF (strelka2)
Mutect2-{version}_{sample_id}_SNV.vcf.gz	.vcf.gz	Filtered SNV VCF (mutect2)
Mutect2-{version}_{sample_id}_Indel.vcf.gz	.vcf.gz	Filtered Indel VCF (mutect2)
Mutect2-{version}_{sample_id}_MNV.vcf.gz	.vcf.gz	Filtered MNV VCF (mutect2)
Mutect2-{version}_{sample_id}_filteringStats.tsv	.tsv	FilterMutectCalls output (mutect2 QC)
MuSE-{version}_{sample_id}_SNV.vcf.gz	.vcf.gz	Filtered SNV VCF (MuSE)
report.html, timeline.html, trace.txt	.html, .txt	Nextflow logs

Intersect Outputs	Type	Description
isec-1-or-more	directory	BCFtools isec README.txt and sites.txt, all variants
isec-2-or-more	directory	BCFtools isec README.txt and sites.txt, variants shared by 2 or more tools
SomaticSniper-{version}_{sample_id}_consensus-variants.vcf.gz	.vcf.gz	`2-or-more` SNV VCF
Strelka2-{version}_{sample_id}_consensus-variants.vcf.gz	.vcf.gz	`2-or-more` SNV VCF
Mutect2-{version}_{sample_id}_consensus-variants.vcf.gz	.vcf.gz	`2-or-more` SNV VCF
MuSE-{version}_{sample_id}_consensus-variants.vcf.gz	.vcf.gz	`2-or-more` SNV VCF
BCFtools-{version}_{sample_id}_Venn-diagram.tiff	.tiff	Venn Diagram with intersection counts for all variants (`1-or-more`)
BCFtools-{version}_{sample_id}_SNV-concat.vcf.gz	.vcf.gz	Single SNV VCF with all `2-or-more` variants and mixed annotation
BCFtools-{version}_{sample_id}_SNV-concat.maf.gz	.maf.gz	Single SNV MAF with all `2-or-more` variants and mixed annotation

Performance Validation

Testing was performed in the Boutros Lab SLURM Development cluster. Metrics below will be updated where relevant with additional testing and tuning outputs. Pipeline version used here is v4.0.0-rc.1

Whole Exomes

General estimates, with wide variation, are that whole exome sequences require 16 cpus and 32 GB of memory to run all of the pipeline algorithms. If MuSE is excluded 8 cpus and 16 GB of memory are sufficient. Run time for a test pair of exome tumor/normal input BAMs of 4 GB/5 GB was in both cases 1 to 2 hours.

Whole Genomes

General estimates, with wide variation, are that whole genome sequences require 72 cpus and 144 GB of memory to run all of the pipeline algorithms. If MuSE is excluded 8 cpus and 16 GB of memory are sufficient, but run time could be very long. Run time for a test pair of WGS tumor/normal input BAMs of 400 GB/200 GB was 15 hours for 72 cpus/144 GB, and 52 hours for 8 cpus/16 GB excluding MuSE.

Mutect2

Duration: 3h 25m 24s

Process call_sSNVInAssembledChromosomes_Mutect2 has been split into 50 intervals, so the following table shows one of those processes:

process_name	max_duration	max_cpu	max_peak_vmem
call_sSNVInNonAssembledChromosomes_Mutect2	32m 44s	142.0%	33.1 GB
call_sSNVInAssembledChromosomes_Mutect2	1h 20m 12s	123.8%	7.8 GB
run_LearnReadOrientationModel_GATK	31m 5s	106.8%	10.2 GB

SomaticSniper

Duration: 9h 21m 23s

process_name	max_duration	max_cpu	max_peak_vmem
convert_BAM2Pileup_SAMtools	4h 18m 29s	98.2%	1.9 GB
call_sSNV_SomaticSniper	8h 48m 45s	98.7%	511.6 MB
generate_ReadCount_bam_readcount	29m 33s	75.9%	261.5 MB

Strelka2

Strelka2's runtime will be significantly improved when using --callRegions option to exclude the non-canoincal regions of the genome, here is the results of CPCG0196: Sample: CPCG0196 Normal BAM: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/outputs/bwa-mem2_and_hisat2-2.2.1/bwa-mem2/bams/a-full-CPCG0196-B1/align-DNA-20210424-024139/pipeline-alignDNA.inputs.CPCG0196-B1.bam Tumor BAM: /hot/resource/pipeline_testing_set/WGS/GRCh38/A/full/CPCG0000000196-T001-P01-F.bam

without `--callRegions`:

process_name	max_duration	max_cpu	max_peak_vmem
call_sIndel_Manta	1h 24m 26s	2724.2%	23.2 GB
call_sSNV_Strelka2	22h 32m 24s	511.3%	17.4 GB

with `--callRegions`:

process_name	max_duration	max_cpu	max_peak_vmem
call_sIndel_Manta	1h 35m 25s	1848.6%	11.7 GB
call_sSNV_Strelka2	59m 19s	3234.0%	8.2 GB

Therefore, we strongly suggest to use the --callRegions if the non-canonical region is unnecessary. -callRegions's input bed.gz file can be found here: /hot/ref/tool-specific-input/Strelka2/GRCh38/strelka2_call_region.bed.gz. For other genome version, you can use UCSC Liftover to convert.

MuSE v2.0

MuSE v2.0 was tested with a normal/tumor paired CPCG0196 WGS sample on a F32 slurm-dev node. Duration: 1d 11h 6m 54s

process_name	max_duration	max_cpu	max_peak_vmem
call_sSNV_MuSE	3h 44m 15s	3181.7%	60.4 GB
run_sump_MuSE	1d 7h 22m 2s	100.0%	41.6 GB

References

Larson, D. E. et al. SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics 28, 311–317 (2012).
Kim, S. et al. Strelka2: fast and accurate calling of germline and somatic variants. Nat. Methods 15, 591–594 (2018).
McKenna, A. et al. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
Fan, Y. et al. MuSE: accounting for tumor heterogeneity using a sample-specific error model improves sensitivity and specificity in mutation calling from sequencing data. Genome Biol. 17, 178 (2016).

License

Authors: Mao Tian (maotian@mednet.ucla.edu), Bugh Caden, Helena Winata (HWinata@mednet.ucla.edu), Sorel Fitz-Gibbon (sfitzgibbon@mednet.ucla.edu).

pipeline-call-sSNV is licensed under the GNU General Public License version 2. See the file LICENSE for the terms of the GNU GPL license.

This pipeline performs somatic SNV calling on a pair of normal/tumor BAMs, utilizing SomaticSniper, Strelka2, Mutect2 and MuSE.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

Name		Name	Last commit message	Last commit date
Latest commit History 1,471 Commits
.github		.github
config		config
docker/plot-venn		docker/plot-venn
docs		docs
external		external
input		input
module		module
r-scripts		r-scripts
test		test
.gitignore		.gitignore
.gitmodules		.gitmodules
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
README.md		README.md
main.nf		main.nf
metadata.yaml		metadata.yaml
nextflow.config		nextflow.config
nftest.yml		nftest.yml

License

uclahs-cds/pipeline-call-sSNV

Folders and files

Latest commit

History

Repository files navigation

pipeline-call-sSNV

Overview

Somatic SNV callers:

How To Run

Flow Diagrams - Variant Calling

Flow Diagrams - Variant Intersection

Pipeline Steps - Variant Calling

SomaticSniper

1. SomaticSniper v1.0.5.0

2. Filter out ambiguous positions.

a. Get indel pileup summaries

b. Filter indel pileup outputs

c. Filter SomaticSniper VCF

d. Summarize alignment information for retained SNV positions

e. Final filtering of SNVs using metrics summarized above

Strelka2

1. Manta v1.6.0

2. Strelka2 v2.9.10

GATK Mutect2

1. Define intervals for scattering

2. Call small somatic variants

3. Merge

4. Learn read orientations

5. Filter

6. Split VCF

MuSE

1. MuSE call

2. MuSE sump

3.Filter VCF

Pipeline Steps - Variant Intersection

BCFtools isec -n +1; VennDiagram

BCFtools isec -n +2

BCFtools concat

vcf2maf

Inputs

input.yaml. (see template)

input.config (see template)

Base resource allocation updaters

Module Specific Configuration

Mutect2 Specific Configuration

MuSE Specific Configuration

Variant Intersection Specific Configuration

Outputs

Performance Validation

Whole Exomes

Whole Genomes

Mutect2

SomaticSniper

Strelka2

without --callRegions:

with --callRegions:

MuSE v2.0

References

License

About

Topics

Resources

License

Stars

Watchers

Forks

Languages

1. `SomaticSniper` v1.0.5.0

1. `Manta` v1.6.0

2. `Strelka2` v2.9.10

1. `MuSE call`

2. `MuSE sump`

without `--callRegions`:

with `--callRegions`: