This repository contains tools and workflows for processing of tumor-only samples. The Kids First DRC recommends running the tumor only pipeline ONLY when no matched normal sample is available. If your data has matched normals we recommend running the Kids First DRC Somatic Variant Workflow instead. This workflow is not a traditional production pipeline run on all data, but rather is run at the user's request.
When comparing the SNV outputs of this workflow to those of the somatic workflow, we have found the outputs to be considerably more noisy. To cut down on this noise, we have included some recommended inputs, parameters, and filters for Mutect2 in our docs. In short we recommend:
- Restrict the callable regions with a blacklist and Panel of Normals (PON)
- Remove low support reads:
- Allele Depth (AD) == 0: WGS uninformative reads
- Variant Allele Frequency (VAF) < 1%: WXS noise
- Remove potential germline variants: gnomAD AF > 0.00003
- Only keep variants that are PASS
- Rescue any variants that fall in hotspot regions/genes
Benchmarking results of SNV calling used to inform our filtering criteria can be found in this README It can also be used to process PDX data by first pre-processing reads using the Xenome tool, explained more here in documentation.
This repository takes advantage of the git submodule feature. The Single Nucleotide Variant annotation workflow is maintained in our Annotation Tools Repository. Therefore, in order to get the code for a submodule you can either:
- Clone the repository recursively with
git clone --recursive
- After cloning, run:
git submodule init && git submodule update
More info on how this worked here
The wrapper workflow that runs most of the tools is found here.
- Mutect2 from GATK 4.2.2.0
- Annotation using the Kids First DRC Somatic SNV Annotation Workflow
- ControlFREEC v11.6
- Manta v1.4.0
Most inputs have recommended values that should auto import both files and parameters
indexed_reference_fasta
: FAI and DICT indexed Homo_sapiens_assembly38.fastamutect2_af_only_gnomad_vcf
: af-only-gnomad.hg38.vcf.gzmutect2_exac_common_vcf
: small_exac_common_3.hg38.vcf.gzgem_mappability_file
: hg38_canonical_150.mappability. If you don't have one for your reference and read length, you can first run the GEM indexer tool, then concatenate those results and convert to a mappability file using the GEM mappability tool.b_allele
: dbSNP_v153_ucsc-compatible.converted.vt.decomp.norm.common_snps.vcf.gz. dbSNP v153 was obtained from the ftp site. Then, using a awk/perl/bash script of your choice, convert NCBI accession names to UCSC-style chromosome names using this table. Next, run the VCF normalization tool, then use bcftools to extract only common snps:bcftools view --include INFO/COMMON=1 --types snps dbSNP_v153_ucsc-compatible.converted.vt.decomp.norm.vcf.gz -O z -o dbSNP_v153_ucsc-compatible.converted.vt.decomp.norm.common_snps.vcf.gz
. Lastly, use tabix to index the resultant file.vep_cache
: homo_sapiens_merged_vep_105_indexed_GRCh38.tar.gzgenomic_hotspots
: tert.bed # bed file with TERT gene promoter regionprotein_snv_hotspots
: kfdrc_protein_snv_cancer_hotspots_20240718.txtprotein_indel_hotspots
: protein_indel_cancer_hotspots_v2.ENS105_liftover.tsvechtvar_anno_zips
: gnomad.v3.1.1.custom.echtvar.zip
input_tumor_aligned
: Indexed BAM/CRAM/SAM fileinput_tumor_name
: sample name, should match read group sample name ininput_tumor_aligned
panel_of_normals
: Mutect2 Panel of Normalswgs_or_wxs
: Choose whether input is Whole Genome Sequencing (WGS) or Whole Exome Sequencing or Panel (WXS)calling_regions
:- For WGS: wgs_canonical_calling_regions.hg38.bed
- For WXS: Unpadded experimental bait capture regions
blacklist_regions
:- For WGS: hg38-blacklist.v2.bed.gz
- For WXS: none
cnv_blacklist_regions
:- For WGS: somatic-hg38_CNV_and_centromere_blacklist.hg38liftover.bed
- For WXS: none
i_flag
: for CNV calling, whether to intersect b allele file. Set toN
skipcfree_sex
: for CNV calling, set to XX for female, XY for malecfree_ploidy
: Array of ploidy possibilities for ControlFREEC to try. Recommend [2,3,4]filtermutectcalls_extra_args
: "--min-allele-fraction 0.01"gatk_filter_name
:["GNOMAD_AF_HIGH", "ALT_DEPTH_LOW"]
gatk_filter_expression
:["gnomad_3_1_1_AF != '.' && gnomad_3_1_1_AF > 0.001 && gnomad_3_1_1_FILTER == 'PASS'", "vc.getGenotype('<input_tumor_name>').getAD().1 < 1"]
output_basename
: String value to use as basename for outputs
mutect2_protected_outputs
: VCF with SNV, MNV, and INDEL variant calls and of pipeline soft FILTER-added values in MAF and VCF format with annotation, VCF index, and MAF format outputmutect2_public_outputs
: Protected outputs, except MAF and VCF have had entries with soft FILTER values removedmutect2_bam
: BAM generated will be written as BAM. Useful for debugging
ctrlfreec_pval
: Copy number call with GT (if BAF provided) and p values. Most people want thisctrlfreec_config
: Config file used to runctrlfreec_pngs
: Visualization of CN and BAFctrlfreec_bam_ratio
: Calls as log2 ratioctrlfreec_bam_seg
: Custom made microarray-style SEG filectrlfreec_baf
: b allele frequency filectrlfreec_info
: Calculated information, like ploidy, if a range was given
manta_pass_vcf
: VCF file with SV calls that PASSmanta_prepass_vcf
: VCF file with all SV callsannotsv_annotated_calls
: Manta calls annotated with AnnotSVannotsv_unannotated_calls
: Manta calls not annotated with AnnotSV