A set of scripts to assist the calling of somatic structural variants from short reads using a random forest classifier.
Download consensus PCAWG calls from:
https://dcc.icgc.org/api/v1/download?fn=/PCAWG/consensus_sv/final_consensus_sv_bedpe_passonly.icgc.public.tgz
https://dcc.icgc.org/api/v1/download?fn=/PCAWG/consensus_sv/final_consensus_sv_bedpe_passonly.tcga.public.tgz
The PCAWG consensus call set for SVs (v1.6, n=309,246) is used as a reference for somatic SVs. The PCAWG SVs are in hg19 genome coordinates, thus is necessary to perform a liftover to GRCh38 using crossmap (v0.3.9). PCAWG SVs at sample level (n=2,748) were merged into a non-redundant cohort callset with SURVIVOR (v1.0.7) (merge subcommand), leading to a total of 283,980 non-redundant somatic SVs.
The gnomAD database (v2.1, n=299,211) GRCh38 liftover.
https://ftp.ncbi.nlm.nih.gov/pub/dbVar/data/Homo_sapiens/by_study/vcf/nstd166.GRCh38.variant_call.vcf.gz
The following databases, BED format, should be provided,
are used to annotate the location/impact of SVs. The directory bedfiles provide examples for each of the aforementioned BED files. Note that files should be uncompressed.
Call germline SVs using SvABA (v1.1.0), Manta (v1.6.0), and Delly (v0.8.3) for building a custom panel of normal for each caller, follow the best practices for calling germline variants provided by each tool. Finally, make a consensus for each caller by integrating several samples with SURVIVOR (merge subcommand).
The matrices for training and predictions are built using the script svtools.pl, the script should be called for each tumor and matched sample:
# manta example, similar calls should be done for DElly and SVaba
perl ssvht/svtools.pl -a Sample.tumor_only.pass.vcf \
-b Manta.PoN.vcf -c gnomad.2.1.GRCh38.vcf \
-d all_pcawg.hg38.non_redundant.vcf -s Sample.matched.pass.vcf \
-p <prefix> -x ssvht/bedfiles/bedfiles_list.txt
In brief, the -a option provides the VCF calls that PASS filter for the tumor-only sample, -b provides the Panel-of-Normal file (VCF), -c and -d provide the reference SV databases, -s the somatic calls in case of training (or empty file), -p is the prefix and -x the path to bedfiles needed to annotate the location/impact of SVs.
The output is a matrix at the sample level with all the features needed by the model (see BioRXiv), including:
Tumor ID CHROM POS TYPE SVLEN PE_SR PE SR PON PON_SUPP PON_TYPE PON_IDS PON_BC1 PON_BC2 GNOMAD GNOMAD_AC GNOMAD_TYPE GNOMAD_IDS GNOMAD_BC1 GNOMAD_BC2 PCAWG PCAWG_SUP PCAWG_TYPE PCAWG_IDS PCAWG_BC1 PCAWG_BC2 SOMATIC SOMATIC_TYPE SOMATIC_IDS COSMIC_GENE CENTROMER EXON CONSERVATION_BC1 CONSERVATION_BC2 CNV_TCN1 CNV_TCN2 CNV_CF1 CNV_CF2 RAF RFS
SAMPLE_T INV00000004 chr1 10433 INV 248935200 3 0 3 1 5 INV INV00000001 5 11 0 0 0 0 0 0 0 0 0 0 3 2 0 0 0 0 0 0 0 0 2 2 0 1 0.122222222222222 90
SAMPLE_T DUP00000008 chr1 789482 DUP 223225120 43 43 0 1 45 DUP DUP00000008 6 12 0 0 0 0 5 0 0 0 0 0 1 6 0 0 0 0 0 0 6 11 2 2 1 1 0.686567164179104 67
SAMPLE_T DEL00000009 chr1 789502 DEL 223222907 17 17 0 1 28 DEL DEL00000006 6 12 0 0 0 0 5 2 0 0 0 0 1 6 0 0 0 0 0 0 6 11 2 2 1 1 0.390243902439024 41
SAMPLE_T DEL00000018 chr1 934105 DEL 740 44 44 0 1 40 DEL DEL00000014 2 2 1 4 DUP nssv15963806 9 9 1 4 DEL SV0196854 9 9 0 0 0 0 0 0 6 6 2 2 1 1 0.978723404255319 47
SAMPLE_T DEL00000022 chr1 1028337 DEL 737 3 3 0 1 22 DEL DEL00000019 5 5 1 2134 DEL nssv15852976 4 6 0 0 0 0 2 2 0 0 0 0 0 0 6 6 2 2 1 1 0.0833333333333333 36
SAMPLE_T INV00000026 chr1 1067280 INV 1462 7 0 7 1 9 INV INV00000022 3 3 0 0 0 0 2 2 0 0 0 0 2 2 0 0 0 0 0 0 8 8 2 2 1 1 0.448979591836735 98
SAMPLE_T DEL00000027 chr1 1288706 DEL 1162 6 6 0 1 6 DEL DEL00000019 7 6 1 6 DUP nssv15960599 19 16 0 0 0 0 5 5 0 0 0 0 0 0 5 5 2 2 1 1 1 4
SAMPLE_T DEL00000028 chr1 1350077 DEL 1338 7 7 0 1 17 DEL DEL00000025 4 4 1 3456 DEL nssv15848677 5 5 0 0 0 0 2 2 0 0 0 0 0 0 5 5 2 2 1 1 0.538461538461538 13
SAMPLE_T DEL00000031 chr1 1454714 DEL 30726 3 3 0 0 0 0 0 0 1 0 0 0 0 9 12 0 0 0 0 4 3 0 0 0 0 0 1 5 5 2 2 1 1 0.0461538461538462 65
The same script can be used to build a matrix containing the set of predicted somatic SVs from matched genomic data (-s option).
The directory rcode provide the R scripts for training and evaluation of the random forest models for each caller. The inputs are the matrices generated on STEP 2.
After classifying the tumor-only SVs into somatic and germline for each of the SV callers, a consensus per sample can be built using the SURVIVOR code (merge subcommand).
Some of the scripts might contain hard-coded paths which need to be replaced to match your data.