Skip to content

Project‐Level VCF Pipeline

Adam English edited this page Oct 15, 2024 · 6 revisions

A Project-Level VCF (pVCF) contains variants across multiple samples and genotypes for every variant from every sample. Below is a description of steps to create a pVCF.

Step 1: SV Discovery

First we'll need to discover the structure of all SVs across the samples. Here we'll be using sniffles. However, any variant discovery tool that produces sequence-resolved, breakpoint exact SVs will suffice. This means SV callers that report symbolic alleles (e.g. <DEL>, <DUP>, BNDs) cannot be handled by kanpig.

sniffles --input input.cram/input.bam --vcf output.vcf.gz --reference reference.fa

Step 2: Merging SVs

With all the SVs discovered, we now want to consolidate all the variants and collapse redundant variant representations. This will produce the alleles that compose the variant-graph which kanpig will genotype.

First, SVs should be consolidated using bcftools.

bcftools merge -m none -l file_list.txt -O z -o merged.vcf.gz 
tabix merged.vcf.gz

Where file_list.txt holds paths to all of the sniffles output.vcf.gz files.

Next, redundant SVs should be identified and removed from the consolidated VCF in order to simplify the variant-graphs kanpig will genotype.

truvari collapse -i merged.vcf.gz -c removed.vcf.gz --keep common | bcftools sort -O z -o collapsed.vcf.gz

Step 3 (optional): Clean the VCF

Since kanpig will be replacing all of the genotypes we can remove the sample columns from the VCF. However, we'll keep the first sample column just to make sure the VCF stays compliant.

gunzip -c collapsed.vcf.gz | cut -f1-10 | bgzip > nosamp.vcf.gz

Additionally, we can remove large/small variants that kanpig won't be analyzing

bcftools view -i "SVLEN >= 50 && SVLEN <= 10000" nosamp.vcf.gz -O z -o variant_graph.vcf.gz
tabix variant_graph.vcf.gz

Another optional final filter we can apply is removing regions of the reference with an abnormal number of variants. This will help kanpig's runtime and will help exclude variants in problematic regions early. See truvari documentation for details on how to make a bed file that excludes high density regions which can be used by kanpig.

Step 4: Running kanpig

Below is an example kanpig command that works generally well for genotyping multi-sample VCFs.

kanpig --input variant_graph.vcf.gz \
       --bam sample.bam \
       --reference reference.fa \
       --sample my_sample \
       --hapsim 0.97 \
       --chunksize 500 \
       --maxpaths 1000 \
       --gpenalty 0.04 \
    | bcftools sort -O z -o sample.vcf.gz

Be sure to use the --ploidy-bed (details) if possible. and use --bed if you made the bed file in step 3 which excludes regions with high-density of SVs.

Step 5: Creating the pVCF.

Since kanpig runs on a per-sample basis, you'll have one VCF for each sample. These SVs can be recombined to make the project level VCF. While recombining samples, we can also fill in some basic annotations and remove putative false positive variants which kanpig wasn't able to find support for in any sample.

bcftools merge -m none -l file_list.txt \
    | bcftools +fill-tags \
    | bcftools view -i "AC != 0" -O z -o final.vcf.gz

Where file_list.txt holds paths to all of the kanpig sample.vcf.gz results