FAQ
For pandora compare, a pangenome matrix file pandora_multisample.matrix indicates which genes/loci were found in each sample. It does not contain entries for genes/loci found in no sample.
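As a rough illustration of how such a matrix could be consumed, the sketch below parses a tab-separated presence/absence table into a mapping from locus to the set of samples containing it. The layout (a header row of sample names, then one row per locus with 0/1 flags) is an assumption, as are the locus and sample names in the example; check it against your own pandora_multisample.matrix before relying on it.

```python
import csv
import io


def read_presence_matrix(text):
    """Parse a tab-separated presence/absence matrix into {locus: set(samples)}.

    Assumed layout (not confirmed by this FAQ): a header row whose first cell
    is a label (possibly empty) followed by sample names, then one row per
    locus with 0/1 flags in the same column order.
    """
    rows = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(rows)
    samples = header[1:]
    presence = {}
    for row in rows:
        if not row:  # skip blank lines
            continue
        locus, flags = row[0], row[1:]
        presence[locus] = {s for s, f in zip(samples, flags) if f.strip() == "1"}
    return presence


# Hypothetical example content, for illustration only.
example = "\tsample1\tsample2\nGC0001\t1\t0\nGC0002\t1\t1\n"
presence = read_presence_matrix(example)
```

For a real file, pass the result of open("pandora_multisample.matrix").read() to the function.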
For pandora map, there is no matrix file, but pandora.consensus.fq.gz contains the sequence for all genes/loci found.
For pandora compare, the VCF reference sequence is found in pandora_multisample.vcf_ref.fa. This sequence is inferred within pandora to best represent the differences between samples in the graph.
For pandora map, the file pandora.consensus.fq.gz is used as the VCF reference.
In both cases, a VCF reference file containing a representative sequence for each gene/locus can optionally be specified, but this is not recommended.
The sequences in pandora_multisample.vcf_ref.fa are the "reference sequences" that the VCF is expressed with respect to. Because the reference is a graph and contains multiple alleles, we have to pick one of them to be the equivalent of the "wild type" when creating a VCF. These reference sequences are chosen as paths through the graph, aiming to minimize the distance between each sample and this "reference" (so that we get more SNPs called in the VCF and fewer long alleles).
By default, pandora will only genotype at sites in the graph. This means that new variation not previously seen in the panel of sequences used to construct the reference graph may be missed. De novo discovery checks for regions where the inferred sequence in a given sample has poor coverage across part of a gene/locus. In these regions, it uses the read sequences to infer new alleles to be added to an updated reference graph.
When we calculate the coverage on an allele, we are actually calculating the coverage on kmers which cover the allele. Similarly, we can look at the fraction of these kmers which have no coverage. This is represented by the GAPS field. If an allele is the true allele, not only do we expect to see (relatively) consistent/high coverage over the allele, we also do not expect to see many kmers with no coverage overlapping that allele.
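The idea behind the GAPS field can be sketched as follows. This is only an illustration of "fraction of kmers with no coverage"; how pandora selects the kmers for an allele and computes the value internally is not described here, and the coverage lists are made-up numbers.

```python
def gaps_fraction(kmer_coverages):
    """Return the fraction of kmers overlapping an allele that have zero coverage."""
    if not kmer_coverages:
        raise ValueError("need at least one kmer coverage value")
    zero = sum(1 for c in kmer_coverages if c == 0)
    return zero / len(kmer_coverages)


# Hypothetical per-kmer coverages for two candidate alleles at one site.
well_supported = [12, 11, 13, 12, 10]   # consistent coverage, no gaps
poorly_supported = [12, 0, 0, 9, 0]     # three of five kmers uncovered

gaps_true = gaps_fraction(well_supported)
gaps_false = gaps_fraction(poorly_supported)
```

Under this reading, an allele with a low gaps value and consistent coverage is the more plausible one.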
For Illumina data, most variants should have almost equal forward and reverse coverage because we expect on average half of reads to have been generated in the forward direction along the genome, and half in the reverse. For Nanopore data, sequencing biases make it more likely to have a skew between the coverage each way.
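One simple way to look at this balance is the fraction of coverage coming from forward reads. The helper below is a hypothetical illustration, not a statistic that pandora reports under this name; the input values are invented.

```python
def forward_fraction(forward_cov, reverse_cov):
    """Return the share of total coverage contributed by forward-strand reads.

    Returns None when there is no coverage at all, since the ratio is undefined.
    """
    total = forward_cov + reverse_cov
    if total == 0:
        return None
    return forward_cov / total


balanced = forward_fraction(20, 20)   # roughly what is expected for Illumina data
skewed = forward_fraction(30, 10)     # a skew of the kind more likely with Nanopore data
```

Values near 0.5 indicate balanced strands; values near 0 or 1 indicate a strong skew.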
The default output files for pandora map and pandora compare are different because they are designed for different scenarios. It doesn't really make sense to run pandora map separately on many samples and then "merge" the VCFs (as is often done with single-reference genotypers) because each VCF will be with respect to a different reference by default. However, we may want to know what gene sequences we see when we only have a single sample, and that is why we still have pandora map as an option.
This graph has nodes corresponding to genes/loci found in a sample, and edges between nodes if they were found consecutively in a read. In the future, we hope to clean up the graph and so infer the order of genes/loci in the sample genome.
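The structure described above can be sketched in a few lines. The input here is a hypothetical list of reads, each given as the ordered list of loci found on it; treating edges as undirected and counting how many reads support each one are simplifications of ours, not a description of pandora's internal representation.

```python
from collections import Counter


def build_locus_graph(reads_loci):
    """Build (nodes, edges) from the ordered loci seen on each read.

    An edge joins two loci that appear consecutively on a read; the Counter
    records how many times each adjacency was observed.
    """
    nodes = set()
    edges = Counter()
    for loci in reads_loci:
        nodes.update(loci)
        for a, b in zip(loci, loci[1:]):
            edges[tuple(sorted((a, b)))] += 1  # undirected: ignore read orientation
    return nodes, edges


# Hypothetical reads, for illustration only.
reads = [["geneA", "geneB", "geneC"], ["geneC", "geneB"], ["geneD"]]
nodes, edges = build_locus_graph(reads)
```

A locus seen only on its own, such as geneD here, becomes an isolated node.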
Pandora does not make use of paired information and expects a single read input file. We recommend combining the two files into one.
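A minimal sketch of combining two read files into one is shown below. It appends the files byte for byte, which works for plain text and also for gzip, because concatenated gzip members form a valid gzip stream. The file names are placeholders, and all inputs are assumed to share the same compression.

```python
import shutil


def concatenate_files(input_paths, output_path):
    """Append each input file to output_path, byte for byte."""
    with open(output_path, "wb") as out:
        for path in input_paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


# Example call with placeholder file names:
# concatenate_files(["sample_1.fq.gz", "sample_2.fq.gz"], "sample_combined.fq.gz")
```

Because the reads from the second file simply follow those from the first, this does not interleave mates, which is fine given that paired information is not used.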
Pandora reads the input sequence file only until it has the required level of global coverage (default: 300X). If this file is sorted by genome location (as may be the case for Illumina data), then no reads covering the later part of the genome will be read.
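To see why a sorted file is a problem, the toy model below counts how many reads are consumed before a coverage budget is reached. The stopping rule (total bases read reaching genome size times maximum coverage) and the parameter names are assumptions made for illustration; pandora's exact rule may differ.

```python
def reads_consumed(read_lengths, genome_size, max_coverage=300):
    """Return how many reads are read before the coverage budget is reached.

    Assumed rule: stop once total bases >= genome_size * max_coverage.
    """
    budget = genome_size * max_coverage
    total = 0
    for n, length in enumerate(read_lengths, start=1):
        total += length
        if total >= budget:
            return n
    return len(read_lengths)


# Toy numbers: a 1,000 bp "genome" and 5,000 reads of 150 bp each.
n_used = reads_consumed([150] * 5000, genome_size=1000)
```

In this toy case only the first 2,000 of 5,000 reads are used. If the file is sorted by position, the unused reads are exactly those from the later part of the genome.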
The reads file path needs to either be relative to the working directory or be a full path. It may be gzip compressed.
The PRG reference file is designed to look like fasta format, but contains additional numbers and spaces within sequences. These are markers indicating where variant sites are.
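To make the layout concrete, the sketch below splits one PRG sequence line into stretches of sequence and numeric markers. It assumes markers are whitespace-separated integers, and the example string is invented for illustration; it does not attempt to interpret what each marker number means.

```python
def tokenize_prg(sequence):
    """Split a PRG sequence string into ("seq", str) and ("marker", int) tokens."""
    tokens = []
    for tok in sequence.split():
        if tok.isdigit():
            tokens.append(("marker", int(tok)))
        else:
            tokens.append(("seq", tok))
    return tokens


# Hypothetical PRG sequence line, for illustration only.
tokens = tokenize_prg("AT 5 G 6 C 5 TT")
```

Tools that expect plain fasta will generally not accept such lines, which is why the file only looks like fasta.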