FAQ

General

Q: Where can I find gene presence/absence information?

For pandora compare, the pangenome matrix file pandora_multisample.matrix indicates which genes/loci were found in each sample. It has no entries for genes/loci that were not found in any sample. For pandora map there is no matrix file, but pandora.consensus.fq.gz contains the sequence of every gene/locus found.
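
As a rough illustration, the matrix can be loaded into one set of genes/loci per sample. This sketch assumes the file is tab-separated, with a header row whose first column is the gene/locus name and whose remaining columns are samples, and with 0/1 entries; check your own file before relying on that layout. The gene and sample names below are made up.

```python
import csv
import io

def load_presence_absence(handle):
    """Return {sample: set of genes/loci present} from a tab-separated 0/1 matrix."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # assumption: first column holds the gene/locus name
    present = {sample: set() for sample in samples}
    for row in reader:
        if not row:
            continue
        gene, values = row[0], row[1:]
        for sample, value in zip(samples, values):
            if value.strip() == "1":
                present[sample].add(gene)
    return present

# Hypothetical content; a real run would open pandora_multisample.matrix instead.
example = "\tsample1\tsample2\ngeneA\t1\t0\ngeneB\t1\t1\n"
presence = load_presence_absence(io.StringIO(example))
print(sorted(presence["sample1"]))
print(sorted(presence["sample2"]))
```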

Q: What is the reference for the VCF file?

For pandora compare, the VCF reference sequence is found in pandora_multisample.vcf_ref.fa. This sequence is inferred within pandora to best represent the differences between samples in the graph. For pandora map, the file pandora.consensus.fq.gz is used as the VCF reference. In both cases, a VCF reference file containing a representative sequence for each gene/locus can optionally be specified, but this is not recommended.

Q: What are the sequences in `pandora_multisample.vcf_ref.fa`?

The sequences in pandora_multisample.vcf_ref.fa are the "reference sequences" that the VCF is expressed relative to. Because the reference is a graph and contains multiple alleles, we have to pick one of them to be the equivalent of the "wild type" when creating a VCF. These reference sequences are chosen as paths through the graph, aiming to minimize the distance between each sample and this "reference" (so that we get more SNPs called in the VCF and fewer long alleles).

Q: What does the de novo discovery do, exactly?

By default, pandora will only genotype at sites already in the graph. This means that new variation not previously seen in the panel of sequences used to construct the reference graph may be missed. De novo discovery checks for regions where the inferred sequence in a given sample has poor coverage across part of a gene/locus. In these regions, it uses the read sequences to infer new alleles to be added to an updated reference graph.
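
To make "poor coverage across part of a gene/locus" concrete, here is a minimal sketch that finds runs of low per-position coverage. It is not pandora's implementation: the per-position coverage list, the `min_cov` threshold and the `min_len` cutoff are all hypothetical.

```python
def low_coverage_regions(coverage, min_cov=3, min_len=2):
    """Return half-open (start, end) intervals where coverage stays below min_cov."""
    regions = []
    start = None
    for pos, cov in enumerate(coverage):
        if cov < min_cov:
            if start is None:
                start = pos  # a low-coverage run begins here
        elif start is not None:
            if pos - start >= min_len:
                regions.append((start, pos))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(coverage) - start >= min_len:
        regions.append((start, len(coverage)))
    return regions

# Made-up coverage values along one inferred gene sequence.
coverage = [20, 19, 1, 0, 0, 18, 21, 2, 20]
regions = low_coverage_regions(coverage)
print(regions)
```

Only the run at positions 2 to 4 is long enough to be reported; the single low position near the end falls under the hypothetical `min_len` cutoff.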

Q: What is the GAPS value in the VCF files?

When we calculate the coverage on an allele, we are actually calculating the coverage on kmers which cover the allele. Similarly, we can look at the fraction of these kmers which have no coverage. This is represented by the GAPS field. If an allele is the true allele, not only do we expect to see (relatively) consistent/high coverage over the allele, we also do not expect to see many kmers with no coverage overlapping that allele.
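
The description above can be sketched as a simple fraction. This only illustrates the idea, not pandora's code; the function name and the coverage values are hypothetical.

```python
def gaps_fraction(kmer_coverages):
    """Fraction of the kmers overlapping an allele that have zero coverage."""
    if not kmer_coverages:
        raise ValueError("need at least one kmer coverage value")
    uncovered = sum(1 for cov in kmer_coverages if cov == 0)
    return uncovered / len(kmer_coverages)

# Made-up kmer coverages for two alleles at one site.
ref_allele = [12, 0, 0, 9]
alt_allele = [11, 13, 10, 12]
ref_gaps = gaps_fraction(ref_allele)
alt_gaps = gaps_fraction(alt_allele)
print(ref_gaps, alt_gaps)
```

In this toy case the second allele has no uncovered kmers, which is what we would expect of the true allele.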

Q: What does it mean when a variant has almost equal forward and reverse coverage?

For Illumina data, most variants should have almost equal forward and reverse coverage because we expect on average half of reads to have been generated in the forward direction along the genome, and half in the reverse. For Nanopore data, sequencing biases make it more likely to have a skew between the coverage each way.
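
One way to look at strand balance is the share of coverage coming from forward reads. The sketch below is an illustration only; the function name and the coverage numbers are hypothetical, and no particular cutoff for "too skewed" is implied.

```python
def forward_fraction(fwd_cov, rev_cov):
    """Share of total coverage that comes from forward-strand reads."""
    total = fwd_cov + rev_cov
    if total == 0:
        return None  # no coverage, so there is no strand balance to report
    return fwd_cov / total

balanced = forward_fraction(48, 52)  # close to one half
skewed = forward_fraction(45, 5)     # mostly forward reads
print(balanced, skewed)
```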

Q: Why don’t we get the same output when mapping single or multiple samples?

The default output files for pandora map and pandora compare are different because they are designed for different scenarios. It doesn't really make sense to run pandora map separately on many samples and then "merge" the VCFs (as is often done with single reference genotypers) because each VCF will be with respect to a different reference by default. However, we may want to know what gene sequences we see when we only have a single sample and that is why we still have pandora map as an option.

Q: What is `pandora.pangraph.gfa` and why is it so messy?

This graph has nodes corresponding to genes/loci found in a sample, and edges between nodes if they were found consecutively in a read. In the future, we hope to clean up the graph and so infer the order of genes/loci in the sample genome.
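
To explore the graph outside a viewer, the nodes and edges can be read directly. This sketch assumes the file uses standard GFA records, with `S` lines for nodes and `L` lines for edges; it makes no assumption about any extra tags. The gene names in the example are made up.

```python
from collections import defaultdict

def read_gfa(lines):
    """Collect node names and an undirected adjacency map from GFA S and L records."""
    nodes = set()
    neighbours = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 2:
            nodes.add(fields[1])
        elif fields[0] == "L" and len(fields) >= 4:
            # L records: from-name, from-orientation, to-name, to-orientation, overlap
            src, dst = fields[1], fields[3]
            neighbours[src].add(dst)
            neighbours[dst].add(src)
    return nodes, neighbours

# Hypothetical content; a real run would iterate over open("pandora.pangraph.gfa").
example = [
    "H\tVN:Z:1.0\n",
    "S\tgeneA\t*\n",
    "S\tgeneB\t*\n",
    "S\tgeneC\t*\n",
    "L\tgeneA\t+\tgeneB\t+\t0M\n",
    "L\tgeneB\t+\tgeneC\t-\t0M\n",
]
nodes, neighbours = read_gfa(example)
print(len(nodes), sorted(neighbours["geneB"]))
```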

Q: Is there a paired-end mode for Illumina data?

Pandora does not make use of paired information and expects a single read input file. We recommend combining the two files into one.
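
Combining the two files can be done by plain concatenation. The sketch below copies the inputs byte for byte, which works for uncompressed files that end in a newline and also for gzip files, since concatenated gzip streams form a valid gzip file. All inputs must be of the same kind. The file names and reads are hypothetical.

```python
import gzip
import os
import shutil
import tempfile

def concatenate_files(inputs, output):
    """Append each input file to output, byte for byte."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as handle:
                shutil.copyfileobj(handle, out)

# Small demonstration with two made-up gzip-compressed read files.
with tempfile.TemporaryDirectory() as tmp:
    r1 = os.path.join(tmp, "reads_1.fq.gz")
    r2 = os.path.join(tmp, "reads_2.fq.gz")
    combined = os.path.join(tmp, "reads_combined.fq.gz")
    with gzip.open(r1, "wt") as fh:
        fh.write("@r1/1\nACGT\n+\nIIII\n")
    with gzip.open(r2, "wt") as fh:
        fh.write("@r1/2\nTTGA\n+\nIIII\n")
    concatenate_files([r1, r2], combined)
    with gzip.open(combined, "rt") as fh:
        combined_text = fh.read()

print(combined_text.count("\n") // 4, "records in the combined file")
```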

Q: Why is pandora missing genes in part of the genome?

Pandora reads the input sequence file only until it has the required level of global coverage (default: 300X). If this file is sorted by genome location (as may be the case for Illumina data), then reads covering the later part of the genome will never be read.
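
A toy calculation shows why a position-sorted file causes this. The sketch assumes global coverage is estimated as total bases read divided by genome size, which is an assumption made for illustration rather than a statement of how pandora computes it; the genome size, read length and 3X cap are made-up numbers.

```python
def reads_used(read_lengths, genome_size, max_covg):
    """Count reads consumed before total bases / genome_size reaches max_covg."""
    total_bases = 0
    for count, length in enumerate(read_lengths, start=1):
        total_bases += length
        if total_bases / genome_size >= max_covg:
            return count
    return len(read_lengths)

genome_size = 1000
read_length = 100
starts = list(range(0, 900, 10))  # 90 reads, sorted by genome position
n_used = reads_used([read_length] * len(starts), genome_size, max_covg=3)
last_covered = starts[n_used - 1] + read_length
print(n_used, last_covered)
```

Here only the first 30 of the 90 reads are consumed, and nothing past position 390 of the 1000 bp toy genome is covered.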

Q: Why can't pandora open my reads file?

The reads file path needs to either be relative to the working directory or be a full path. It may be gzip compressed.

Q: What are the numbers in my PRG file?

The PRG reference file is designed to look like fasta format, but contains additional numbers and spaces within sequences. These are markers indicating where variant sites are.
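
To see where the markers fall in a sequence, the numbers can be separated from the bases. This sketch only tokenizes the text and does not interpret what each number means; the example string is made up rather than taken from a real PRG file.

```python
import re

def split_prg(prg):
    """Split a PRG string into ('marker', int) and ('seq', str) tokens."""
    tokens = []
    for piece in re.findall(r"\d+|[A-Za-z]+", prg):
        if piece.isdigit():
            tokens.append(("marker", int(piece)))
        else:
            tokens.append(("seq", piece))
    return tokens

example = "AC 5 G 6 T 5 TT"  # hypothetical sequence with numeric markers
tokens = split_prg(example)
markers = [value for kind, value in tokens if kind == "marker"]
bases = "".join(value for kind, value in tokens if kind == "seq")
print(markers, bases)
```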
