
Output Files

Ali Pirani edited this page Dec 16, 2020 · 5 revisions


All final results are saved in a date_time_core_results directory under the output folder.


2020_03_01_16_12_11_core_results
├── core_snp_consensus
├── data_matrix
├── gubbins
├── qc_report
└── README
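
As a rough illustration, the timestamp prefix of a results directory name can be parsed back into a date. This sketch assumes the prefix follows a year_month_day_hour_minute_second layout, which is inferred from the example name above and not confirmed by the pipeline; `parse_results_dir_name` is a hypothetical helper, not part of the pipeline.

```python
from datetime import datetime

SUFFIX = "_core_results"

def parse_results_dir_name(name):
    """Return the timestamp encoded in a *_core_results directory name.

    Assumption (not confirmed by the pipeline): the prefix is laid out
    as YYYY_MM_DD_HH_MM_SS.
    """
    if not name.endswith(SUFFIX):
        raise ValueError(f"not a core results directory: {name!r}")
    stamp = name[: -len(SUFFIX)]
    return datetime.strptime(stamp, "%Y_%m_%d_%H_%M_%S")

run_time = parse_results_dir_name("2020_03_01_16_12_11_core_results")
print(run_time.isoformat())
```

Sorting directory names by the parsed timestamp is one way to pick the most recent run when several result folders exist.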


gubbins contains different combinations of core/non-core multi-fasta alignments. Alignments with "gubbins.fa" in their extension are used as input for Gubbins. Once the Gubbins jobs finish, the pipeline runs RAxML and IQ-TREE on the recombination-filtered consensus fasta files generated by Gubbins. The RAxML and IQ-TREE results generated from these files are placed in the raxml_results and iqtree_results folders, respectively.

The MSA file that is used for Gubbins and IQ-TREE is genome_aln_w_alt_allele_unmapped_gubbins.fa

  • genome_aln_w_alt_allele_unmapped_gubbins.fa: This genome alignment contains all the variants that were called against each sample. It contains all core variant positions. Non-core variant positions are substituted with a dash (denoting unmapped), or with N's if they did not meet the hard variant filter thresholds or fall under a functional class filter. Non-core variants that met all filter thresholds are substituted with the variant allele.

The different types of variants and the symbols used for each variant position are described below:

  1. If a position is unmapped in a sample, it is denoted by '-' in that sample.
  2. If a position did not meet a hard filter criterion in a sample, it is denoted by 'N' in that particular sample.
  3. If a position falls under any of the functional class filters, such as phage region, repeat region or custom mask region, it is denoted by 'N'.
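
The symbols above can be tallied per sample to get a quick feel for how much of each sequence is unmapped or filtered. The following is a minimal sketch, not part of the pipeline: `read_fasta` and `masked_fraction` are hypothetical helpers, and the toy alignment is made-up data.

```python
from collections import Counter

def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def masked_fraction(sequence):
    """Return the fraction of unmapped ('-') and filtered ('N') positions."""
    total = len(sequence)
    if total == 0:
        return {"unmapped": 0.0, "filtered": 0.0}
    counts = Counter(sequence.upper())
    return {
        "unmapped": counts["-"] / total,
        "filtered": counts["N"] / total,
    }

# Toy alignment (made-up data, not pipeline output).
toy = [">sample_A", "ACGT-NAC", ">sample_B", "ACGTNNAC"]
summary = {name: masked_fraction(seq) for name, seq in read_fasta(toy)}
for name, fractions in summary.items():
    print(name, fractions)
```

To run this on a real alignment, pass an open file handle for genome_aln_w_alt_allele_unmapped_gubbins.fa to `read_fasta` instead of the toy list.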

data_matrix contains different types of data matrices and reports that can be queried for variant diagnostics and QC plots.

  • matrices/SNP_matrix_allele.csv and Indel_matrix_allele.csv: contain allele information for each unique variant position (rownames) called in each individual sample (columns). Rownames are the positions where a variant was called in at least one of the given samples, together with the associated annotations.

  • matrices/SNP_matrix_allele_new.csv: also contains allele information for each unique variant position (rownames) called in each individual sample (columns), except that positions that were unmapped or filtered out are substituted with a dash (-) or N's.

  • matrices/SNP_matrix_code.csv and Indel_matrix_code.csv: contain one status code for each unique variant position (rownames) called in each individual sample (columns).

  • Functional_annotation_results/phage_region_positions.txt contains positions identified by PHASTER as phage regions.

  • Functional_annotation_results/repeat_region_positions.txt contains positions identified by nucmer as tandem repeats.

  • Functional_annotation_results/mask_positions.txt contains positions set by the user to mask from the final core position list.

  • Functional_annotation_results/Functional_class_filter_positions.txt is an aggregated, unique list of positions that fall under repeat, mask and phage regions (REPEAT_MASK_PHAGE).

  • Row annotation: Each row in the matrix represents a variant position and its associated annotation, divided into two parts separated by a semicolon. An example row annotation is shown below:


Type of SNP at POS > ALT functional=PHAGE_REPEAT_MASK locus_tag=locus_id strand=strand;ALT|Effect|Impact|GeneID|Nrchange|Aachange|Nrgenepos|AAgenepos|gene_symbol|product	Sample_name

Coding SNP at 31356 > T functional=NULL_NULL_NULL locus_tag=USA300HOU_0023 strand=+;T|stop_gained|HIGH|USA300HOU_0023|c.373C>T|p.Gln125*|373/2361|125/786|null or hypothetical protein|possible 5'-nucleotidase;	0

The first part contains information such as the type of SNP, the position, the variant allele found, the functional annotation, the locus tag for the gene and the strand. The second part contains the snpEff annotation for the variant found and its impact on the gene.
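
The two-part layout described above can be split programmatically. The sketch below is a hedged example rather than the pipeline's own parser: `parse_row_annotation` is a hypothetical helper, the regular expression is derived only from the single example row shown above, and it assumes one snpEff annotation block per row.

```python
import re

# Pattern for the first part, inferred from the example row (an assumption).
FIRST_PART = re.compile(
    r"^(?P<variant_type>.+?) at (?P<pos>\d+) > (?P<alt>\S+) "
    r"functional=(?P<functional>\S+) locus_tag=(?P<locus_tag>\S+) "
    r"strand=(?P<strand>\S+)$"
)

# Field names taken from the template row annotation.
SNPEFF_FIELDS = [
    "ALT", "Effect", "Impact", "GeneID", "Nrchange",
    "Aachange", "Nrgenepos", "AAgenepos", "gene_symbol", "product",
]

def parse_row_annotation(annotation):
    """Split a row annotation into its positional and snpEff parts."""
    first, _, second = annotation.partition(";")
    match = FIRST_PART.match(first.strip())
    if match is None:
        raise ValueError(f"unexpected annotation layout: {first!r}")
    record = match.groupdict()
    record["pos"] = int(record["pos"])
    # Drop the trailing semicolon, then split the pipe-delimited fields.
    values = second.strip().rstrip(";").split("|")
    record["snpeff"] = dict(zip(SNPEFF_FIELDS, values))
    return record

example = (
    "Coding SNP at 31356 > T functional=NULL_NULL_NULL "
    "locus_tag=USA300HOU_0023 strand=+;"
    "T|stop_gained|HIGH|USA300HOU_0023|c.373C>T|p.Gln125*|373/2361|125/786|"
    "null or hypothetical protein|possible 5'-nucleotidase;"
)
record = parse_row_annotation(example)
print(record["pos"], record["snpeff"]["Effect"], record["snpeff"]["Impact"])
```

Rows whose annotations deviate from the example layout raise a `ValueError`, which makes unexpected formats visible instead of silently misparsed.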

The different status codes are:

Code  Description
-1    Unmapped base denoted by a dash "-"
 0    Reference allele base
 1    Core SNP
 2    SNP called but filtered by hard variant filter
 3    Non-core or True variant in this sample but filtered out due to it being filtered in another sample
 4    SNP proximate to an Indel masked with an N
-2    Phage Region Masked with an N
-3    FQ Region Masked with an N
-4    MQ Region Masked with an N
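
For quick diagnostics, the codes in one sample column can be counted and labelled. This is a minimal sketch under stated assumptions: the descriptions are shortened from the list above, `summarize_codes` is a hypothetical helper, and the toy column is made-up data rather than pipeline output.

```python
from collections import Counter

# Short labels for the status codes listed above.
STATUS_CODES = {
    -1: "Unmapped base",
    0: "Reference allele base",
    1: "Core SNP",
    2: "SNP called but filtered by hard variant filter",
    3: "Non-core or true variant, filtered in another sample",
    4: "SNP proximate to an indel",
    -2: "Phage region",
    -3: "FQ region",
    -4: "MQ region",
}

def summarize_codes(codes):
    """Return (code, description, count) tuples sorted by code."""
    counts = Counter(int(code) for code in codes)
    return [
        (code, STATUS_CODES.get(code, "unknown code"), count)
        for code, count in sorted(counts.items())
    ]

# Toy sample column (made-up data).
toy_column = [0, 0, 1, -1, 2, 1, -2]
summary = summarize_codes(toy_column)
for code, description, count in summary:
    print(code, description, count)
```

The values of a single sample column from SNP_matrix_code.csv could be passed to `summarize_codes` in the same way, once the file has been read with a CSV reader.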

Some toy examples of how codes are arranged for different types of variants, and how they are represented in the allele matrix, are shown below:

  • Reference Allele: [figure: toy example]

  • Unmapped Positions: [figure: toy example]

  • Core Variants: [figure: toy example]

  • Filtered Positions: [figure: toy example]

core_snp_consensus contains a core_vcf folder with the annotated core VCF files that were used to generate the core SNP consensus fasta results. The other folders contain different combinations of core/non-core consensus fasta files for individual samples. The consensus files from these folders are concatenated to generate the multiple sequence alignment files, which are then placed in the gubbins folder and used as input for the Gubbins jobs.
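
The concatenation step can be pictured as follows. This is an illustrative sketch only and not the pipeline's implementation: `build_alignment` is a hypothetical helper, the sample names and sequences are made up, and the equal-length check is an added assumption about what a valid alignment requires.

```python
def build_alignment(consensus_by_sample):
    """Join per-sample consensus sequences into multi-FASTA text.

    Raises ValueError if the sequences differ in length, since every
    row of an alignment must cover the same positions.
    """
    lengths = {len(seq) for seq in consensus_by_sample.values()}
    if len(lengths) > 1:
        raise ValueError(f"consensus lengths differ: {sorted(lengths)}")
    records = [
        f">{name}\n{seq}\n" for name, seq in consensus_by_sample.items()
    ]
    return "".join(records)

# Toy consensus sequences (made-up data).
toy_consensus = {"sample_A": "ACGT-N", "sample_B": "ACGTAN"}
alignment_text = build_alignment(toy_consensus)
print(alignment_text, end="")
```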
