Ryan Wick edited this page Sep 18, 2023 · 19 revisions

The problem

Phylogenetic trees of bacteria aim to reconstruct their vertical (i.e. parent cell to daughter cell) evolutionary history. If bacteria only evolved vertically, you could confidently use their entire genomes when building a tree. However, many bacteria also exchange DNA horizontally, e.g. via phage integration or homologous recombination. This means that different parts of a bacterial genome can have different evolutionary histories, so building a tree from entire genomes can yield a confused and distorted tree. So if you have a collection of bacterial genomes and want a tree that accurately reflects their vertical evolutionary history, you should use only the parts of their genomes that were inherited vertically and ignore the horizontally acquired parts.

Programs such as Gubbins and ClonalFrameML have been developed to solve this exact problem, and they can work very well in some circumstances. But they require closely related genomes and don't scale well to very large numbers of genomes. E.g. you can't run Gubbins/ClonalFrameML on genomes that span a species such as E. coli because there's too much variation, and you can't run them on a collection of 10,000 genomes because it would take too long.

The solution

Verticall is a tool for building recombination-free trees, and it works in contexts that Gubbins and ClonalFrameML do not. In addition to finding/masking recombination from outside the genomes (i.e. regions with too much sequence divergence), it can also find/mask recombination from within the genomes (i.e. regions with too little sequence divergence). This allows it to handle more diverse datasets than other tools.

Briefly, Verticall works by conducting pairwise alignment between genome assemblies, non-parametrically determining the vertical-only genomic distance and then labelling regions of the assemblies as either vertical or horizontal (see Pairwise assembly comparison for details).
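To make the idea concrete, here is a deliberately simplified Python sketch. It is not Verticall's actual algorithm (see Pairwise assembly comparison for that). It assumes the pairwise alignment has already been cut into windows with one distance per window, takes the most common distance as the vertical distance, and flags windows that are far from it in either direction. The function name `label_windows` and the parameters `bin_width` and `tolerance` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def label_windows(window_distances, bin_width=0.001, tolerance=2):
    """Toy illustration: label each alignment window as vertical or horizontal.

    window_distances: one sequence distance per window (hypothetical input).
    bin_width and tolerance are illustrative parameters, not Verticall settings.
    """
    # Bin the per-window distances into a histogram (no distribution is assumed).
    bins = [round(d / bin_width) for d in window_distances]
    # Use the most common bin as the estimate of the vertical-only distance.
    peak_bin, _ = Counter(bins).most_common(1)[0]
    vertical_distance = peak_bin * bin_width
    # Windows far from the peak, whether too divergent or too similar,
    # are labelled horizontal.
    labels = [
        "vertical" if abs(b - peak_bin) <= tolerance else "horizontal"
        for b in bins
    ]
    return vertical_distance, labels


# Made-up distances for nine windows along one pairwise alignment.
distances = [0.010, 0.011, 0.010, 0.050, 0.048, 0.010, 0.009, 0.000, 0.010]
vertical_distance, labels = label_windows(distances)
print(vertical_distance, labels)
```

In this made-up example, the two highly divergent windows and the one identical window are labelled horizontal, and the rest are treated as vertically inherited.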

Verticall can be run in two distinct ways, each with its own advantages:

  1. Conduct all pairwise assembly comparisons and use the results to build a distance matrix. You can then use that distance matrix to build a distance tree. This mode is appropriate for diverse datasets, even spanning multiple species. See Distance tree workflow for details.
  2. Compare each assembly to a reference genome to mask horizontal regions from a SNP matrix. You can then use that masked SNP matrix to build an ML tree. This mode is appropriate for very large datasets with thousands of genomes. See Alignment tree workflow for details.
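As a rough illustration of what masking means in the second mode, the sketch below replaces horizontal regions of each aligned sequence with a placeholder character so that a downstream tree builder ignores them. This is a toy example under stated assumptions, not Verticall's implementation: the `mask_regions` helper, the 0-based end-exclusive coordinates, the `N` mask character and the example sequences are all hypothetical.

```python
def mask_regions(sequence, horizontal_regions, mask_char="N"):
    """Return sequence with each (start, end) region replaced by mask_char.

    Regions are 0-based and end-exclusive (an assumption made for this sketch).
    """
    masked = list(sequence)
    for start, end in horizontal_regions:
        # Clamp to the sequence bounds so out-of-range regions are harmless.
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = mask_char
    return "".join(masked)


# Made-up alignment against a reference, plus made-up horizontal regions.
alignment = {
    "reference": "ACGTACGTAC",
    "sample_1": "ACGTTTGTAC",
    "sample_2": "ACCTACGTAA",
}
horizontal = {"sample_1": [(3, 7)], "sample_2": []}

masked_alignment = {
    name: mask_regions(seq, horizontal.get(name, []))
    for name, seq in alignment.items()
}
for name, seq in masked_alignment.items():
    print(name, seq)
```

Only the region flagged in `sample_1` is masked. The other sequences pass through unchanged, so their vertically inherited sites still contribute to the tree.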

Some caveats

Before you dive into using Verticall, here are some things to keep in mind:

  • Verticall doesn't build trees itself. It just produces a distance matrix (if you used the distance tree workflow) or a masked alignment (if you used the alignment tree workflow). The actual tree-building needs to be done by a separate program.
  • Since Verticall is assembly-based, you'll need to assemble your genomes before you can use it. Good assemblies (e.g. with a big N50) are better, but fragmented assemblies are okay.
  • Verticall takes a more broad-brushstroke approach to finding recombination than Gubbins, i.e. it finds/masks recombination in larger chunks. This means that if your dataset is suitable for Gubbins (i.e. a small and closely related group of genomes), then Gubbins will probably give you better results.
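As the first caveat notes, the tree itself comes from a separate program. To show what a distance-based tree builder does with a distance matrix, here is a minimal UPGMA sketch in pure Python. It is for illustration only: UPGMA is just one simple distance method, the `upgma` function is hypothetical, the distances are made up, and a real analysis would use a dedicated tree-building program.

```python
def upgma(names, dist):
    """Build a Newick topology (no branch lengths) from a distance matrix by UPGMA."""
    # Each cluster id maps to (Newick string, number of genomes in the cluster).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Pairwise distances between current clusters, keyed by the pair of ids.
    d = {
        frozenset((i, j)): dist[i][j]
        for i in range(len(names))
        for j in range(i + 1, len(names))
    }
    next_id = len(names)
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        pair = min(d, key=d.get)
        a, b = sorted(pair)
        newick_a, size_a = clusters.pop(a)
        newick_b, size_b = clusters.pop(b)
        del d[pair]
        # Distance from the merged cluster to every other cluster is the
        # size-weighted average of the two old distances.
        for c in clusters:
            d_ac = d.pop(frozenset((a, c)))
            d_bc = d.pop(frozenset((b, c)))
            d[frozenset((next_id, c))] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        clusters[next_id] = (f"({newick_a},{newick_b})", size_a + size_b)
        next_id += 1
    newick, _ = next(iter(clusters.values()))
    return newick + ";"


# Made-up symmetric distance matrix for four genomes.
names = ["A", "B", "C", "D"]
matrix = [
    [0.00, 0.02, 0.10, 0.12],
    [0.02, 0.00, 0.10, 0.12],
    [0.10, 0.10, 0.00, 0.04],
    [0.12, 0.12, 0.04, 0.00],
]
tree = upgma(names, matrix)
print(tree)
```

Here the two closest pairs (A with B, and C with D) are joined first and then merged at the root.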

Where to begin?

Are you new to Verticall and interested in trying it out? If so, you'll first need to get it installed, so check out the Software requirements and installation page. Then head over to the Quick start page for a concise overview of how to run Verticall.

A manuscript for Verticall is in the works, so stay tuned! If you need to cite it in the meantime, you can cite this repo using this DOI:

DOI