Updating reference packages
TreeSAPP comes bundled with a core set of reference packages. Many of these were created using sequences from public repositories such as TIGRFAM, EggNOG v5.0, and FunGene. As the community continually discovers new lineages of life, however, these reference packages will inevitably become outdated.
TreeSAPP has a module for updating reference packages with new homologous sequences, called treesapp update, which is the topic of this tutorial.
Supported versions: >=0.8.7
treesapp update is the only method for updating reference packages, i.e., adding classified sequences to an existing reference package. It requires the output from treesapp assign (specifically a FASTA file containing classified sequences and the classification table) and a reference package to update.
From the set of classified sequences, treesapp update will identify those that are sufficiently novel based on sequence similarity [1]. The novel sequences are then passed to treesapp create along with the original reference sequences.
The most basic command just requires -i/--fastx_input, -r/--refpkg_path and --treesapp_output. In this example we'll use data from RefPkgs [2]. Also, since these are amino acid sequences, we need to provide the molecule type with -m/--molecule.
treesapp update --molecule prot \
--fastx_input Photosynthesis/PuhA/ENOG4111FIN_PF03967_seed.faa \
--refpkg_path Photosynthesis/PuhA/seed_refpkg/final_outputs/PuhA_build.pkl \
--treesapp_output Photosynthesis/PuhA/assign_SwissProt_PuhA/
The above command may take a while to run, so we can use arguments that are also used by treesapp create to speed things up a bit. --fast tells TreeSAPP to infer the phylogeny with FastTree rather than RAxML-NG. We can specify the number of bootstrap replicates to perform with -b/--bootstraps, and to skip that step entirely (it does take a while...) we can set it to 0. With the flag --trim_align, TreeSAPP will use BMGE to remove high-entropy (poorly conserved) columns from multiple sequence alignments.
--cluster should be used with treesapp update so that the new reference sequence candidates are clustered to remove closely related sequences. By default the proportional similarity is read from the original reference package (the 'pid' attribute), but it can be changed using the -p/--similarity argument.
treesapp update --molecule prot \
--fastx_input Photosynthesis/PuhA/ENOG4111FIN_PF03967_seed.faa \
--refpkg_path Photosynthesis/PuhA/seed_refpkg/final_outputs/PuhA_build.pkl \
--treesapp_output Photosynthesis/PuhA/assign_SwissProt_PuhA/ \
--output ./TreeSAPP_update \
--num_proc 4 --bootstraps 0 --trim_align --cluster --fast --headless --overwrite
Depending on where the query sequences classified by treesapp assign were sourced from, it may be useful to include these as reference sequences with the TreeSAPP-assigned taxonomic labels. However, in cases where these sequences were from databases it is typically beneficial to use the true taxonomic lineage [3].
If you want to include sequences that haven't been uploaded and accessioned in Entrez, you can do so by either providing a table with sequence name, organism and lineage information or modifying the sequence's FASTA header to follow the format:
SeqID lineage=cellular organisms; Domain; Phylum; Class [Organism_name]
where SeqID should be replaced with a temporary, unique accession or ID (e.g. AMH87091), "Domain; Phylum; Class" need to be replaced with the appropriate values for the organism this sequence was derived from, and Organism_name should be replaced with the appropriate organism name, such as 'Hydrogenobacter thermophilus TK-6', no quotes required.
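The header format above can be assembled programmatically. The following is a minimal sketch, not part of TreeSAPP: the function name format_lineage_header is hypothetical, and the ID, ranks and organism are illustrative values that are not meant to describe a real database record.

```python
def format_lineage_header(seq_id, ranks, organism):
    """Return a header of the form
    'SeqID lineage=cellular organisms; Domain; Phylum; Class [Organism_name]'.

    Hypothetical helper for illustration only. In a FASTA file the returned
    text follows the leading '>' character.
    """
    if not seq_id or any(ch.isspace() for ch in seq_id):
        raise ValueError("seq_id must be a non-empty token without whitespace")
    # The lineage always starts at 'cellular organisms', as in the format above.
    lineage = "; ".join(["cellular organisms"] + list(ranks))
    return f"{seq_id} lineage={lineage} [{organism}]"


# Illustrative values only.
header = format_lineage_header(
    "AMH87091",
    ["Bacteria", "Aquificae", "Aquificae"],
    "Hydrogenobacter thermophilus TK-6",
)
print(">" + header)
```

Building the header from a list of ranks keeps the separator ("; ") and the bracketed organism name consistent across all sequences you relabel.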
The lineage table can be provided using the --seqs2lineage argument. The table is tab- or comma-separated with the fields:
SeqID, Organism, Lineage, Domain, Phylum, Class, Order, Family, Genus, Species
SeqID must be the first field and it can be a prefix of the query sequences that were placed. This is useful when the queries are contigs (i.e. nucleotide sequences) but the open-reading frames were classified. Providing the contig name as the SeqID will work.
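To illustrate what prefix matching means here, the sketch below maps an open-reading frame name back to the contig-level SeqID in a lineage table. It is an assumption-laden illustration rather than TreeSAPP's own code: the function lineage_for_query, the longest-prefix tie-break and the sequence names are all hypothetical.

```python
def lineage_for_query(query_name, seqid_to_lineage):
    """Return the lineage of the longest SeqID that is a prefix of query_name.

    Hypothetical helper. Choosing the longest matching prefix is an assumption
    made here to keep 'contig_12' from capturing ORFs of 'contig_123'.
    Returns None when no SeqID is a prefix of the query name.
    """
    matches = [seq_id for seq_id in seqid_to_lineage if query_name.startswith(seq_id)]
    if not matches:
        return None
    return seqid_to_lineage[max(matches, key=len)]


# Illustrative table keyed by contig name.
table = {
    "contig_12": "d__Archaea; p__Euryarchaeota",
    "contig_123": "d__Bacteria; p__Proteobacteria",
}
print(lineage_for_query("contig_123_4", table))  # an ORF on contig_123
print(lineage_for_query("contig_12_1", table))   # an ORF on contig_12
print(lineage_for_query("scaffold_9_2", table))  # not in the table
```

Contig names that are prefixes of one another are the main thing to watch for when relying on this behaviour.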
Some of these fields are redundant and not all are required. If the “Lineage” field is provided, the separated rank fields (e.g. “Domain”, “Phylum”, etc.) are not needed and vice versa.
If the lineage field is used there are two things that can help you and TreeSAPP: first off, other non-canonical ranks (e.g. sub-order, super-phylum, etc.) should be removed from this file. Secondly, it is best to explicitly include the rank-prefixes in the lineages (d__Archaea; p__Euryarchaeota) so TreeSAPP doesn’t have to do this later. The “Organism” field is also optional and the most resolved rank is used in its place if it is missing.
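A lineage table using the "Lineage" field with explicit rank prefixes could be generated along these lines. This is a sketch under stated assumptions: the helper names are hypothetical, the file is assumed to begin with a header row naming the fields, and only the d__ and p__ prefixes appear in this tutorial (the remaining prefixes are assumed to follow the same pattern).

```python
import csv
import io

# Rank prefixes in canonical order. Only d__ and p__ are shown in the tutorial;
# the others are assumed to follow the same convention.
RANK_PREFIXES = ["d__", "p__", "c__", "o__", "f__", "g__", "s__"]


def prefixed_lineage(ranks):
    """Join canonical rank names (domain first) into a prefixed lineage string."""
    if len(ranks) > len(RANK_PREFIXES):
        raise ValueError("more ranks than canonical levels; remove non-canonical ranks first")
    return "; ".join(prefix + name for prefix, name in zip(RANK_PREFIXES, ranks))


def write_lineage_table(rows, handle):
    """Write a tab-separated table with the SeqID, Organism and Lineage fields."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["SeqID", "Organism", "Lineage"])  # SeqID must be the first field
    for seq_id, organism, ranks in rows:
        writer.writerow([seq_id, organism, prefixed_lineage(ranks)])


# Illustrative row; written to an in-memory buffer instead of a file.
buffer = io.StringIO()
write_lineage_table(
    [("contig_12", "Methanosarcina barkeri", ["Archaea", "Euryarchaeota"])],
    buffer,
)
print(buffer.getvalue(), end="")
```

To produce a file for --seqs2lineage, the same function can be called with a handle from open(path, "w", newline="").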
In both cases, the flag --skip_assign needs to be used.
treesapp update --molecule prot \
--fastx_input Photosynthesis/PuhA/ENOG4111FIN_PF03967_seed.faa \
--refpkg_path Photosynthesis/PuhA/seed_refpkg/final_outputs/PuhA_build.pkl \
--treesapp_output Photosynthesis/PuhA/assign_SwissProt_PuhA/ \
--output ./TreeSAPP_update \
--num_proc 4 --bootstraps 0 --trim_align --cluster --fast --headless --overwrite \
--delete --skip_assign
When reference packages are updated with sequences that are derived from a known source, such as an isolate or single-cell amplified genome, you have the option to replace the original reference sequences. The flag --resolve enables this mode; without it, all of the original reference sequences are retained in the updated reference package.
With the --resolve flag, when a new candidate reference sequence falls within the same cluster as an original reference sequence, the taxonomic lineages are compared. If the candidate's taxonomic lineage is better resolved (i.e. longer) than the original's, the original is removed and the candidate represents that cluster instead.
treesapp update --molecule prot \
--fastx_input Photosynthesis/PuhA/ENOG4111FIN_PF03967_seed.faa \
--refpkg_path Photosynthesis/PuhA/seed_refpkg/final_outputs/PuhA_build.pkl \
--treesapp_output Photosynthesis/PuhA/assign_SwissProt_PuhA/ \
--output ./TreeSAPP_update \
--num_proc 4 --bootstraps 0 --trim_align --cluster --fast --headless --overwrite \
--delete --skip_assign \
--resolve --seqs2lineage Photosynthesis/PuhA/SwissProt_PuhA_lineages.tsv
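The lineage comparison that --resolve performs within a cluster, as described above, can be sketched as follows. This is not TreeSAPP's implementation: the function names and sequence IDs are hypothetical, and keeping the original when both lineages are equally resolved is an assumption.

```python
def lineage_depth(lineage):
    """Count the non-empty ranks in a semicolon-separated lineage string."""
    return len([rank for rank in lineage.split(";") if rank.strip()])


def resolve_cluster(original, candidate):
    """Choose the representative of a cluster shared by an original reference
    sequence and a new candidate. Each argument is a (seq_id, lineage) tuple.

    The candidate replaces the original only if its lineage is strictly
    longer; on a tie the original is kept (an assumption of this sketch).
    """
    if lineage_depth(candidate[1]) > lineage_depth(original[1]):
        return candidate
    return original


# Illustrative sequences sharing one cluster.
original = ("ref_1", "d__Bacteria; p__Proteobacteria")
candidate = ("new_1", "d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria")
print(resolve_cluster(original, candidate)[0])
```

Because only the number of ranks is compared, a longer lineage wins even if it was assigned with less certainty, which is why this mode is suggested for sequences from a known source.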
1. There are plans to leverage the evolutionary distances between each query sequence and its most closely related reference sequence(s) rather than clustering with UCLUST.
2. You can access data from RefPkgs by cloning the repository locally. It is a good idea to clone this repository to access the full complement of TreeSAPP reference packages. Please consider contributing to this repository with any new reference packages you build!
3. The sequence names (i.e. headers) must be from either EggNOG, TIGRFAM, Pfam, FunGene or one of the many Entrez databases. Alternatively, just an Entrez-style accession ID will work.