OrthoFinder v3.0.1b1
Pre-release v3.0.1b1
This is the beta version of a major new release of OrthoFinder that allows faster and larger analyses. It allows new species to be assigned to the orthogroups previously inferred for a smaller, core set of species, which removes the need for a costly all-v-all sequence search.
The core species set should contain representatives from the major groups to be included in the larger analysis and should have the same last common ancestor as the larger species set. With this approach, the new genes are assigned to the core orthogroups and then the full OrthoFinder phylogenetic analysis is performed: MSA-based gene tree inference, species tree inference (using ASTRAL-Pro3, which is new to this release), and phylogenetic inference of orthologs and paralogs (gene duplication events). This release includes major work to reduce both runtime and RAM usage, including significant reductions in RAM usage for the standard OrthoFinder analysis.
New in this release
- New `--assign` option to allow species to be assigned to existing orthogroups, removing the need for a costly all-v-all sequence search.
- Major reductions in RAM usage within OrthoFinder
- The recommended way to install OrthoFinder is now using conda. The prepackaged OrthoFinder executable has been dropped. The Python source code package is still available if you are happy installing some of the required dependencies yourself.
- MSA-based tree inference is now the default since it scales better to larger analyses (DendroBLAST tree inference can be selected with `-M dendroblast`). For smaller analyses, `-M dendroblast` can often be quicker, but MSA-based tree inference is recommended if you have the time.
- When using the `--assign` option, ASTRAL-Pro3 is used for species tree inference.
- When using the `--assign` option, clade-specific orthogroups are inferred using the original OrthoFinder algorithm on all genes not assigned to existing orthogroups. These clade-specific orthogroups are inferred for the clades of species that fall between the core species, as identified by the species tree.
- Orthogroup statistics are now calculated from phylogenetically inferred orthogroups (in `N0.tsv`)
- To support analyses where there are fewer hits between species (e.g. for clade-specific orthogroups of unassigned genes or for OrthoFinder analyses of subsets of genes) a new gene-similarity score has been created. This is used by default for the clade-specific orthogroups and can be selected for a standard OrthoFinder analysis using the option `--scores-v2`.
- When using the `--assign` option, lower RAM usage MSA and tree inference options are used by default. These can also be used with standard OrthoFinder analyses using `-A mafft_memsave` and `-T fasttree_fastest`.
- The OrthoFinder phylogenetic analysis can be applied to complete gene families using the option `--c-homologs`. By default, OrthoFinder estimates orthogroups based on an analysis of BLAST/DIAMOND hits together with MCL clustering. These orthogroups split gene families at the last common ancestor species, and OrthoFinder's phylogenetic analysis is applied to these estimated orthogroups. The `--c-homologs` option instead attempts to identify the complete gene families, infer a gene tree for each of these families and then infer orthologs and hierarchical orthogroups using an entirely tree-based algorithm. This option should provide more accurate results and also a better understanding of the relationships between orthogroups within their larger gene families, but at a significantly increased computational cost due to the need to infer larger gene trees. This was the option used to infer the gene trees in the https://SHOOT.bio phylogenetic database (paper), where it was referred to as the "-c1" option.
- Multi-threading is used in place of multiprocessing to call external executables, reducing RAM usage.
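The last point in the list above (threads rather than worker processes for calling external executables) can be illustrated with a minimal sketch. This is not OrthoFinder's own code: the function names `run_command` and `run_all` are hypothetical, and the Python interpreter stands in for an external program such as an aligner or tree-inference tool. The idea is that a thread only has to wait for its child process to finish, so a pool of threads can launch external programs without starting additional Python worker processes, each of which would carry its own memory footprint.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_command(argv):
    """Run one external command; return (return code, stripped stdout)."""
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    return result.returncode, result.stdout.strip()


def run_all(commands, n_threads=4):
    """Run external commands concurrently from a pool of threads.

    Each thread blocks while its child process runs, so no extra Python
    worker processes are created just to launch the commands.
    Results are returned in the same order as `commands`.
    """
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(run_command, commands))


# Stand-in commands (assumption for illustration): the Python interpreter
# plays the role of the external executable.
commands = [[sys.executable, "-c", f"print({i} * {i})"] for i in range(4)]
results = run_all(commands, n_threads=2)
```

Lowering `n_threads` limits how many external programs run at once, which is the same trade-off between memory use and runtime that applies to large analyses.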
Using the `--assign` functionality
- Perform a standard OrthoFinder run using MSA-based tree inference on a core set of species. Results from version 2 OrthoFinder can be used provided MSA-based tree inference was used (in version 3 this is the default).
- Run `orthofinder.py --core ORTHOFINDER_CORE_RESULTS --assign NEW_SPECIES`
E.g.
orthofinder.py -f ExampleData/ -n Core
orthofinder.py --core ExampleData/OrthoFinder/Results_Core/ --assign ExampleData/AdditionalSpecies
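For scripted pipelines, the two steps above could be assembled programmatically. The sketch below only builds the argument lists and does not run OrthoFinder. The helper names are hypothetical, and the results location (`<input dir>/OrthoFinder/Results_<name>`) is an assumption taken from the example paths above rather than a documented guarantee. The only flags used are those that appear in these release notes (`-f`, `-n`, `--core`, `--assign`, and `-og` for stopping after orthogroup assignment).

```python
from pathlib import PurePosixPath


def core_run_command(fasta_dir, run_name):
    """Step 1: standard OrthoFinder run on the core species set."""
    return ["orthofinder.py", "-f", str(fasta_dir), "-n", run_name]


def expected_results_dir(fasta_dir, run_name):
    """Assumed results location, following the example paths in the text."""
    return PurePosixPath(fasta_dir) / "OrthoFinder" / f"Results_{run_name}"


def assign_command(core_results_dir, new_species_dir, only_groups=False):
    """Step 2: assign new species to the core orthogroups.

    With only_groups=True the `-og` flag is appended, which the release
    notes describe as stopping after genes are assigned to orthogroups.
    """
    cmd = ["orthofinder.py", "--core", str(core_results_dir),
           "--assign", str(new_species_dir)]
    if only_groups:
        cmd.append("-og")
    return cmd


step1 = core_run_command("ExampleData/", "Core")
core_results = expected_results_dir("ExampleData/", "Core")
step2 = assign_command(core_results, "ExampleData/AdditionalSpecies")
```

Each list can then be passed to `subprocess.run(cmd, check=True)` in order, so that the second step only starts if the core run succeeded.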
A guideline for the number of species in the core set is around 8-64, depending on the number of species to be added and their diversity. For a smaller OrthoFinder analysis of, for example, 16 species, a core set of 4 or 5 species could be sufficient.
Runtime
A set of 80 vertebrate proteomes (1.7 million sequences) was analysed on an old desktop PC (Intel Core i5-6500, 4 cores & 8 GB RAM) in 20 hours. Seven core species were used, as this gave a reasonable sampling.
The `--assign` option has also been tested by adding 30 million sequences (equivalent to ~1,500 genomes of 20,000 sequences each) on a large server, which took approximately 1 week. Of this, the assignment of genes to existing orthogroups took approximately 2 hours (the analysis can be stopped here using the option `-og` / `--only-groups`) and the full phylogenetic orthology analysis took the remaining time. Large analyses such as these still require relatively large amounts of RAM (500 GB in this case), but this can be reduced at the cost of a longer runtime by using fewer parallel threads.
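The figures quoted above can be checked with some back-of-envelope arithmetic. The sketch below uses only the numbers given in the text and treats "approximately 1 week" as exactly 168 hours, which is an assumption. The two runs used very different hardware, so the throughput values are not a like-for-like comparison.

```python
HOURS_PER_WEEK = 7 * 24  # "approximately 1 week" taken as exactly 168 hours

# Figures quoted in the text.
desktop_run = {"sequences": 1_700_000, "hours": 20, "proteomes": 80}
server_run = {"sequences": 30_000_000, "hours": HOURS_PER_WEEK}
assignment_hours = 2
sequences_per_genome = 20_000


def throughput(run):
    """Average sequences processed per hour over the whole analysis."""
    return run["sequences"] / run["hours"]


# 30 million sequences expressed as genomes of 20,000 sequences each.
genome_equivalents = server_run["sequences"] // sequences_per_genome

# Share of the server run spent assigning genes to existing orthogroups.
assignment_fraction = assignment_hours / server_run["hours"]

# Mean proteome size in the 80-species vertebrate analysis.
mean_proteome_size = desktop_run["sequences"] / desktop_run["proteomes"]

print(f"Genome equivalents: {genome_equivalents}")
print(f"Assignment share of runtime: {assignment_fraction:.1%}")
print(f"Desktop throughput: {throughput(desktop_run):,.0f} sequences/hour")
print(f"Server throughput: {throughput(server_run):,.0f} sequences/hour")
```

On these numbers, orthogroup assignment accounts for roughly 1% of the total runtime of the large analysis, so stopping with `-og` gives orthogroup membership at a small fraction of the cost of the full phylogenetic analysis.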