Releases: davidemms/OrthoFinder
OrthoFinder v3.0.1b1
This is the beta version of a major new release of OrthoFinder to allow faster and larger analyses. It allows new species to be assigned to the orthogroups previously inferred for a smaller, core set of species. This removes the need for a costly all-v-all sequence search.
The core species set should contain representatives from the major groups to be included in the larger analysis and should have the same last common ancestor as the larger species set. With this option the new genes are assigned to the core orthogroups and then the full OrthoFinder phylogenetic analysis is performed: MSA-based gene tree inference, species tree inference (using ASTRAL-Pro3, which is new to this release), and phylogenetic inference of orthologs & paralogs (gene duplication events). This release includes major work to reduce both runtime and RAM usage, including significant reductions in RAM usage for the standard OrthoFinder analysis.
New in this release
- New `--assign` option to allow species to be assigned to existing orthogroups, removing the need for a costly all-v-all sequence search.
- Major reductions in RAM usage within OrthoFinder
- The recommended way to install OrthoFinder is now using conda. The prepackaged OrthoFinder executable has been dropped. The python source code package is also still available if you are happy installing some of the required dependencies yourself.
- MSA-based tree inference is now the default since it scales better to larger analyses (DendroBLAST tree inference can be selected with `-M dendroblast`). For smaller analyses, `-M dendroblast` can often be quicker, but MSA-based tree inference is recommended if you have the time.
- When using the `--assign` option, ASTRAL-Pro3 is used for species tree inference.
- When using the `--assign` option, clade-specific orthogroups are inferred using the original OrthoFinder algorithm on all genes not assigned to existing orthogroups. These clade-specific orthogroups are inferred for the clades of species that fall between the core species, as identified by the species tree.
- Orthogroup statistics are now calculated from phylogenetically inferred orthogroups (in N0.tsv)
- To support analyses where there are fewer hits between species (e.g. for clade-specific orthogroups of unassigned genes or for OrthoFinder analyses of subsets of genes) a new gene-similarity score has been created. This is used by default for the clade-specific orthogroups and can be selected for a standard OrthoFinder analysis using the option `--scores-v2`.
- When using the `--assign` option, lower RAM usage MSA and tree inference options are used by default. These can also be used with standard OrthoFinder analyses using `-A mafft_memsave` and `-T fasttree_fastest`.
- The OrthoFinder phylogenetic analysis can be applied to complete gene families using the option `--c-homologs`. By default, OrthoFinder estimates orthogroups based on an analysis of BLAST/DIAMOND hits together with MCL clustering. These orthogroups split gene families at the last common ancestor species, and OrthoFinder's phylogenetic analysis is applied to these estimated orthogroups. The `--c-homologs` option instead attempts to identify the complete gene families, infer a gene tree for each of these families and then infer orthologs and hierarchical orthogroups using an entirely tree-based algorithm. This option should provide more accurate results and also a better understanding of the relationships between orthogroups within their larger gene families, but at a significantly increased computational burden due to the need to infer larger gene trees. This was the option used to infer the gene trees in the https://SHOOT.bio phylogenetic database (paper), where it was referred to as the "-c1" option.
- Multi-threading is used in place of multiprocessing to call external executables, reducing RAM usage.
Using the `--assign` functionality
- Perform a standard OrthoFinder run using MSA-based tree inference on a core set of species. Results from version 2 OrthoFinder can be used provided MSA-based tree inference was used (in version 3 this is the default).
- Run `orthofinder.py --core ORTHOFINDER_CORE_RESULTS --assign NEW_SPECIES`
E.g.
orthofinder.py -f ExampleData/ -n Core
orthofinder.py --core ExampleData/OrthoFinder/Results_Core/ --assign ExampleData/AdditionalSpecies
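If the two steps are driven from a script, they could be assembled as argument lists, as in the sketch below. The helper names `core_command` and `assign_command` are hypothetical. The flags and paths are the ones in the example above, and the sketch prints the commands rather than running them.

```python
import shlex


def core_command(fasta_dir, run_name):
    """Argument list for a standard run on the core species (step 1)."""
    return ["orthofinder.py", "-f", fasta_dir, "-n", run_name]


def assign_command(core_results_dir, new_species_dir):
    """Argument list for assigning new species to the core orthogroups (step 2)."""
    return ["orthofinder.py", "--core", core_results_dir, "--assign", new_species_dir]


steps = [
    core_command("ExampleData/", "Core"),
    assign_command("ExampleData/OrthoFinder/Results_Core/", "ExampleData/AdditionalSpecies"),
]

for step in steps:
    # Print the command; to execute it instead, use subprocess.run(step, check=True).
    print(shlex.join(step))
```

Keeping each command as a list avoids shell quoting problems if a directory name contains spaces.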
A guideline for the number of species for the core set is around 8-64 depending on the number of species to be added and their diversity. For a smaller OrthoFinder analysis of, for example, 16 species a core set of 4 or 5 species could be sufficient.
Runtime
A set of 80 vertebrate proteomes (1.7 million sequences) was analysed on an old desktop PC (Intel Core i5-6500, 4 cores & 8 GB RAM) in 20 hours. 7 core species were used as this gave a reasonable sampling.
The `--assign` workflow has been tested by adding 30 million sequences (equivalent to ~1,500 genomes of 20,000 sequences each) on a large server, which took approximately 1 week. Of this, the assignment of genes to existing orthogroups took approximately 2 hours (the analysis can be stopped here using the option `-og` / `--only-groups`) and the full phylogenetic orthology analysis took the remaining time. Large analyses such as these still require relatively large amounts of RAM (500 GB in this case), but this can be reduced at the cost of a longer runtime by using fewer parallel threads.
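As a quick check on the figures above, the short calculation below reproduces the genome-equivalent count and estimates what share of the run the assignment step took. It assumes "approximately 1 week" means 7 × 24 hours, so the percentage is only indicative.

```python
# Figures quoted in the release notes.
total_sequences = 30_000_000
sequences_per_genome = 20_000
assignment_hours = 2
total_hours = 7 * 24  # assumption: "approximately 1 week" taken as 168 hours

genome_equivalents = total_sequences // sequences_per_genome
assignment_fraction = assignment_hours / total_hours

print(f"{genome_equivalents} genome equivalents")
print(f"assignment step: {assignment_fraction:.1%} of total runtime")
```

On these numbers the assignment step is roughly 1% of the total, which is why stopping after it is comparatively cheap when only orthogroup membership is needed.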
OrthoFinder v2.5.5
New in this release
- Reduce the number of open files when writing orthologs to approximately one per species instead of one per species-pair. This should resolve issues related to ulimit.
- Added option `--fewer-files`: Requests that OrthoFinder only write one orthologs file per species. This file will list all orthologs in all other species (the default is one file of orthologs for each species pair, listing only the orthologs between those two species).
- Added script `scripts_of/split_ortholog_files.py` to recreate one file of orthologs per species-pair from an OrthoFinder results directory produced with the `--fewer-files` option.
- Dependency checks: Print debug info & preserve test files if dependency checks fail for tools that OrthoFinder calls.
Fixes:
OrthoFinder v2.5.4
New in this release
- Add tool create_files_for_hogs.py for creating sequence fasta files for HOGs
- Extend primary_transcripts.py script to interpret NCBI files
- Reduce RAM usage when trimming for very large alignments
- Resolve #526: Handle multiprocessing error occurring only in old versions of glibc
- Resolve #557: Progress reports were sometimes reported out of order
- Resolve #567: Check that the requested number of threads is positive
- Resolve #570: Use fork instead of spawn on Mac
- Resolve #580: Fix to allow primary transcripts script to work for NCBI isoforms labelled with letters
- Resolve #586: Use tempfile library to handle tmp folders
- Fix a problem with overwriting MSA files
OrthoFinder v2.5.2
New in this release
Added option to use DIAMOND ultra-sensitive: "-S diamond_ultra_sens". This identifies homologs for approximately 2% more genes, depending on how closely the input species are related.
OrthoFinder v2.5.1
New in this release
- Significant speed improvements for large analyses
- For analyses of ~200 species, total run times are 2-4x faster
- Parallelisation of final ortholog inference stage of algorithm (number of threads is controlled using "-a" option)
- For MSA tree inference OrthoFinder performs light trimming of the MSA. This prevents the runtime being dominated by tree inference for the largest orthogroups with very gappy MSAs.
- The tree inference using multiple sequence alignments option ("-M msa") is now comparable in speed to the default DendroBLAST method.
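The idea of light MSA trimming can be sketched as dropping columns that are almost entirely gaps. The function `trim_gappy_columns` and its `max_gap_fraction` threshold are illustrative assumptions, not OrthoFinder's actual trimming criterion.

```python
def trim_gappy_columns(sequences, max_gap_fraction=0.9):
    """Drop alignment columns whose gap fraction exceeds max_gap_fraction.

    sequences: list of equal-length aligned strings, with '-' marking gaps.
    """
    if not sequences:
        return []
    n_rows = len(sequences)
    keep = [
        col for col in range(len(sequences[0]))
        if sum(seq[col] == "-" for seq in sequences) / n_rows <= max_gap_fraction
    ]
    return ["".join(seq[col] for col in keep) for seq in sequences]


# Toy alignment: columns 2 and 4 are gaps in three of the four sequences.
alignment = [
    "MK-A-LV",
    "MK-AGLV",
    "MR-A-LI",
    "MKTA-LV",
]
trimmed = trim_gappy_columns(alignment, max_gap_fraction=0.5)
print(trimmed)
```

Removing such columns shortens the alignment passed to tree inference while keeping the well-populated positions.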
OrthoFinder v2.4.1
New in this release
- Improvements to the accuracy of phylogenetically inferred hierarchical orthogroups (HOGs)
- Allow `config_orthofinder_user.json` as an extra config file in the user's home directory to allow user-specific options and to carry user options between releases
- Allow analysis of nucleotide sequences with the `-d` option
- Resolve #453
- Resolve #475
- Resolve #476
Details
- Orthogroups are now inferred using gene trees and are found in Phylogenetic_Hierarchical_Orthogroups/N0.tsv etc. The original OGs inferred using clustering are still in Orthogroups/Orthogroups.tsv, but the N0.tsv orthogroups are ~12% more accurate and should be used instead.
- The accuracy can be increased still further (20% more accurate on Orthobench) by including outgroup species, which help with the interpretation of the rooted gene trees. The species tree should then be used to identify the correct HOG file, N??.tsv according to the correct node of the species tree.
- It is important to ensure that the species tree OrthoFinder is using is accurate so as to maximise the accuracy of the HOGs. To reanalyse with a different species tree use the options `-ft PREVIOUS_RESULTS_DIR -s SPECIES_TREE_FILE`. This runs just the final analysis steps "from trees" and is relatively quick.
- Further accuracy increases can be obtained by using a lower MCL inflation value (e.g. `-I 1.3`) since this brings more genes into the gene trees, and the HOG algorithm will split the hierarchical orthogroups if required. On Orthobench this gives ~2% increase in accuracy.
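For downstream work with the hierarchical orthogroups, an N0.tsv-style table can be read with the standard library. The sketch below assumes a tab-separated layout with three leading metadata columns (`HOG`, `OG`, `Gene Tree Parent Clade`), one column per species and comma-separated gene names. Check these assumptions against your own output. The sample table and the helper `read_hogs` are made up for illustration.

```python
import csv
import io

# Tiny made-up table in the assumed N0.tsv layout (not real OrthoFinder output).
SAMPLE_N0 = (
    "HOG\tOG\tGene Tree Parent Clade\tSpecies_A\tSpecies_B\n"
    "N0.HOG0000000\tOG0000000\tn0\tgeneA1, geneA2\tgeneB1\n"
    "N0.HOG0000001\tOG0000000\tn5\t\tgeneB2\n"
)

# Assumed non-species columns; every other column is treated as a species.
METADATA_COLUMNS = ("HOG", "OG", "Gene Tree Parent Clade")


def read_hogs(handle):
    """Return {HOG id: {species: [genes]}} from an N0.tsv-style table."""
    hogs = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        genes_by_species = {}
        for column, cell in row.items():
            if column in METADATA_COLUMNS:
                continue
            # Empty cells mean the species has no gene in this HOG.
            genes = [g.strip() for g in (cell or "").split(",") if g.strip()]
            genes_by_species[column] = genes
        hogs[row["HOG"]] = genes_by_species
    return hogs


hogs = read_hogs(io.StringIO(SAMPLE_N0))
print(hogs)
```

To read a real file, pass an open file handle, for example `open("Phylogenetic_Hierarchical_Orthogroups/N0.tsv", newline="")`, in place of the `io.StringIO` object.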
OrthoFinder v2.4.0
New in this release
Phylogenetically inferred orthogroups: OrthoFinder now creates a new directory that contains orthogroups defined at each level in the species tree. These orthogroups are inferred by examining the gene trees using the same algorithm that OrthoFinder uses to infer orthologs. Because they are inferred by analysing gene trees they are substantially more accurate than any other method available (and give an approximately 10% relative increase in accuracy on the Orthobench benchmarks compared to OrthoFinder version 2). These files are in the new results directory Phylogenetic_Hierarchical_Orthogroups/.
Because OrthoFinder now infers orthogroups at each phylogenetic level within the species tree, it is now possible to include outgroup species in your analysis. Then, to see the orthogroups for just your species of interest, use the corresponding file from the Phylogenetic_Hierarchical_Orthogroups/ directory. The clade names N1, N2, etc. can be found in Species_Tree/SpeciesTree_rooted_node_labels.txt. The use of outgroup species can further increase accuracy (~13% relative increase compared to OrthoFinder v2).
Hierarchical orthogroups are useful because, due to gene duplication events, orthogroups become more fine-grained as the species become more closely related.
This is the first of a two part series of developments to increase OrthoFinder orthogroup accuracy using the analysis of gene trees.
Which package to download:
- On Linux download OrthoFinder.tar.gz. This bundles all the required external dependencies (mcl, diamond, fastme) and python libraries and so should run immediately, without any installation being required.
- On Mac the bioconda package is probably the easiest method: See Bioconda getting started and, once bioconda is set up, run `conda install orthofinder`
- On either platform you can run the source code version but you will need to have python and the numpy & scipy libraries installed.
- On Windows the best way is to install the Windows Subsystem for Linux and then use the Linux version
More detailed instructions here: https://davidemms.github.io/orthofinder_tutorials/alternative-ways-of-getting-OrthoFinder.html
OrthoFinder v2.3.14 (stable)
This is a stable release that fixes any known issues in the previous release.
Which package to download:
- On Linux download OrthoFinder.tar.gz. This bundles all the required external dependencies (mcl, diamond, fastme) and python libraries and so should run immediately, without any installation being required.
- On Mac the bioconda package is probably the easiest method: See Bioconda getting started and, once bioconda is set up, run `conda install orthofinder`
- On either platform you can run the source code version but you will need to have python and the numpy & scipy libraries installed.
- On Windows the best way is to install the Windows Subsystem for Linux and then use the Linux version
More detailed instructions here: https://davidemms.github.io/orthofinder_tutorials/alternative-ways-of-getting-OrthoFinder.html
Issues resolved
OrthoFinder v2.3.12 (stable)
This is a stable release that fixes any known issues in the previous release.
Which package to download:
- On Linux download OrthoFinder.tar.gz. This bundles all the required external dependencies (mcl, diamond, fastme) and python libraries and so should run immediately, without any installation being required.
- On Mac the bioconda package is probably the easiest method: See Bioconda getting started and, once bioconda is set up, run `conda install orthofinder`
- On either platform you can run the source code version but you will need to have python and the numpy & scipy libraries installed.
- On Windows the best way is to install the Windows Subsystem for Linux and then use the Linux version
More detailed instructions here: https://davidemms.github.io/orthofinder_tutorials/alternative-ways-of-getting-OrthoFinder.html
Issues resolved
- Update primary_transcript.py for python3, resolves #345
- Vectorise alignment trimming, 45mins->1.5s on 6 species x 3 million base alignment
- Updates to Manual & README
- Set OPENBLAS_NUM_THREADS=1, resolves #356
- Fix reporting of external program error messages
- Exception.message deprecated in python3, resolves #375
- Correct handling of species tree without support values, resolves #379
- Improve handling of commented out species
- Check at start if open file limit is too low and inform user, resolves #384
OrthoFinder v2.3.11
Which version to download:
- On Linux download OrthoFinder.tar.gz. This bundles all the required external dependencies (mcl, diamond, fastme) and python libraries and so should run immediately, without any installation being required.
- On Mac the bioconda package is probably the easiest method: See Bioconda getting started and, once bioconda is set up, run `conda install orthofinder`
- On either platform you can run the source code version but you will need to have python and the numpy & scipy libraries installed.
New in this release
- Resolve an issue in some situations when using OrthoFinder on Mac using bioconda. OrthoFinder would find mcl/diamond but would then be unable to call them when required.
- Binary package (OrthoFinder.tar.gz) is now built for glibc versions 2.15 onwards for wider compatibility