Performance test
We achieved significant improvements in the runtime of PhyloProfile v2.0 by optimizing the underlying code. These optimizations included the use of more efficient data types, eliminating unnecessary processing steps, and utilising faster packages.
Using simulated data with increasing numbers of taxa and genes, we observed that as the dataset size grows, the differences in runtimes between v2.0 and older versions (v1.20 and v1.18) become more pronounced. Notably, v2.0 has a noticeably lower runtime slope, indicating a more scalable performance as the dataset grows.
PhyloProfile v2.0 also shows a dramatic improvement in input loading time.
Dataset | Version | File size | Number of data points | Loading time (min:sec) |
---|---|---|---|---|
5000 x 5000 | v1.18 | 766 MB | 12.6 million | 4:51 |
5000 x 5000 | v1.20 | 766 MB | 12.6 million | 5:09 (*) |
5000 x 5000 | v2.0 | 766 MB | 12.6 million | 0:55 (*) |
10,000 x 10,000 | v2.0 | 3.0 GB | 50.4 million | 4:21 (*) |
(*) including the automatic prediction of reference species
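As a worked example, the loading times in the table can be converted to seconds to derive speed-up factors and a rough loading throughput. This is a minimal sketch: the helper `to_seconds` and the dictionary names are illustrative and not part of PhyloProfile. Note also that the v1.18 time does not include the automatic prediction of reference species, so its comparison with v2.0 is only approximate.

```python
# Loading times copied from the table above (min:sec).
loading_times = {
    ("5000 x 5000", "v1.18"): "4:51",
    ("5000 x 5000", "v1.20"): "5:09",
    ("5000 x 5000", "v2.0"): "0:55",
    ("10,000 x 10,000", "v2.0"): "4:21",
}

# Number of data points per dataset, as reported in the table.
data_points = {"5000 x 5000": 12.6e6, "10,000 x 10,000": 50.4e6}


def to_seconds(mmss: str) -> int:
    """Convert a 'min:sec' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# Speed-up of v2.0 over the older versions on the 5000 x 5000 dataset.
v2_seconds = to_seconds(loading_times[("5000 x 5000", "v2.0")])
speedups = {}
for version in ("v1.18", "v1.20"):
    old_seconds = to_seconds(loading_times[("5000 x 5000", version)])
    speedups[version] = old_seconds / v2_seconds
    print(f"v2.0 vs {version}: {speedups[version]:.1f}x faster loading")

# Loading throughput of v2.0 (data points per second) for both dataset sizes.
throughput = {}
for dataset, points in data_points.items():
    seconds = to_seconds(loading_times[(dataset, "v2.0")])
    throughput[dataset] = points / seconds
    print(f"v2.0 on {dataset}: {throughput[dataset]:,.0f} data points/s")
```

With the values in the table, v2.0 loads the 5000 x 5000 dataset roughly five to six times faster than the older versions, and its throughput on the 10,000 x 10,000 dataset remains of the same order of magnitude as on the smaller one.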
When comparing the runtime of the three versions on two real datasets, PhyloProfile v2.0 showed significant speed-ups:
- For the Aci dataset, v2.0 achieved a 4-5 times speed-up
- For the pPCD dataset, v2.0 achieved a more than 2-fold speed-up compared to v1.20 and v1.18
pPCD: 235 potential plant cell wall degradation proteins across ~18,000 taxa from 3 domains of the tree of life (ref.)
Aci: 3676 genes of Acinetobacter baumannii ATCC 19606 strain across 881 taxa in the Acinetobacter genus
PhyloProfile v2.0 can now generate a plot for a matrix of 7.5 million cells in the same time that v1.0.0 required to process 800,000 cells. Note that this is not a like-for-like comparison, since the new benchmark was run on a different computer (MacBook Air M1, 16 GB RAM).
We checked the performance of PhyloProfile with increasing data size.
In brief, both the time required for importing and plotting the full data (Figure 1) and the RAM usage (Figure 2) scale linearly with the size of the data matrix. Plotting the first 30 genes (default setting; cf. point 2 below) is independent of the data size. The phylogenetic profile of a moderately sized data set comprising 200 genes and 200 species (40,000 cells) takes about 10 seconds to display, both in the standalone version and in the online version.
In detail, we assessed the performance of a locally installed version of PhyloProfile on a MacBook Pro (Core i7 CPU, 2.8 GHz, 8 GB RAM). The test data were the phylogenetic profiles of 1,605 microsporidian proteins across 489 species. The full data matrix comprises 784,845 cells. It takes about 70 seconds to load the data and about 180 seconds to plot the entire matrix. We then reduced the data matrix stepwise by considering either fewer genes (Fig. 1a) or fewer taxa (Fig. 1b), and measured the time to upload and plot the data.
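The linear scaling described above can be turned into a back-of-the-envelope estimate for smaller matrices. The sketch below assumes strictly proportional scaling (a straight line through the origin), which is a simplification of the reported linear trend and ignores any fixed overhead. The function name `estimate_seconds` is hypothetical, and the constants are the measurements quoted in the text for the older benchmark machine.

```python
# Reference measurement from the text: 1,605 proteins x 489 species.
FULL_GENES = 1605
FULL_TAXA = 489
FULL_CELLS = FULL_GENES * FULL_TAXA  # 784,845 cells
LOAD_SECONDS = 70    # approximate time to load the full matrix
PLOT_SECONDS = 180   # approximate time to plot the full matrix


def estimate_seconds(n_genes: int, n_taxa: int) -> dict:
    """Estimate loading and plotting time for an n_genes x n_taxa matrix.

    Assumption: runtime is proportional to the number of cells, i.e. the
    linear trend is extrapolated through the origin.
    """
    cells = n_genes * n_taxa
    fraction = cells / FULL_CELLS
    return {
        "cells": cells,
        "load": LOAD_SECONDS * fraction,
        "plot": PLOT_SECONDS * fraction,
    }


estimate = estimate_seconds(200, 200)
print(f"{estimate['cells']:,} cells: "
      f"load ~{estimate['load']:.1f} s, plot ~{estimate['plot']:.1f} s")
```

For a 200 x 200 matrix (40,000 cells) this gives roughly 3.6 seconds for loading and 9.2 seconds for plotting, which is in line with the reported display time of about 10 seconds for a data set of that size.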
Figure 1. The running time of PhyloProfile for uploading (yellow) and plotting phylogenetic profiles of all (green) or the first 30 genes (red) scales linearly with data size. (a) Running time as a function of number of genes analyzed. (b) Running time as a function of number of taxa analyzed.
The results indicate that PhyloProfile facilitates a reasonably quick interactive exploration of data sets comprising up to a few hundred genes and taxa. We trust that this will be sufficient for the vast majority of applications, as we expect that a typical user will be interested in exploring phylogenetic profiles of gene sets representing, e.g., one or a few KEGG pathways. However, the analysis of substantially larger data sets is also possible, and the option to extract subsets of interest via the customized profile option allows users to streamline and speed up the analysis.
Figure 2. RAM usage during data display increases linearly as the data matrix grows. (a) RAM usage as a function of number of genes analyzed, and (b) as a function of the number of taxa analyzed.
The online version of PhyloProfile currently runs on the shinyapps.io webserver, which is provided as a service to the community by RStudio Inc. The performance of the online version is comparable to that of the standalone version with respect to the speed of data upload and plotting of the profiles. However, we would like to emphasize that the online version is meant for small to moderately sized analyses. Still, we could upload and plot data up to a matrix size of 200,000 cells. With larger data sets, the server starts disconnecting. For regular use of PhyloProfile with larger data sets, we encourage users to download and install PhyloProfile locally, which is straightforward even for uninitiated users, and we provide an installation guide online. We are moving the online version of PhyloProfile to our internal webserver in order to overcome the limitations of the shinyapps.io server. A new performance test will be done in the near future.