GPU accelerated comparative phylogenetic analysis with uncertainty

Background

Phylogenetic comparative analysis usually rely on a single consensus phylogenetic tree in order to study evolutionary processes. However, most phylogenetic trees, which may include those based on molecular data, are incomplete with regard to species sampling, as usually only a few species from major lineages are phylogenetically studied. Incomplete sampling may critically compromise macroecological analysis, since statistical inference and biological interpretation may be seriously biased. Recent approaches have been proposed to integrate non molecular phylogenetic information into incomplete molecular phylogenies. These approaches involve a simulation procedure that requires the generation and processing of a large number of randomly generated trees to estimate variance due to phylogenetic uncertainty. Because of the computational burden of generating and processing hundreds of thousands of trees, unless this procedure is efficiently implemented, the analysis are of limited applicability. With that in mind a highly parallel and efficient implementation was developed for randomly generating and processing phylogenetic trees, which can be found on our GitHub repository. By using the power of GPU computing and a massive number of threads we are able to achieve performance gains up to 74x when compared to the fastest known sequential implementation of the same procedures. Innovative data structures and algorithms were employed so as to efficiently process irregular pointer-based data structures such as trees. Moreover, the implementation makes intensive use of bitmaps to efficiently encode paths to the tree nodes, and optimize memory transactions by working with data structures that favors coalesced memory accesses. Our results open up the possibility of accounting for phylogenetic uncertainty in evolutionary and ecological comparative analysis of large datasets.

Nevertheless, there is a drawback. Despite the efforts to makes an efficient implementation, our standalone application has a limited usability as is has no GUI and has to be manually compiled. Knowing that the biological community make an extensive use of the R platform, it will be handy if the functionalities of our application could be used within a convenient environment like R and, at the same time, making use of all potential that GPUs can deliver.

Related work

Our implementation is the parallel version of a software called SUNPLIN, thus the goal of both versions is much the same: creating large number of randomly generated trees and computing a patristic distance matrix for each generated tree. As can be seen in this paper, benchmarks were conducted to compare SUNPLIN efficiency with other well known tools, for instance: AddTips (R script), cophenetic.phylo (function of R package ape) and Phylocom (a well-known open source software written in C). The results show that nowadays the original version of SUNPLIN is the fastest known sequential tool to perform both operations. In other hand, the parallel version of it is able to speedup the execution by up to 74 times in comparison to the original (sequential) version of SUNPLIN, according to the first benchmarks.

Details of the coding project

Two of the three main functionalities the package shall provide are already developed at the CUDA-application side. They are:

Given a newick phylogenetic tree and another file containing a list of PUTs (Phylogenetic Uncertainty Taxa) and their MDCCs (Most Derived Consensus Clade), generate many versions of the original tree by inserting the taxa from the list randomly within each replica of original tree. More details here.
Given a set of phylogenetic trees, calculate the phylogenetic (pairwise) distance between all pair of species of the trees storing the distances into matrices.

Observation: one can note that in order to perform the whole simulation, the output of the first function must be applied to the second so that, as a suggestion, the entire operation could be encapsulated within another function.

CUDA kernels for both the above operations are already implemented and properly running within NVIDIA graphic cards whereas the R-side code must be developed to wrap those calls to GPU through dynamic loading of shared objects. Thus, beside the R-side code, some programming effort has to be done in order to make these functionalities of SUNPLIN be able to run inside shared objects.

The third main function is the calculation of Moran’s autocorrelation coefficient (I-Moran) to quantify whether the distribution of a trait among a set of species is affected or not by their phylogenetic relationships. To perform this operation the function have to receive as input a vector of pairwise distance matrices and a list of traits. The output could be the vector holding all the trees’ I-Moran or the mean of all of it.

This extremely useful operation haven’t been yet developed to run on GPUs. Therefore, not only the R-side wrappers but the CUDA kernel to perform this operation needs both to be developed and extensively tested.

As one could guess, as a second suggestion, another R-side wrapper can be made to accomplish the entire process, from the trees’ expansion to the calculation of Moran’s autocorrelation coefficients, one linked to another.

CRAN compliant documentation for all the functions must obviously be done.

An extra activity that can be carried out is benchmarking the solution proposed against other similar solutions.

Expected impact

Given the wide adoption of the R package by the biology community, our implementations allow one to take uncertainty into account in their analysis, and seamlessly use a number of additional statistical analysis available in R. Overall, our results showed that the proposed algorithms and implementations can play an important role in helping biologists conduct their comparative phylogenetic simulations with uncertainty.

Mentors

Erin Hodgess <erinm.hodgess@gmail.com> is the author of R packages RcmdrPlugin_epack and RcmdrPlugin_doex. Her current areas of research are geostatistics, and high performance computing, using R, Fortran, MPI, and OpenACC.

Thiago Rangel <thiago.rangel@gmail.com> is author of many software programs for ecological analysis, such as SAM (Spatial Analysis in Macroecology) and PAM (Phylogenetic Analysis in Macroecology). His areas of research are macroecology, geographical ecology, biogeography, as well as diversity of human languages. He actively programs in Object Pascal, R, Fortran, MPI, C and C++. He also teaches an introductory course in Programming Logic and Algorithms for graduate students in Ecology and Evolutionary Biology.

Provide feedback

Saved searches