Logging: new optimisations and 4-way results #330
Overview
Going further with another optimisation (from leoisl@72cd6a0 to leoisl@1c53eb7): we now sort minimizer hits not by their location in the PRG string (which is quite a heavy object) but by their kmer node id, which corresponds to the order of the node in the minimizer DAG. Algorithm-wise, when we map minimizers from reads to PRGs, we need to sort the hits. The sort by location in the PRG string only plays a role in one specific case: when a read minimizer maps to a graph in which that minimizer is duplicated in two or more places. In that case we used to sort these hits further by their location in the PRG string; now they are sorted by the order of the minimizer in the DAG. The two sorts are somewhat related, as minimizers that occur earlier in the PRG string have lower ids in the minimizer DAG. My personal opinion is that this should not change the results much, and the following 4-way results confirm this. RAM improvement is good: 2.4x less RAM than the previous optimisation (
Details
Detailed 4-way results follow, only for filtered data. Here we compare the latest release (0.10.0-alpha.0), the version described in the previous post (
Illumina data
The most improved version,
Nanopore data
The curves for both improved versions, |
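The change of sort key described in the overview above can be illustrated with a small Python sketch. This is not pandora's code (pandora is written in C++): the `MinimizerHit` class, its field names, and the exact order of the sort keys are all assumptions made for illustration, and the PRG location is reduced to a single integer rather than the heavier path object mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MinimizerHit:
    """Hypothetical, simplified stand-in for a minimizer hit."""
    read_id: int
    read_start: int    # position of the minimizer on the read
    prg_id: int        # which PRG (graph) the minimizer was found in
    prg_start: int     # simplified PRG location (the heavy key in reality)
    kmer_node_id: int  # id of the node in the minimizer DAG (a plain integer)


def sort_by_prg_location(hits):
    # Old behaviour (assumed key order): ties between hits of the same
    # read minimizer in the same PRG are broken by PRG location.
    return sorted(hits, key=lambda h: (h.read_id, h.prg_id, h.read_start, h.prg_start))


def sort_by_kmer_node_id(hits):
    # New behaviour (assumed key order): the same ties are broken by the
    # kmer node id, so the PRG location is no longer needed for sorting.
    return sorted(hits, key=lambda h: (h.read_id, h.prg_id, h.read_start, h.kmer_node_id))


# One read minimizer (read_start=10) is duplicated at two places in PRG 3.
# Node ids grow with PRG location, as the overview states.
hits = [
    MinimizerHit(read_id=0, read_start=10, prg_id=3, prg_start=120, kmer_node_id=7),
    MinimizerHit(read_id=0, read_start=10, prg_id=3, prg_start=40, kmer_node_id=2),
    MinimizerHit(read_id=0, read_start=5, prg_id=3, prg_start=15, kmer_node_id=1),
]

same_order = sort_by_prg_location(hits) == sort_by_kmer_node_id(hits)
```

In this toy input both keys give the same ordering, because node ids increase with PRG location. Whether that always holds in pandora is not something this sketch can show; the 4-way results are the actual check.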
The previous post shows that the improvements we've made do not introduce bugs into pandora, so we can merge. The merge will consist of 5 PRs (the 1st one is large, the others are small increments):
|
I've also removed RAM values from the previous posts, and I am gathering benchmarking data on how all these improvements reduced RAM. I will update this issue as soon as I get all benchmarking data. |
RAM and runtime improvements
History of RAM and runtime improvements for the new version of pandora that will be merged in the next PRs. These benchmarks were done running
Thus, once all merges are finished, we will have a version that requires 95% less RAM than the current release (~20x improvement in RAM usage) and runs 15.5 times faster than the current release.
Details
LSF logs follow:
|
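For readers comparing the figures above: "95% less RAM" and "~20x improvement" are two ways of writing the same ratio. A minimal Python sketch of the conversion follows; the function names are made up for illustration and only the numbers quoted in this thread are used.

```python
def fold_to_percent_reduction(fold):
    """Convert an 'N times less' factor into a percentage reduction."""
    return (1.0 - 1.0 / fold) * 100.0


def percent_reduction_to_fold(percent):
    """Convert a percentage reduction into an 'N times less' factor."""
    return 1.0 / (1.0 - percent / 100.0)


# Figures quoted in this thread.
ram_fold = percent_reduction_to_fold(95.0)          # 95% less RAM, about 20x
ram_step_percent = fold_to_percent_reduction(2.4)   # 2.4x less RAM, as a percentage
runtime_percent = fold_to_percent_reduction(15.5)   # 15.5x faster, as a runtime reduction

print(round(ram_fold, 1), round(ram_step_percent, 1), round(runtime_percent, 1))
```

The same conversion can be applied to any of the per-step figures once the full benchmarking data is posted.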
bloody hell @leoisl |
for future readers, RH=roundhound. |
FAR OUT 🔥 |
These results are unbelievable!! Amazing |
Description
This is just a log for the upcoming PR, which will have 2 major changes in pandora:
And also minor changes:
Results
The major changes should not impact results, as they are just RAM improvements. The multimapping improvement should change the results slightly, but hopefully for the better. To check whether any breaking bug was introduced, we ran this version against the most recent prerelease on the 4-way pipeline. In general, the new version has slightly better precision without denovo, and the old version has slightly better precision with denovo. The differences are small, however. RAM improvements are massive and will be detailed in a later post. They will allow pandora to be run with far fewer computational resources, and will also speed up the next feature, which is running it on the cluster on hundreds of samples, so I think it is worth merging these improvements.
Details
Detailed 4-way results follow
Illumina filtered:
Illumina unfiltered:
Nanopore filtered:
Nanopore unfiltered: