
Logging: new optimisations and 4-way results #330

Closed
leoisl opened this issue Jun 12, 2023 · 9 comments
@leoisl (Collaborator) commented Jun 12, 2023

Description

This is a log for the upcoming PR, which will bring two major changes to pandora:

  1. Lazy loading of PRGs (improves RAM significantly in the plasmid/roundhound use case);
  2. Do not keep per-read data (e.g. all minimiser hits a read has, all PRGs it maps to, etc.); instead, process this data as early as possible and release the memory. This improves RAM significantly in all cases.
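The first change (lazy loading) can be sketched as follows. This is a minimal illustration of the idea, not pandora's actual code; the class and field names here are hypothetical. Instead of parsing every PRG at startup, each PRG is parsed on first access and cached:

```python
# Illustrative sketch of lazy PRG loading (hypothetical names, not
# pandora's actual API): parse each PRG on first access and cache it,
# rather than parsing all PRGs up front.
class LazyPRGStore:
    def __init__(self, raw_records):
        # raw_records: dict of PRG name -> unparsed PRG string
        self._raw = raw_records
        self._parsed = {}

    def get(self, name):
        # Parse on first access only; later calls hit the cache.
        if name not in self._parsed:
            self._parsed[name] = self._parse(self._raw[name])
        return self._parsed[name]

    @staticmethod
    def _parse(raw):
        # Stand-in for the real (expensive) PRG parsing step.
        return {"sequence": raw, "length": len(raw)}

store = LazyPRGStore({"prg1": "ACGT", "prg2": "GGCC"})
print(store.get("prg1")["length"])  # only prg1 has been parsed so far
```

With ~1M PRGs of which only a fraction are touched by a given sample, paying the parsing cost (and holding the parsed object) only for accessed PRGs is what drives the RAM saving in the plasmid/roundhound use case.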

And also minor changes:

  1. Code cleanup, removing unused gene DBG and noise filtering modules;
  2. If a read's best mapping ties across several graphs, we now choose one of them at random (previously the choice was deterministic, so a single graph would receive all such mappings).
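The random tie-breaking in minor change 2 amounts to the following. This is a hedged sketch of the rule, not pandora's implementation; `choose_best_graph` and the score dictionary are invented for illustration:

```python
import random

# Sketch of the new tie-breaking rule: if several graphs share the best
# mapping score for a read, pick one uniformly at random instead of
# always taking the first (which would funnel all ties to one graph).
def choose_best_graph(scores, rng=random):
    # scores: dict of graph name -> mapping score for one read
    best = max(scores.values())
    tied = [g for g, s in scores.items() if s == best]
    return rng.choice(tied)

# Graphs "a" and "b" tie at the top score; "c" is never chosen.
scores = {"a": 10, "b": 10, "c": 7}
assert choose_best_graph(scores) in {"a", "b"}
```

Under the old deterministic rule, graph "a" would receive every tied read; the random choice spreads tied reads across "a" and "b", which is why results change slightly.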

Results

The major changes should not impact results, as they are pure RAM improvements. The multimapping change should alter results slightly, hopefully for the better. To check that no breaking bug was introduced, we ran this version against the most up-to-date prerelease on the 4-way pipeline. In general, the new version is slightly better precision-wise without denovo, and the old version is slightly better precision-wise with denovo; the differences are, however, small. The RAM improvements are massive and will be detailed in a later post. They will let pandora run with far less computational resources, and will also speed up the next feature (running it on the cluster on hundreds of samples), so I think these improvements are worth merging.

Details

Detailed 4-way results follow

Illumina filtered:

  • These are the most important Illumina results, with filters cov 5, strand bias 0.05, gaps 0.8 (the filters we used in the paper);
  • Without denovo, the new version is slightly better precision-wise;
  • With denovo, both versions are equivalent;

[figure: Illumina filtered 4-way results]

Illumina unfiltered:

  • Shown here just for completeness. We should always apply lenient filters to pandora output...
  • Without denovo, the new version is better for higher GT conf threshold. When GT conf is lowered, both versions converge to the same results;
  • With denovo, v0.10.0-alpha.0 is slightly better precision-wise;

[figure: Illumina unfiltered 4-way results]

Nanopore filtered:

  • Without denovo, the new version is slightly better precision-wise;
  • With denovo, v0.10.0-alpha.0 is slightly better precision-wise;

[figure: Nanopore filtered 4-way results]

Nanopore unfiltered:

  • v0.10.0-alpha.0 is slightly better precision-wise with and without denovo;
[figure: Nanopore unfiltered 4-way results]
@leoisl (Collaborator, Author) commented Jun 14, 2023

Overview

Going further with another optimisation (from leoisl@72cd6a0 to leoisl@1c53eb7): we now sort minimiser hits not by their location in the PRG string (quite a heavy object) but by their kmer node id, which corresponds to the order of the node in the minimizer DAG. Algorithm-wise, when we map minimizers from reads to PRGs, we need to sort the hits. The sort by location in the PRG string only plays a role in one specific case: when a read minimizer maps to a graph that has that minimizer duplicated in two or more places. Previously we would break such ties by location in the PRG string; now we break them by the order of the minimizer in the DAG. The two sorts are in fact related, as minimizers that occur earlier in the PRG string have lower ids in the minimizer DAG. My expectation was that this should not change the results much, and the following 4-way results confirm it. The RAM improvement is good: 2.4x less RAM than the previous optimisation (b19d26), allowing us to run roundhound with <10 GB. RAM improvements will be detailed in a future post; I am still gathering benchmarks.
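The key-swap described above can be sketched like this. The `Hit` fields here are hypothetical stand-ins for pandora's internal hit structure; the point is only that the tie-break key changes from PRG-string position to DAG node id, so the positions can be dropped from the index:

```python
from collections import namedtuple

# Hypothetical hit record: read position, position in the PRG string,
# and the kmer node id (the node's order in the minimizer DAG).
Hit = namedtuple("Hit", ["read_pos", "prg_string_pos", "kmer_node_id"])

hits = [
    Hit(read_pos=5, prg_string_pos=120, kmer_node_id=9),
    Hit(read_pos=5, prg_string_pos=40, kmer_node_id=3),
    Hit(read_pos=2, prg_string_pos=80, kmer_node_id=6),
]

# Old sort: tie-break duplicated hits by position in the PRG string.
old_order = sorted(hits, key=lambda h: (h.read_pos, h.prg_string_pos))
# New sort: tie-break by node id in the minimizer DAG instead.
new_order = sorted(hits, key=lambda h: (h.read_pos, h.kmer_node_id))

# Because minimizers earlier in the PRG string have lower DAG node ids,
# the two orders coincide, as they do here.
assert old_order == new_order
```

Since the two keys are correlated in this way, dropping the PRG-string positions shrinks the index without materially changing the hit ordering, which is consistent with the overlapping 4-way curves below.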

Details

Detailed 4-way results follow, only for filtered data. Here we compare the latest release (0.10.0-alpha.0), the version described in the previous post (b19d26, with lazy loading and read data optimisation), and this version under study (1c53eb, which adds an index optimisation):

Illumina data

The most improved version, 1c53eb, actually slightly improves precision for the illumina results, both with and without denovo.

image

Nanopore data

The curves for both improved versions, b19d26 and 1c53eb, basically overlap, which means that the index optimisation done in 1c53eb does not introduce any bugs:

image

@leoisl (Collaborator, Author) commented Jun 14, 2023

The previous post shows that the improvements we've made do not introduce bugs to pandora, so we can merge. The merge will consist of 5 PRs (the 1st one is large, the others are small increments):

  1. Lazy loading and random multimapping: adds the lazy loading feature and randomly assigns reads when their best mappings are to two or more genes;
  2. No coverage filtering: removes the hard-coded coverage filtering in pandora;
  3. Read info optimisation: does not keep heavy mapping info for each mapped read; processes this info and releases the memory as soon as possible;
  4. Index optimisation: removes from the index where each minimizer appears in each PRG string; sorts duplicated minimizer matches using the node id in the minimizer DAG;
  5. Miscellaneous: small changes (code cleanup, refactoring, formatting, etc.) and preparing the code for the next release.

@leoisl (Collaborator, Author) commented Jun 14, 2023

I've also removed RAM values from the previous posts, and I am gathering benchmarking data on how all these improvements reduced RAM. I will update this issue as soon as I get all benchmarking data.

@leoisl (Collaborator, Author) commented Jun 14, 2023

RAM and runtime improvements

History of RAM and runtime improvements for the new version of pandora that will be merged in the next PRs. These benchmarks were done by running pandora compare with the RH plasmid DB (~1M PRGs) and the ESBL sample SRR16977031:

  1. v0.10.0-alpha.0 (baseline, current release)
    RAM usage: 178.1 GB
    Runtime: 130 minutes

  2. commit a76df4 (only lazy loading added - this is the version we've been using in RH, unreleased):
    RAM usage: 124.5 GB (30% less RAM than baseline)
    Runtime: 31.8 minutes (4 times faster than baseline)

  3. commit b19d26 (lazy loading + read info optimisation, unreleased):
    RAM usage: 22.1 GB (88% less RAM than baseline)
    Runtime: 13 minutes (10 times faster than baseline)

  4. commit 1c53eb (lazy loading + read info optimisation + index optimisation, unreleased):
    RAM usage: 9.1 GB (95% less RAM than baseline)
    Runtime: 8.35 minutes (15.5 times faster than baseline)

Thus, once all merges are finished, we will have a version that requires 95% less RAM than the current release (~20x improvement in RAM usage) and runs 15.5 times faster than the current release.
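As a quick sanity check, the headline figures are internally consistent with the raw numbers in the list above:

```python
# Re-derive the summary figures from the benchmark numbers above.
baseline_ram, final_ram = 178.1, 9.1      # GB (v0.10.0-alpha.0 vs 1c53eb)
baseline_time, final_time = 130.0, 8.35   # minutes

ram_reduction = 1 - final_ram / baseline_ram   # fraction of RAM saved
ram_factor = baseline_ram / final_ram          # RAM improvement factor
speedup = baseline_time / final_time           # runtime improvement factor

print(f"{ram_reduction:.0%} less RAM")   # ~95% less RAM
print(f"{ram_factor:.1f}x RAM factor")   # ~19.6x, i.e. the ~20x quoted
print(f"{speedup:.1f}x faster")          # ~15.6x, matching the ~15.5x quoted
```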

Details

LSF logs follow:

Pandora benchmarking:

1c53eb (lazy loading + reads optimisation + paths optimisation):
Resource usage summary:
    CPU time :                                   4764.10 sec.
    Max Memory :                                 9345 MB
    Average Memory :                             7826.94 MB
    Total Requested Memory :                     80000.00 MB
    Delta Memory :                               70655.00 MB
    Max Swap :                                   -
    Max Processes :                              4
    Max Threads :                                20
    Run time :                                   501 sec.
    Turnaround time :                            511 sec.


b19d26 (lazy loading + reads optimisation):
Resource usage summary:
    CPU time :                                   4771.61 sec.
    Max Memory :                                 22644 MB
    Average Memory :                             19645.91 MB
    Total Requested Memory :                     80000.00 MB
    Delta Memory :                               57356.00 MB
    Max Swap :                                   -
    Max Processes :                              4
    Max Threads :                                20
    Run time :                                   781 sec.
    Turnaround time :                            854 sec.


a76df4 (only lazy loading - version we've been using in RH):
Resource usage summary:
    CPU time :                                   13056.05 sec.
    Max Memory :                                 127450 MB
    Average Memory :                             97789.90 MB
    Total Requested Memory :                     150000.00 MB
    Delta Memory :                               22550.00 MB
    Max Swap :                                   -
    Max Processes :                              4
    Max Threads :                                20
    Run time :                                   1909 sec.
    Turnaround time :                            1911 sec.


v0.10.0-alpha.0 (baseline):
Resource usage summary:
    CPU time :                                   26317.66 sec.
    Max Memory :                                 182410 MB
    Average Memory :                             84832.14 MB
    Total Requested Memory :                     1024000.00 MB
    Delta Memory :                               841590.00 MB
    Max Swap :                                   -
    Max Processes :                              4
    Max Threads :                                20
    Run time :                                   7773 sec.
    Turnaround time :                            7782 sec.

@iqbal-lab (Collaborator) commented:

bloody hell @leoisl

@iqbal-lab (Collaborator) commented:

for future readers, RH=roundhound.

@mbhall88 (Member) commented:

FAR OUT 🔥

@rmcolq (Collaborator) commented Jun 15, 2023

These results are unbelievable!! Amazing

@leoisl (Collaborator, Author) commented Aug 17, 2023

Closed via #331, #337, #342 and #345

leoisl closed this as completed Aug 17, 2023