Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Index optimisation #342

Merged
merged 26 commits into from
Jul 20, 2023
Merged

Index optimisation #342

merged 26 commits into from
Jul 20, 2023

Conversation

leoisl
Copy link
Collaborator

@leoisl leoisl commented Jul 19, 2023

The main feature in this PR is removing the prg::Path path attribute from the MiniRecord struct. The prg::Path represents the path of nodes the minimiser stems from in the linear string representation of the PRG. It is used to sort minimiser hits when a minimiser from a read multimaps to several places in a graph. However, we don't take into account the places the minimiser was hit in the graph when defining and filtering clusters of hits. Therefore, the usefulness of this attribute is limited. Furthermore, a good approximation to this attribute is to instead use the kmer_node_id in the Kmer Graph of the PRG, as prg::Paths that appear earlier in the PRG string will likely have lower ids as the ids correspond to the topological sort order.

Thus, I don't expect much difference in output/results when replacing prg::Path by kmer_node_id in MiniRecord. This was also the observation when running pandora on the 4-way data (see #330 (comment)). Illumina results were actually slightly better with this change, while nanopore results remained the same. The motivation for this change is that it allows us to significantly reduce RAM usage and also improves runtime (e.g.: improves RAM usage by 60% and runtime by 35% in the roundhound use case for a sample).

This is the final optimisation PR regarding roundhound use cases. Compared with the current pre-release, this is the benchmark:
pandora compare with the RH plasmid DB (~1M PRGs) and the ESBL sample SRR16977031:

  • v0.10.0-alpha.0 (baseline, current release)
    RAM usage: 178.1 GB
    Runtime: 130 minutes

  • All the recent PRs (lazy loading + read info optimisation + index optimisation, unreleased):
    RAM usage: 9.1 GB (95% less RAM than baseline)
    Runtime: 8.35 minutes (15.5 times faster than baseline)

This PR also features some very few minor stuff, e.g.:

  • passing some shared pointers by reference instead of copy;
  • not changing error rate if user gave bad params (e.g. --illumina and high error rate);

There will still be a final PR after this one doing several very minor fixes, and then finally a last one to prepare the pre-release.

Copy link
Collaborator

@iqbal-lab iqbal-lab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see anything in here that worries me, and the test on the 4-way is also convincing

@leoisl leoisl merged commit fde2ea4 into iqbal-lab-org:dev Jul 20, 2023
1 check passed
@leoisl leoisl deleted the index_optimisation branch November 2, 2023 10:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants