Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sourmash prefetch is ~10 times slower than sourmash gather?! #2124

Closed
ctb opened this issue Jul 18, 2022 · 4 comments · Fixed by #2132
Closed

sourmash prefetch is ~10 times slower than sourmash gather?! #2124

ctb opened this issue Jul 18, 2022 · 4 comments · Fixed by #2132

Comments

@ctb
Copy link
Contributor

ctb commented Jul 18, 2022

following the work done on #1771 with some speed fixes merged in #2123, I decided to run sourmash prefetch with the same query/database:

sudo time py-spy record -o latest.svg -- sourmash prefetch SRR10988543.10k.zip bins.10k.zip >& prefetch.out

and it took 600 seconds!

Not 100% sure what's going on here but it might be related to some of the ANI work that @bluegenes has been doing.

Flamegraph:

Screen Shot 2022-07-18 at 10 04 16 AM

@bluegenes
Copy link
Contributor

bluegenes commented Jul 18, 2022

hmmm -- I was a bit worried about add'l calculations when adding ANI stuff.

Now that we're outputting all the information we need for ANI estimation (ksize, scaled, n_dataset_hashes, etc) in the prefetch csv file, we could actually skip ANI estimation entirely during prefetch

We could:

  1. only report ANI for gather
  2. separately annotate a prefetch or gather csv file with ANI results, if desired (akin to running sourmash tax annotate on a gather file, this would just add ANI estimates).

@ctb
Copy link
Contributor Author

ctb commented Jul 18, 2022 via email

@bluegenes
Copy link
Contributor

bluegenes commented Jul 18, 2022

👍🏻!

I think for most normal use cases (prefetch --> gather), folks might not want ANI results/ might only care about ANI to gather results. So the speedup might be worth removing ANI from prefetch (unless someone can just make it faster, ofc). I implemented it this way bc I've been using multiple prefetch calls to get all-by-all comparisons, and originally did not have sufficient info in the prefetch csv.

Removing ANI from prefetch is straightforward (for me/you for future ref):

@ctb
Copy link
Contributor Author

ctb commented Jul 20, 2022

#2132 seems to fix this, or at least speed it up by a lot...

@ctb ctb closed this as completed in #2132 Jul 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants