[MRG] Rework the `find` functionality for `Index` classes #1392
Codecov Report
@@ Coverage Diff @@
## latest #1392 +/- ##
==========================================
+ Coverage 89.58% 89.71% +0.13%
==========================================
Files 122 123 +1
Lines 18989 19464 +475
Branches 1455 1483 +28
==========================================
+ Hits 17011 17463 +452
- Misses 1750 1775 +25
+ Partials 228 226 -2
OK, I think this is done, provisionally.
yeesh! that's not good! I wonder if
(there's no obvious algorithmic reason for the time or memory to go up)
All set calculations are done in Python now. While not algorithmically different, it is a lot of memory copying to pull hashes out of Rust, create sets in Python, and then calculate...
ahh! good point! and that's pretty easy to fix with
(great job on finding that, I struggled to get that code (a) working and then (b) clean and (c) tested, so now it's time for (d) optimization 😂)
#1474 provides a
note to self: @bluegenes and I want to add a this would enable #985 and #849 more generically. Edit: added in #1477
* add MinHash.intersection method
* rearrange order of intersection
* swizzle SBT search code over to using Rust-based intersection code, too
* add intersection_and_union_size method to MinHash
* make flatten a no-op if track_abundance=False
* intersection_union_size in the FFI

Co-authored-by: Luiz Irber <luiz.irber@gmail.com>
This is already a large PR, and there are PRs merging more code into this, so I vote for merging now and rebasing the other ones.
Overall a nice cleanup of the codebase, no obvious performance regressions, and I think it mostly fits with future indices written in Rust.
"They who control
`find` control the universe."

This PR implements a common `Index.find` generator function for Jaccard similarity and containment on `Index` classes.

This new generator function takes a `JaccardSearch` object and a query signature as inputs, and yields all signatures that meet the criteria.

The `JaccardSearch` object in `src/sourmash/search.py` is the workhorse for `find`; it scores potential matches (and can truncate searches) based on `query_size`, `intersection`, `subject_size`, and `union`, which is sufficient to calculate similarity, containment, and max_containment.

This provides some nice simplifications -

* `Index.search` and `Index.gather` can be implemented generically on top of `find`, and are now methods on the base `Index` class;
* `prefetch` functionality on top of all of our `Index` subclasses (SBT and LCA as well), ref [MRG] refactor `gather` functionality for speed & modularity; provide `prefetch` functionality. #1370 and [EXP] add a `prefetch` linear search function to `Index` #1371 and other makeshift strategies for large scale database search - the "greyhound" issue #1226;
* `sbtmh.py` is no longer needed;

Moreover,

The overall result is a pretty good code consolidation and simplification.
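For readers skimming the description, here is a rough sketch of the idea in plain Python. It is not the sourmash implementation: `ToySearch`, `find`, `search`, and the use of bare Python sets in place of signatures are all hypothetical stand-ins, chosen only to show how the four quantities (`query_size`, `intersection`, `subject_size`, `union`) are enough to score similarity, containment, and max containment, and how a generic `search` can sit on top of a `find` generator.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Set, Tuple


def similarity(query_size: int, intersection: int, subject_size: int, union: int) -> float:
    # Jaccard similarity: |A & B| / |A | B|
    return intersection / union if union else 0.0


def containment(query_size: int, intersection: int, subject_size: int, union: int) -> float:
    # Fraction of the query found in the subject: |A & B| / |A|
    return intersection / query_size if query_size else 0.0


def max_containment(query_size: int, intersection: int, subject_size: int, union: int) -> float:
    # Containment relative to the smaller of the two sets
    smaller = min(query_size, subject_size)
    return intersection / smaller if smaller else 0.0


@dataclass
class ToySearch:
    """Hypothetical stand-in for a search object: a score function plus a threshold."""
    score_fn: Callable[[int, int, int, int], float]
    threshold: float = 0.0

    def passes(self, score: float) -> bool:
        # Keep only non-zero scores at or above the threshold
        return score > 0 and score >= self.threshold


def find(search_obj: ToySearch, query: Set[int],
         subjects: Iterable[Tuple[str, Set[int]]]) -> Iterator[Tuple[float, str]]:
    """Yield (score, name) for every subject that passes the search criteria."""
    for name, hashes in subjects:
        intersection = len(query & hashes)
        union = len(query | hashes)
        score = search_obj.score_fn(len(query), intersection, len(hashes), union)
        if search_obj.passes(score):
            yield score, name


def search(search_obj: ToySearch, query: Set[int],
           subjects: Iterable[Tuple[str, Set[int]]]) -> List[Tuple[float, str]]:
    """Generic search built on find: collect all matches, best score first."""
    return sorted(find(search_obj, query, subjects), key=lambda r: r[0], reverse=True)


if __name__ == "__main__":
    query = {1, 2, 3, 4}
    subjects = [("a", {1, 2, 3, 4, 5, 6, 7, 8}), ("b", {1, 2}), ("c", {9, 10})]
    for score, name in search(ToySearch(containment, threshold=0.1), query, subjects):
        print(name, score)
```

Swapping `containment` for `similarity` or `max_containment` changes the ranking without touching `find`, which is the kind of consolidation described above.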
This PR also:

* `sourmash search` functionality into flat queries (uses new `search_fn` functionality, works on SBT and LCA too) and abund queries (only works on `LinearIndex`)
* `sourmash categorize` to work with more than SBTs by using `load_file_as_index` for the database (Upgrade `sourmash categorize` to take LCA databases as well as SBT. #829)
* `location` property to `Index` classes (Request that `Index` provide a location? #1377)
* `namedtuple` for results from `Index.search` and `Index.gather` that improves code readability;

Specifically, this PR:
* Fixes #829 - `sourmash categorize` now takes all database types
* Fixes #1377 - provides a `location` property on `Index` objects
* Fixes #1389 - `--best-only` now works for similarity and containment
* Fixes #1454 - `sourmash index` now flattens SBT leaves.

Larger thoughts:
* In `sourmash index` does not flatten the signatures when building an SBT #1454 the question is asked, can we reliably do 'flat' Jaccard searches to discover good candidates for abundance matches (e.g. angular similarity)? Maybe worth creating a new issue.
* `Index.find` is probably the key thing to implement for `prefetch` (or at least that's how I reinterpreted his comment :).

Questions and comments for reviewers
* The `search` and `gather` methods in the `Index` class, and the `find` functions in the various `Index` subclasses, are really the key changes in this PR.
* `sourmash categorize` took advantage of this to return angular similarity calculations when the query signature had abundances.
* `sourmash index` does not flatten the signatures when building an SBT #1454.
* `--ignore-abundance` explicitly, which should steer people in the right direction and minimize the impact of this.
* `min_n_below`. This may be fixed by [WIP] Remove min_n_below from search code #1137, but I think it's out of scope for this PR :)

TODO items:
* `sbt.py` and `lca_db.py`
* `test_sbt.py` lines 419 and 458, why are they not run!?
* `test_search.py`
* `test_sbt_categorize_ignore_abundance_2`
* `IndexSearchResult` namedtuple attributes.
* `categorize` code
* `LoadSingleSignature` refactor LoadSingleSignatures? #1077
* `get_search_obj` function
* `find`