-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] SBT feature: combine trees #143
Conversation
Codecov Report
@@ Coverage Diff @@
## master #143 +/- ##
==========================================
+ Coverage 85.37% 85.52% +0.14%
==========================================
Files 13 13
Lines 1833 1879 +46
Branches 52 52
==========================================
+ Hits 1565 1607 +42
- Misses 259 263 +4
Partials 9 9
Continue to review full report at Codecov.
|
@viehwegerlib wanna take this branch for a field test with your SBTs? =] If your trees are from sourmash_lib import signature
from sourmash_lib.sbt import SBT
from sourmash_lib.sbtmh import SigLeaf, search_minhashes
sig = next(signature.load_signatures('sig_file.sig'))
tree_viral = SBT.load("viral.sbt.json", leaf_loader=SigLeaf.load)
tree_microbial = SBT.load("microbial.sbt.json", leaf_loader=SigLeaf.load)
results_viral = {str(s) for s in tree_viral.find(search_minhashes, sig, 0.1)}
results_microbial = {str(s) for s in tree_microbial.find(search_minhashes, sig, 0.1)}
new_tree = tree_viral.combine(tree_microbial)
# new_tree and tree_viral are the same thing, from this point on
results = {str(s) for s in new_tree.find(search_minhashes, sig, 0.1)}
assert results == (results_viral | results_microbial) |
pure awesome, will try right now |
All works well till I try to write the tree so disk. How is the .save() method used properly?
|
Oh |
It saves, nice. However, when sbt_searching in the combined tree (microbial + virus) for a signature that I indexed in the virus SBT, nothing similar is found, but when sbt_searching the viral SBT alone, the signature retrieves "itself" as the most similar hit. |
@viehwegerlib hmm. Can you share the trees with me somehow? I tried with viral refseq SBT + bacteria refseq SBT and it worked (virus sig found in both viral and combined tree) |
I also tried to create a chimera (copied 250 hashes from a virus, 250 from a bacteria) and got 3 matches (one viral, two bacterial). in the combined tree. |
I could send you the files via wetransfer.com, where do you want me to send them? |
All's well - that's what I get for coding with little coffee. I made a mistake in preparing the test files. The combined tree works perfect. Very cool, thanks a lot! |
ef93190
to
a99f44d
Compare
sourmash_lib/sbt.py
Outdated
# An internal node, we need to update the name | ||
new_node.name = "internal.{}".format(current_pos) | ||
new_nodes[current_pos] = new_node | ||
if tree.nodes[pos] is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not simple an else
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point! Thanks!
self.nodes[p.pos] = n | ||
c1 = self.children(p.pos)[0] | ||
self.nodes[c1.pos] = node | ||
node.update(n) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to add a test to get some coverage here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw this happen when using bigger SBTs, I'll try to cook up a test without putting too much data in the repo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update the branch, decide on the two comments, otherwise LGTM
a99f44d
to
819a57e
Compare
This looks ready to merge @luizirber - shall I push the magic button? |
Whups, update from master is nontrivial... |
819a57e
to
f16930a
Compare
Fixes #141
The initial implementation just went thru the leaves in another tree and add to the original one. This is not so efficient because we recreate internal nodes (a lot).
A better solution is to create a new root, make each original tree a child of the new root, and reuse all the internal nodes already calculated. This is implemented now.
Possible improvements:
A more powerful solution would be to do a depth-first search in the original trees and replicate what is found in the new one, this would also support better case
1
.For
2
and3
, we only need that because the currentadd_node
implementation expect the tree to be filled in order, without gaps. Other methods for adding nodes might not need it, tho (I'm thinking about doing a guided insertion, putting signatures closer to places where they have higher similarity, for example).make test
Did it pass the tests?make coverage
Is the new code covered?without a major version increment. Changing file formats also requires a
major version number increment.
changes were made?