Implement lightweight SBT combining/adding for large SBTs #229

Closed
ctb opened this issue May 17, 2017 · 3 comments

Comments

@ctb
Contributor

ctb commented May 17, 2017

In response to @meren, who wrote:

"I am sure there is a way to add more genomes (incrementally maybe? or by re-computing the entire thing?) to the database."

We can actually do this in a few different ways:

the heaviest weight way right now is to combine or update the database, which is not that time/resource intensive but is still inconvenient. (The database can be updated mostly incrementally; it’s a Sequence Bloom Tree underneath). We have a command line way to do this with ‘sourmash sbt_combine’.

the medium weight way (mostly just frustrating) is to have sbt_gather output unknown bits of the signature. Then you could do iterative search (run sbt_gather on database A, take what remains, run it on database B, etc.). There are many reasons to support it and it’s very easy so we will probably add it next time I need it myself.

the lightest weight way to do this is not yet supported but is an hour of hacking away: let the sbt_gather and sbt_search commands take multiple SBTs. The SBT search is very lightweight in terms of memory and resources (searching all of GenBank takes seconds and < 500 MB of RAM) and so simply doing 2x or 3x of them on multiple databases and then massaging the results is not difficult. But I am trying to be a bit careful about complexifying the command line so am hesitant to blindly add it. Easy to do once we need it, tho.
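
For readers who want to see the iterative search idea from the second option concretely, here is a rough sketch. It is not sourmash code: a "database" is modelled as a plain dict from genome name to a Python set of hashes, and `gather_once` and `iterative_gather` are hypothetical names made up for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical stand-in: a "database" maps a genome name to its set of hashes.
Database = Dict[str, Set[int]]


def gather_once(remaining: Set[int], db: Database) -> Tuple[List[str], Set[int]]:
    """Greedily assign hashes to the best-overlapping entries of one database."""
    matches: List[str] = []
    remaining = set(remaining)  # work on a copy, leave the caller's set alone
    while remaining:
        best_name, best_overlap = None, 0
        for name in sorted(db):  # sorted so ties are broken deterministically
            overlap = len(remaining & db[name])
            if overlap > best_overlap:
                best_name, best_overlap = name, overlap
        if best_name is None:  # nothing in this database matches any more
            break
        matches.append(best_name)
        remaining -= db[best_name]  # keep only the still-unexplained hashes
    return matches, remaining


def iterative_gather(query: Set[int], databases: List[Database]) -> Tuple[List[str], Set[int]]:
    """Run gather on database A, take what remains, run it on database B, etc."""
    all_matches: List[str] = []
    remaining = set(query)
    for db in databases:
        found, remaining = gather_once(remaining, db)
        all_matches.extend(found)
    return all_matches, remaining


db_a = {"g1": {1, 2, 3}, "g2": {3, 4}}
db_b = {"g3": {5, 6}}
matches, unassigned = iterative_gather({1, 2, 3, 4, 5, 9}, [db_a, db_b])
print(matches, unassigned)
```

Whatever is left in `unassigned` at the end is the part of the query that no database could explain.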

@luizirber
Member

On top of sbt_combine there is also PR 120 for adding a --append option to sbt_index, which would open an existing SBT and add new signatures to it.

And I actually think implementing all three options is useful; they cover many different use cases.
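
To illustrate why appending to an existing tree can be cheap, here is a toy sketch. It is an assumption-laden simplification and not how sourmash or PR 120 implements it: internal nodes hold plain Python sets (the union of the hashes below them) instead of Bloom filters, and `Node`, `append`, and `search` are hypothetical names. The point is only that adding a leaf touches the nodes on one root-to-leaf path.

```python
from typing import Iterable, List, Optional, Set


class Node:
    """Toy tree node; internal nodes store the union of all hashes below them."""

    def __init__(self, hashes: Iterable[int] = (), name: Optional[str] = None):
        self.hashes: Set[int] = set(hashes)
        self.name = name  # only set on leaves
        self.left: Optional["Node"] = None
        self.right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def append(root: Optional[Node], name: str, hashes: Set[int]) -> Node:
    """Add a new leaf, updating only nodes on one path; returns the new root."""
    if root is None:
        return Node(hashes, name)
    if root.is_leaf():
        # Split the leaf: a new internal node holds both the old and new leaf.
        parent = Node(root.hashes | hashes)
        parent.left, parent.right = root, Node(hashes, name)
        return parent
    root.hashes |= hashes  # keep the union at this internal node up to date
    # Descend toward the child sharing more hashes (ties go left).
    if len(root.left.hashes & hashes) >= len(root.right.hashes & hashes):
        root.left = append(root.left, name, hashes)
    else:
        root.right = append(root.right, name, hashes)
    return root


def search(root: Optional[Node], query: Set[int], min_overlap: int = 1) -> List[str]:
    """Return leaf names sharing at least min_overlap hashes, pruning subtrees."""
    if root is None or len(root.hashes & query) < min_overlap:
        return []
    if root.is_leaf():
        return [root.name]
    return search(root.left, query, min_overlap) + search(root.right, query, min_overlap)


tree = None
for sig_name, sig_hashes in [("g1", {1, 2}), ("g2", {3, 4}), ("g3", {1, 2, 5})]:
    tree = append(tree, sig_name, sig_hashes)
print(search(tree, {5}))
```

With real Bloom filters the overlap at internal nodes is only an estimate, so a search may descend into subtrees that turn out not to match; the pruning logic is otherwise the same idea.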

@ctb
Contributor Author

ctb commented May 19, 2017

#240 adds the second and third options - you can now do --output-unassigned to get unassigned hashes in a single signature, and you can do sourmash gather query.sig sbt1 sig2 sbt3 sig4 to search/gather multiple SBTs and signatures.
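
As a rough picture of what "search multiple databases and then massage the results" amounts to, here is a small sketch. It does not use the sourmash API: databases are plain dicts of Python sets, similarity is exact Jaccard on those sets, and `jaccard` and `search_many` are hypothetical helper names.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Exact Jaccard similarity between two hash sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def search_many(
    query: Set[int],
    databases: Dict[str, Dict[str, Set[int]]],
    threshold: float = 0.1,
) -> List[Tuple[float, str, str]]:
    """Search every database independently, then merge into one ranked list."""
    results: List[Tuple[float, str, str]] = []
    for db_label, db in databases.items():
        for name, hashes in db.items():
            score = jaccard(query, hashes)
            if score >= threshold:
                results.append((score, db_label, name))
    # Best score first; label and name break ties so the order is stable.
    results.sort(key=lambda r: (-r[0], r[1], r[2]))
    return results


dbs = {
    "A": {"g1": {1, 2, 3, 4}, "g2": {9}},
    "B": {"g3": {1, 2}},
}
hits = search_many({1, 2, 3, 4}, dbs)
print(hits)
```

Because each database is searched on its own, the cost grows linearly with the number of databases, and the merge step is just a sort.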

@ctb
Contributor Author

ctb commented May 21, 2017

Closed by #120 and #240.

@ctb ctb closed this as completed May 21, 2017