-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
support low-memory search of many on-disk signatures #1641
Comments
as a side note, once the rust layer can load from .zip files, I think we can support both greyhound and MAGsearch using Python parallelism, if we really wanted to. greyhound will still probably be much more efficient directly in Rust (b/c parallel search of many small signatures, in memory), but this offers a way to make MAGsearch (parallel search of many large signatures, loading from disk as needed) work across multiple cluster machines with e.g. snakemake parallelism. |
Together with #1619, the attached file create-sqlite3-mom.py.txt will create a sqlite3 database with a "manifest of manifests" loaded from a collection of index files (.sig, .zip, whatever). You can then run the following code with that sqlite3 database. import sys
from sourmash.index import LazyMultiIndex, LazyLoadedIndex
from sourmash.manifest import ManifestOfManifests
# load manifest entries from a SQLite database
m = ManifestOfManifests.load_from_sqlite(sys.argv[1])
print(len(m))
# create lazy-loaded index objects that will load from the given location.
# note: this doesn't actually open the files or anything, since manifests
# are provided.
idx_list = []
for l, m in m.locations_and_manifests():
idx = LazyLoadedIndex(l, m)
idx_list.append(idx)
# this first 'len' will just look at manifests
db = LazyMultiIndex.load(idx_list)
print('combined length:', len(db))
# similarly, this select will just operate on manifests
print('select ksize=31')
db = db.select(ksize=31)
print('combined length:', len(db))
# finally, we actually load the signatures!
print('actual loaded signatures:', len(list(db.signatures()))) |
I'm going to say that #1891 will close this. 🎉 |
Functionality like MAGsearch requires lazy loading, because not all the signatures can be loaded into memory at the same time. See e.g. #1096 for some discussion, and also #1097.
Now that manifests #1590 exist, and there are multiple classes that have manifests and support efficient random access loading of signatures (
ZipFileLinearIndex
andSBT
both!), and we support saving to zipfiles directly from withinsourmash sketch
, we have many more options 🎉 .(As a side note, one of the inefficiencies with #1097 is that the .sig files computed by wort and used by MAGsearch contain multiple signatures for each input (because multiple ksizes, and maybe moltypes). So .sig file loading for these is slower than it needs to be! Zip file linear collections are a better choice for storing things!)
So anyway @bluegenes and I were thinking about what we need to do to support use cases like MAGsearch (search 500,000 large metagenome signatures with a query!) via the Python interface (so, with picklists etc.), and came up with a fairly simple set of ideas that is mostly supported by existing functionality.
The plan is, briefly,
as a random aside, I wonder if sqlite might be a good format for the manifest-of-manifests file?
#1619 gets us 90% of the way there, we "just" need the manifest-of-manifests functionality, which is pretty simple.
a simple code example with
LazyMultiIndex
Using the
LazyMultiIndex
code in #1619, we can do the following quite easily and quickly; takes under a second for 64 files.(This uses the podar-ref collection of genomes, https://osf.io/vbhy5/.)
a few points of interest -
sourmash sig cat {fn}.fa.sig -o {fn}.zip
. This builds the manifests as well.LazyMultiIndex.load(...)
just takes the indices and builds a list of their manifests.mi.search(...)
; everything else is done on the manifests only.The text was updated successfully, but these errors were encountered: