
support low-memory search of many on-disk signatures #1641

Closed
ctb opened this issue Jun 26, 2021 · 3 comments · Fixed by #1891
ctb commented Jun 26, 2021

Functionality like MAGsearch requires lazy loading, because not all the signatures can be loaded into memory at the same time. See e.g. #1096 for some discussion, and also #1097.

Now that manifests #1590 exist, and there are multiple classes that have manifests and support efficient random access loading of signatures (ZipFileLinearIndex and SBT both!), and we support saving to zipfiles directly from within sourmash sketch, we have many more options 🎉 .

(As a side note, one of the inefficiencies with #1097 is that the .sig files computed by wort and used by MAGsearch contain multiple signatures per input, because multiple ksizes and possibly moltypes were computed. So .sig file loading for these is slower than it needs to be! Zip file linear collections are a better choice for storing things!)

So anyway @bluegenes and I were thinking about what we need to do to support use cases like MAGsearch (search 500,000 large metagenome signatures with a query!) via the Python interface (so, with picklists etc.), and came up with a fairly simple set of ideas that is mostly supported by existing functionality.

The plan is, briefly,

  • for many samples, store signatures in .zip files by sample ID; each zip file contains a manifest of signatures.
  • maintain a separate "manifest of manifests" file that contains all the manifest entries, together with their originating sample ID/zip file;
  • as new signatures are calculated, update the manifest of manifests file appropriately;
  • do searches using this manifest-of-manifests file to figure out which zip file and signature(s) to retrieve for each sample.
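The plan above can be sketched in miniature with plain CSV rows; the column names here are illustrative, not sourmash's actual manifest schema:

```python
import csv
import io

# Hypothetical manifest-of-manifests: each row combines one zip file's
# manifest entry with its originating sample ID / location.
rows = [
    {"sample_id": "S001", "zipfile": "sigs/S001.zip", "md5": "abc123", "ksize": 31},
    {"sample_id": "S002", "zipfile": "sigs/S002.zip", "md5": "def456", "ksize": 21},
]

# write the combined manifest out as CSV
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["sample_id", "zipfile", "md5", "ksize"])
writer.writeheader()
writer.writerows(rows)

# a search first filters rows on manifest metadata alone (e.g. ksize)...
matches = [r for r in rows if r["ksize"] == 31]
# ...and only then opens the zip files that actually matched.
zips_to_open = {r["zipfile"] for r in matches}
```

The point of the design is visible in the last two lines: signature loading is deferred until after the metadata filter has narrowed things down to specific zip files.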

as a random aside, I wonder if sqlite might be a good format for the manifest-of-manifests file?
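One possible (hypothetical) shape for a sqlite-backed manifest-of-manifests, using only the stdlib sqlite3 module; the table and column names are assumptions for illustration, not sourmash's schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE manifest (
        location TEXT NOT NULL,   -- originating zip file / sample
        md5 TEXT NOT NULL,        -- signature checksum
        ksize INTEGER,
        moltype TEXT
    )
""")
# index the columns we expect to select on
conn.execute("CREATE INDEX idx_manifest_ksize ON manifest (ksize)")
conn.execute("INSERT INTO manifest VALUES (?, ?, ?, ?)",
             ("sigs/S001.zip", "abc123", 31, "DNA"))
conn.commit()

# selection on ksize touches only the manifest table, never the signatures:
locs = [row[0] for row in
        conn.execute("SELECT DISTINCT location FROM manifest WHERE ksize = ?",
                     (31,))]
```

sqlite would also make the "update the manifest of manifests as new signatures are calculated" step a simple INSERT rather than rewriting a whole CSV.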

#1619 gets us 90% of the way there; we "just" need the manifest-of-manifests functionality, which is pretty simple.

a simple code example with LazyMultiIndex

Using the LazyMultiIndex code in #1619, we can do the following quite easily and quickly; takes under a second for 64 files.

(This uses the podar-ref collection of genomes, https://osf.io/vbhy5/.)

import glob
import sourmash
from sourmash.index import LazyMultiIndex

zipfiles = glob.glob('podar-ref/*.zip')
print(f'loading {len(zipfiles)} zipfiles...')
idxlist = [sourmash.load_file_as_index(z) for z in zipfiles]
print('...done')
mi = LazyMultiIndex.load(idxlist)

q = sourmash.load_one_signature('podar-ref/1.fa.sig', ksize=31)
mi = mi.select(ksize=31)
print(list(mi.search(q, threshold=0)))

a few points of interest -

  • to construct the .zip files from the existing .sig files, I just did sourmash sig cat {fn}.fa.sig -o {fn}.zip. This builds the manifests as well.
  • the LazyMultiIndex.load(...) just takes the indices and builds a list of their manifests.
  • no database signatures are actually loaded until the very last line, the mi.search(...); everything else is done on the manifests only.
  • the database signatures are loaded, queried, and then become available for unloading (currently via Python's default garbage collection, though there's a way to force it if we care).
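The "force it" mentioned above is just Python's gc module; a minimal illustration with a stand-in class (not an actual sourmash signature):

```python
import gc
import weakref

class Signature:
    """Stand-in for a loaded database signature."""
    pass

sig = Signature()
alive = weakref.ref(sig)  # observe the object without keeping it alive

# dropping the last strong reference makes the object collectable...
del sig
# ...and gc.collect() forces an immediate collection pass if we care.
gc.collect()
```

After this, `alive()` returns None, i.e. the memory has been reclaimed rather than waiting for the collector's own schedule.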
ctb commented Jun 26, 2021

as a side note, once the rust layer can load from .zip files, I think we can support both greyhound and MAGsearch using Python parallelism, if we really wanted to. greyhound will still probably be much more efficient directly in Rust (b/c parallel search of many small signatures, in memory), but this offers a way to make MAGsearch (parallel search of many large signatures, loading from disk as needed) work across multiple cluster machines with e.g. snakemake parallelism.

ctb commented Jun 27, 2021

Together with #1619, the attached file create-sqlite3-mom.py.txt will create a sqlite3 database with a "manifest of manifests" loaded from a collection of index files (.sig, .zip, whatever). You can then run the following code with that sqlite3 database.

import sys
from sourmash.index import LazyMultiIndex, LazyLoadedIndex
from sourmash.manifest import ManifestOfManifests

# load manifest entries from a SQLite database
m = ManifestOfManifests.load_from_sqlite(sys.argv[1])
print(len(m))

# create lazy-loaded index objects that will load from the given location.
# note: this doesn't actually open the files or anything, since manifests
# are provided.
idx_list = []
for location, manifest in m.locations_and_manifests():
    idx = LazyLoadedIndex(location, manifest)
    idx_list.append(idx)

# this first 'len' will just look at manifests
db = LazyMultiIndex.load(idx_list)
print('combined length:', len(db))

# similarly, this select will just operate on manifests
print('select ksize=31')
db = db.select(ksize=31)
print('combined length:', len(db))

# finally, we actually load the signatures!
print('actual loaded signatures:', len(list(db.signatures())))

ctb commented Mar 26, 2022

I'm going to say that #1891 will close this. 🎉
