Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MRG: use core Collection and Select for sig loading #197

Merged
merged 58 commits into from
Feb 13, 2024

Conversation

bluegenes
Copy link
Contributor

@bluegenes bluegenes commented Jan 25, 2024

This PR replaces signature loading code with calls to supported functions from sourmash core 0.12.1. It also standardizes csv writing for most functions and standardizes sig loading error handling.

Before this PR, signature loading used custom preparation functions to load all signatures into memory (selecting those with matching parameters). For zipfile reading, we had a hacky implementation that copied all files out of the zip into a temporary location first, which then allowed us to load them properly.

With this PR, we now use sourmash Collection and Select for signature loading. For zip files, this allows lazy loading for signatures, meaning we do all parameter selection via the manifest and only load signatures into memory when we actually want to use them. Downsampling happens at the time of signature loading, either via Select or passing downsample to count_common or similar. For sig or pathlist input, we first load all signatures into memory and generate collection with manifest. We then continue as before, selecting on the manifest first to remove signatures that don't match user-specified parameters (ksize, scaled).

Note that since default core loading of sigs and pathlists did not allow failed sigs, I reimplemented loading here to allow for skipped and failed paths.

Closes #196.

Requirements

Requires changes from:

I think we should release a new core for this and use it here prior to merge.

Functionality Changes

  • we now select compatible signatures while loading collections (via manifest select), and then downsample when we actually access the signature/sketch.
  • Most functions standardized to exit gracefully (as was done in fastgather) if no sigs are loaded for either query or against. This is now checked and handled as part of collection loading.
  • In non-indexed fastmultigather, query-based prefetch/gather output filenames change slightly. I realized after changing over to query.name that we do have access to record.internal_location(). However, this doesn't always exist? Would appreciate your thoughts on the new structure @ctb. Can switch back if you think it's needed/better.

New functions:

  • load_collection: Unified signature Collection loading from zipfile, pathlist, or sigs (no more hacky zip loading!!! and use manifests for selection!)
  • report_on_collection_loading: report n_failed, n_skipped, n_loaded. replaces report_on_sketch_loading.
  • build_selection: replaces build_template. Builds a selection object we can use to select by ksize, scaled, etc

Modified Utility Functions

  • load_sketches - now loads all sketches from a collection into memory as SmallSignature objects, downsampling along the way. Used anywhere we load a full set of minhashes into memory (pairwise, multisearch, manysearch, fastgather, fastmultigather)

These were modified to use collections and SigStore/Signature rather than SmallSignature for places where we load a single signature at a time:

  • write_prefetch
  • load_sketches_above_threshold
  • consume_query_by_gather

Minor Refactoring

  • Use *Result structures where they were missing to standardize writing and make adding columns easy
  • use the standardized thread opening/header writing that comes with using *Result objects with csv_threadwriter
  • use serde::Serialize to write structs rather than bespoke traits.

Deleted Functions

  • load_sigpaths_from_zip
  • load_sketches_from_zip
  • load_sigpaths_from_zip_or_pathlist
  • load_sketches_from_zip_or_pathlist
  • load_sketchlist_filenames
  • build_template
  • check_compatible_downsample
  • report_on_sketch_loading
  • prepare_query

Punted Issues

Questions (Resolved in review)

Do we want functions to exit gracefully if no signatures are loaded for query or against, or do we want it to fail? It was done differently between different subcommands. I standardized to exit gracefully (as was done in fastgather), but happy to change back.

@bluegenes
Copy link
Contributor Author

bluegenes commented Feb 13, 2024

@ctb ok - updated description, added a test for names with spaces, and made a few doc changes.

Some things you may want to think on before merge:

  1. When loading pathlists, we can't actually lazy load. We need to first load all signatures into memory in order to build a collection. I'm not sure what the performance implications are for the different search commands - memory more likely to be an issue for very large collections.

  2. Just trying to confirm you understood what I meant re:file naming changing for non-indexed fastmultigather: the doc changes might better help you see what I mean -- {basename} essentially got swapped out for {signame}, since I didn't know how to access the file basename from a sig in a collection.

@ctb
Copy link
Collaborator

ctb commented Feb 13, 2024

@ctb ok - updated description, added a test for names with spaces, and made a few doc changes.

Some things you may want to think on before merge:

  1. When loading pathlists, we can't actually lazy load. We need to first load all signatures into memory in order to build a collection. I'm not sure what the performance implications are for the different search commands - memory more likely to be an issue for very large collections.

This is as opposed to before, when for some commands, like gather, it would filter them on load, right?

That's ok. It joins regular sourmash in that regard 😓 sourmash-bio/sourmash#1899

  1. Just trying to confirm you understood what I meant re:file naming changing for non-indexed fastmultigather: the doc changes might better help you see what I mean -- {basename} essentially got swapped out for {signame}, since I didn't know how to access the file basename from a sig in a collection.

Yep!

OK, will look at it as soon as I can. Thanks!!

@ctb
Copy link
Collaborator

ctb commented Feb 13, 2024

build is reporting this; intentional? (UPDATED to use the right branch 😭 )

warning: unused import: `sourmash::selection`
 --> src/utils.rs:5:5
  |
5 | use sourmash::selection;
  |     ^^^^^^^^^^^^^^^^^^^
  |
  = note: `#[warn(unused_imports)]` on by default

@ctb

This comment was marked as outdated.

@ctb
Copy link
Collaborator

ctb commented Feb 13, 2024

More questions / thoughts:

  • the multisearch docs should be changed to emphasize zip files. This can be punted to an issue.
  • does the switch to Collection mean that manysearch (formerly known as sra_search) will attempt to load all the metagenomes into memory, rather than loading them one at a time? If so that seems like a problem, since the original reason for manysearch was precisely to do petabase scale search of metagenomes, which we would no longer be able to do.
  • "Notes on concurrency and efficiency" and the index section below might also need to be revisited. Are we now recommending indexes?

@ctb

This comment was marked as outdated.

@ctb
Copy link
Collaborator

ctb commented Feb 13, 2024

(turns out I was looking at the wrong branch 😅 - updating things above now!)

@ctb
Copy link
Collaborator

ctb commented Feb 13, 2024

'fastgather' hangs when using a rocksdb:

% sourmash scripts fastgather SRR606249.trim.k31.sig.gz podar-ref.rocksdb -o g.csv

== This is sourmash version 4.8.7.dev0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

=> sourmash_plugin_branchwater 0.8.7.dev0; cite Irber et al., doi: 10.1101/2022.11.02.514947

ksize: 31 / scaled: 1000 / moltype: DNA / threshold bp: 50000
gathering all sketches in 'SRR606249.trim.k31.sig.gz' against 'podar-ref.rocksdb' using 8 threads
Reading query(s) from: 'SRR606249.trim.k31.sig.gz'
WARNING: skipped 2 query paths - no compatible signatures.
Loaded 1 query signature(s)
Reading search(s) from: 'podar-ref.rocksdb'

...waiting for over a minute now.

@ctb
Copy link
Collaborator

ctb commented Feb 13, 2024

ok, I think those are my main questions/comments now (I hid the outdated ones from using the wrong branch 😊 )

@bluegenes
Copy link
Contributor Author

bluegenes commented Feb 13, 2024

'fastgather' hangs when using a rocksdb:

I disabled trying to load rocksdb within load_collection to prevent using rocksdb files outside of fastmultigather and manysearch, which are the only commands that can use them so far.

does the switch to Collection mean that manysearch (formerly known as sra_search) will attempt to load all the metagenomes into memory, rather than loading them one at a time? If so that seems like a problem, since the original reason for manysearch was precisely to do petabase scale search of metagenomes, which we would no longer be able to do.

Actually, it's not nearly as bad as I was thinking - I just needed to think about InnerStorage for a minute. So the sigs are read into memory, but only while building the manifest. A Collection is made up of Manifest and InnerStorage, which is a wrapper for MemStorage, FSStorage, or ZipStorage. When reading a pathlists, we store using FSStorage, which just stores the filepaths. I think this means the sigs themselves are just read in to build the records for the Manifest, and then immediately dropped. Then they are loaded again later from the Collection. This certainly incurs memory and time cost, but nothing like keeping the sigs around for the entirety of search.

I think we can also solve this by allowing manifest input to avoid needing to regenerate, added in #211.

"Notes on concurrency and efficiency" and the index section below might also need to be revisited. Are we now recommending indexes?

Only pathlist sig handling changed re: indexed vs non-indexed search. So, we should recommend zip files over pathlists to avoid the extra work of generating the manifest. With #211, we can recommend manifests or zip files over pathlists.

@ctb
Copy link
Collaborator

ctb commented Feb 13, 2024

created issue:

I think we should go ahead and merge this and then wait for #211 and #212 before releasing a new version?

@ctb
Copy link
Collaborator

ctb commented Feb 13, 2024

(I think the zip file loading should win over everything else ;)

@bluegenes
Copy link
Contributor Author

I think we should go ahead and merge this and then wait for #211 and #212 before releasing a new version?

(I think the zip file loading should win over everything else ;)

both sound good to me!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

use updated sourmash core code more broadly
3 participants