Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] add sig fileinfo and standard manifest-generating functionality #1837

Merged
merged 51 commits into from
Feb 24, 2022

Conversation

ctb
Copy link
Contributor

@ctb ctb commented Feb 12, 2022

Inspired by the discussion in #1830, this PR adds sourmash sig fileinfo for summary information on a path, and uses that functionality as inspiration to add generic get_manifest(...) internal functionality.

Specifically, this PR:

  • adds sourmash sig fileinfo which provides a useful human-readable summary of a path; see below for example output.
  • provides a new standard function get_manifest(idx, *, require=True, rebuild=False) that can be used to get a manifest for any index object.
  • refactors extract to use manifests, which should be (much) faster for databases
  • updates the Index base class to have __len__

In addition, this PR:

  • adds __len__() and manifest-generating support to LCA_Database
  • documents sourmash_args a bit in the module docstring
  • fixes MultiIndex.location to work properly in all circumstances
  • adds manifest support to ZipFileLinearIndex.__len__, which speeds it up tremendously
  • fixes a general problem with loading manifest column with_abundance from a CSV
  • signatures from stdin are now put in a MultiIndex class with manifest support, instead of just a LinearIndex; this means that all sourmash CLI functions now use Index classes with manifests
  • adds __eq__ to manifest objects so that m1 == m2 is based on content equality, not object equality

Fixes #1190
Fixes #1439

Example output

% == This is sourmash version 4.2.5.dev19+g3a6028fb.d20220217. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'tests/test-data/prot/all.zip'
path filetype: ZipFileLinearIndex
location: /Users/t/dev/sourmash/tests/test-data/prot/all.zip
is database? yes
has manifest? yes
num signatures: 8
** examining manifest...
31758 total hashes
summary of sketches:
   2 sketches with dayhoff, k=19, scaled=100          7945 total hashes
   2 sketches with hp, k=19, scaled=100               5184 total hashes
   2 sketches with protein, k=19, scaled=100          8214 total hashes
   2 sketches with DNA, k=31, scaled=1000             10415 total hashes

Limitations

sourmash sig extract will not work when used with --picklist on certain database types (LCA DBs, SBTs, and zipfiles w/o manifests). This is because we now convert manifests into picklists to do the extraction and these database types do not support multiple picklists. See #1848 for one possible resolution.

IMO this is a pretty minor limitation so I decided not to complicate this PR by fixing it. Let me know if you disagree :).

See test test_sig_extract_8_picklist_md5_lca for a test that demonstrates this behavior.

Questions

  • should we use sourmash sig summarize instead of, or in addition to, fileinfo? I'm not sure fileinfo is the first place I'd look for this functionality, but it is very descriptive 😆
  • do we want a human-readable "filetype" that's more informative for e.g. .sig files and directories? Right now it just says "MultiIndex" which doesn't really tell you what was loaded (since MultiIndex is used for many things)

TODO

  • provide standard argparse functions to force rebuild of manifest
  • figure out what idx.location is so often None!? (for .sig files and directories)
  • also check is empty/Index.bool and Index.len stuff; update Index spec?
  • add YAML output to fileinfo
  • add tests for get_manifest(...) on various class types
  • add tests for fileinfo on all the index types
  • add tests for MultiIndex.location fix
  • add tests for LCA_Database use of __len__ and new manifest capability
  • add tests for ZipFileLinearIndex proper use of manifests + select
  • add documentation for different database types?
  • mine the comments for issues
  • expand the docstring at top of sourmash_args
  • provide information on combinations of sketches
  • test extract on database types that don't support multiple rounds of picklists
  • look at stdin loading

@codecov
Copy link

codecov bot commented Feb 12, 2022

Codecov Report

Merging #1837 (d4facce) into latest (530d833) will increase coverage by 0.33%.
The diff coverage is 96.80%.

Impacted file tree graph

@@            Coverage Diff             @@
##           latest    #1837      +/-   ##
==========================================
+ Coverage   81.92%   82.25%   +0.33%     
==========================================
  Files         118      119       +1     
  Lines       12783    12918     +135     
  Branches     1705     1727      +22     
==========================================
+ Hits        10472    10626     +154     
+ Misses       2047     2026      -21     
- Partials      264      266       +2     
Flag Coverage Δ
python 90.31% <96.80%> (+0.37%) ⬆️
rust 65.47% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/sourmash/lca/lca_db.py 91.27% <81.81%> (-0.32%) ⬇️
src/sourmash/sig/__main__.py 91.75% <96.00%> (+1.09%) ⬆️
src/sourmash/cli/sig/__init__.py 100.00% <100.00%> (ø)
src/sourmash/cli/sig/fileinfo.py 100.00% <100.00%> (ø)
src/sourmash/cli/sig/manifest.py 100.00% <100.00%> (ø)
src/sourmash/index/__init__.py 96.45% <100.00%> (+0.48%) ⬆️
src/sourmash/manifest.py 90.51% <100.00%> (+0.25%) ⬆️
src/sourmash/sourmash_args.py 93.17% <100.00%> (+0.28%) ⬆️
src/sourmash/sbt.py 85.04% <0.00%> (+2.15%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 530d833...d4facce. Read the comment docs.

@ctb
Copy link
Contributor Author

ctb commented Feb 15, 2022

note from chatting with @bluegenes : the combinatorial information would be good (this ksize,scaled,moltype => this many sketches)

@ctb

This comment was marked as resolved.

@ctb

This comment was marked as resolved.

@ctb

This comment was marked as resolved.

@ctb

This comment was marked as resolved.

@ctb
Copy link
Contributor Author

ctb commented Feb 22, 2022

Note to all: this adds yaml as a sourmash requirement. I like yaml output, but @luizirber has made the point that YAML is ill-specified elsewhere. Maybe we should make this optional - if you have it, you can use sourmash sig fileinfo --yaml-out ..., if you don't, you can't.

@sourmash-bio/devs ready for initial review! But not merge yet.

@mr-eyes mr-eyes added the dependencies Pull requests that update a dependency file label Feb 23, 2022
@ctb ctb removed the dependencies Pull requests that update a dependency file label Feb 23, 2022
@ctb
Copy link
Contributor Author

ctb commented Feb 23, 2022

removed pyyaml altogether, as an unnecessary complication. Ready for review!

@ctb ctb changed the title [WIP] add sig fileinfo and standard manifest-generating functionality [MRG] add sig fileinfo and standard manifest-generating functionality Feb 23, 2022
Copy link
Contributor

@bluegenes bluegenes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. Left a few minor comments/suggestions.

Quite excited about fileinfo, more unified manifest functionality, and manifest-based db sig extraction 🎉

src/sourmash/sourmash_args.py Outdated Show resolved Hide resolved
tests/test_cmd_signature_fileinfo.py Outdated Show resolved Hide resolved
doc/command-line.md Outdated Show resolved Hide resolved
doc/command-line.md Show resolved Hide resolved
doc/command-line.md Show resolved Hide resolved
@bluegenes
Copy link
Contributor

Closes #1190, #1439

thanks @bluegenes!

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>
@ctb
Copy link
Contributor Author

ctb commented Feb 24, 2022

Closes #1190, #1439

hah! Hadn't even noticed!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

provide a summarization command for signature collections sourmash index describe (or similar)
3 participants