can we make some of the 'sourmash signature' functions work on streams? #609

Closed
ctb opened this issue Jan 8, 2019 · 4 comments

Comments

@ctb
Contributor

ctb commented Jan 8, 2019

ref #587. For example, sourmash sig describe could certainly be made streaming; it would be interesting to explore this for other commands too, as an alternative to adding detailed, command-specific parameters as in e.g. #560.

@ctb
Contributor Author

ctb commented Jan 9, 2019

thinking more about this, what I'm really after is chaining (ref).

In particular, I'd really like to be able to do something like this:

<trim command that outputs sequences to stdout> | sourmash monitor -o sig1 |
    <another trim/whatever command> | sourmash monitor -o sig2 |
    <streaming assembler> | sourmash monitor -o sig3

which would let us watch loss of k-mers etc from these commands. (I'm already using this kind of thing in spacegraphcats to monitor information loss at various steps.)

More specifically to this issue, I have visions of

sourmash sig extract --md5=... -k 31 |
     sourmash sig rename foo - |
     sourmash sig downsample -o search.sig

or even more ambitiously

sourmash sig extract <list of gather results> |
   sourmash compare -o foo -

Basically, I'd like to be able to bypass a bunch of temp files and do various powerful things on the fly.

cc @standage

@standage
Contributor

standage commented Jan 9, 2019

To be clear, you're suggesting a tube that takes in a stream of sequences, maintains a minhash sketch of incoming k-mers, and spits out each sequence unmodified? Sounds super useful, and (although I know little about how sourmash is implemented) should be straightforward I would think.

If you wanted to fully support a continuous data stream, you'd probably want to write the minhash sketch to disk at regular intervals. But I doubt that use case is in high demand just yet. :-)
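
The pass-through idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how sourmash implements anything: `BottomKSketch` and `monitor` are hypothetical names, the hash function (BLAKE2b truncated to 64 bits) is an arbitrary stand-in, the input is assumed to be FASTA, and reverse-complement canonicalization of k-mers is skipped for brevity.

```python
import hashlib


class BottomKSketch:
    """Toy bottom-k MinHash (hypothetical, not sourmash's MinHash class)."""

    def __init__(self, ksize=31, num=500):
        self.ksize = ksize
        self.num = num
        self.hashes = set()

    def add_sequence(self, seq):
        """Hash every k-mer in seq, then keep only the num smallest hashes."""
        seq = seq.upper()
        for i in range(len(seq) - self.ksize + 1):
            kmer = seq[i:i + self.ksize]
            digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
            self.hashes.add(int.from_bytes(digest, "big"))
        if len(self.hashes) > self.num:
            self.hashes = set(sorted(self.hashes)[:self.num])


def monitor(instream, outstream, sketch):
    """Copy FASTA lines through unchanged while sketching the sequences."""
    chunks = []
    for line in instream:
        outstream.write(line)  # pass the record on unmodified
        if line.startswith(">"):
            # A new header closes the previous record, so sketch it now.
            sketch.add_sequence("".join(chunks))
            chunks = []
        else:
            chunks.append(line.strip())
    sketch.add_sequence("".join(chunks))  # last record in the stream
    return sketch
```

Calling `monitor(sys.stdin, sys.stdout, BottomKSketch())` would place this between two commands in a pipe. To support a continuous stream, the loop could also dump `sketch.hashes` to disk every N records, as suggested above.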

@ctb
Contributor Author

ctb commented Jan 9, 2019 via email

@ctb
Contributor Author

ctb commented Jul 3, 2020

I think this was fixed by #1059, which fixed #1049 ... confirmation:

sourmash sig cat tests/test-data/prot/protein/GCA_001593935.1_ASM159393v1_protein.faa.gz.sig tests/test-data/prot/protein/GCA_001593935.1_ASM159393v1_protein.faa.gz.sig | sourmash compare - -o xxx

works just fine.
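
In the command above, the `-` argument stands for standard input. For anyone wiring the same convention into their own script, here is a minimal sketch; `open_input` is a hypothetical helper, not part of sourmash, and it assumes plain-text input.

```python
import sys
from contextlib import contextmanager


@contextmanager
def open_input(path):
    """Yield a readable text stream; the path "-" means standard input."""
    if path == "-":
        # Leave stdin open: we did not open it, so we should not close it.
        yield sys.stdin
    else:
        with open(path, "r") as handle:
            yield handle
```

A script would then use `with open_input(args.filename) as fp:` and read from `fp` the same way in both cases.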

@ctb ctb closed this as completed Jul 3, 2020