Add ability to adjust --num-hashes, --scaled on-the-fly in 'compare' and 'categorize' #560

Closed
olgabot opened this issue Oct 26, 2018 · 11 comments

Comments

@olgabot
Collaborator

olgabot commented Oct 26, 2018

I'd like to be able to calculate ONE "deep" signature of e.g. 10k kmers for each sample and then downsample it to 5k kmers, 1k kmers, 500 kmers, 100 kmers, to see how that affects the similarity and nearest neighbors.
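
For illustration, the workflow requested here can be sketched in plain Python. This is a minimal sketch under stated assumptions, not sourmash code: `hash_item`, `bottom_k`, `downsample_num`, and `jaccard` are hypothetical helper names, the hash is a BLAKE2b stand-in rather than the hash sourmash uses, and the two "samples" are synthetic. It builds one deep bottom-k sketch per sample, then truncates it to smaller sizes and re-estimates the similarity at each size.

```python
import hashlib


def hash_item(item: str) -> int:
    """Stand-in 64-bit hash (an assumption; sourmash uses a different hash)."""
    digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_k(items, num):
    """Build a num-style sketch: the `num` smallest distinct hashes, sorted."""
    return sorted({hash_item(i) for i in items})[:num]


def downsample_num(sketch, num):
    """Shrink a sorted bottom-k sketch; it cannot be grown after the fact."""
    if num > len(sketch):
        raise ValueError(f"cannot downsample {len(sketch)} hashes to {num}")
    return sketch[:num]


def jaccard(a, b):
    """Estimate Jaccard similarity from two bottom-k sketches of equal size."""
    if len(a) != len(b):
        raise ValueError("sketches must have the same number of hashes")
    # The smallest len(a) hashes of the union form the sketch of the union.
    merged = sorted(set(a) | set(b))[: len(a)]
    shared = set(a) & set(b)
    return sum(h in shared for h in merged) / len(merged)


# Two synthetic samples that share half of their combined items
# (true Jaccard similarity = 20,000 / 40,000 = 0.5).
sample_a = [f"kmer{i}" for i in range(0, 30_000)]
sample_b = [f"kmer{i}" for i in range(10_000, 40_000)]

# Compute the "deep" sketch once per sample...
deep_a = bottom_k(sample_a, 10_000)
deep_b = bottom_k(sample_b, 10_000)

# ...then reuse it at every smaller size.
estimates = {}
for num in (10_000, 5_000, 1_000, 500, 100):
    estimates[num] = jaccard(downsample_num(deep_a, num),
                             downsample_num(deep_b, num))
    print(f"num={num:>6}: Jaccard ~ {estimates[num]:.3f}")
```

Because a bottom-k sketch of size n is a prefix of the sorted sketch of any larger size, truncating the deep sketch gives the same result as sketching the sample again at the smaller size, so only one pass over the data is needed.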

@ctb
Contributor

ctb commented Oct 26, 2018 via email

@olgabot
Collaborator Author

olgabot commented Oct 26, 2018

Thanks for the quick response! My issue is that droplet-based methods (~10k reads/cell) are much shallower than full-transcript methods (~500k reads/cell) and I'd like to use the same number of kmers/hashes across both, not just the same scaling factor.

@ctb
Contributor

ctb commented Oct 26, 2018 via email

@luizirber
Member

would something like #538 help?

@olgabot
Collaborator Author

olgabot commented Oct 31, 2018

@luizirber Yes, it would be very helpful to be able to switch between --scaled and --num-hashes! I have many, many signatures (10/cell, 1000 cells) that are essentially redundant because they were computed at different scales, and I hadn't noticed the --scaled feature in compare before.

@ctb
Contributor

ctb commented Oct 31, 2018 via email

@olgabot
Collaborator Author

olgabot commented Oct 31, 2018

that would be very helpful! and it would definitely reduce the number of redundant computations I'm doing

@ctb
Contributor

ctb commented Dec 27, 2018

sorry for dropping the ball on this...

but, looking at #587, I wonder if this functionality fits under downsample? I could imagine adding a flag to the command like so:

sourmash signature downsample --num 500 file1.sig

and/or

sourmash signature downsample --scaled 1000 file1.sig

where --num and --scaled are (for the moment) mutually exclusive, and the command fails when the downsampling cannot be done properly. Whaddya think?
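
The behavior proposed above could look roughly like the following in plain Python. This is a sketch under assumptions, not the sourmash implementation: the `Sketch` class and `downsample` function are hypothetical, and the cutoff `2**64 // scaled` is an approximation of how a scaled sketch bounds its hash values (the exact rounding in sourmash may differ). It accepts exactly one of `num` or `scaled` and raises when the request cannot be satisfied from the stored hashes.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

HASH_SPACE = 2**64  # assumed size of the 64-bit hash space


@dataclass(frozen=True)
class Sketch:
    """Hypothetical container: sorted hashes plus how they were selected."""
    hashes: Tuple[int, ...]
    num: int = 0      # > 0 for a fixed-size (bottom-k) sketch
    scaled: int = 0   # > 0 for a scaled sketch (keeps ~1/scaled of hashes)


def downsample(sketch: Sketch, *, num: Optional[int] = None,
               scaled: Optional[int] = None) -> Sketch:
    # The two options are mutually exclusive: exactly one must be given.
    if (num is None) == (scaled is None):
        raise ValueError("specify exactly one of num or scaled")

    if num is not None:
        # A num sketch can only shrink, and only from another num sketch.
        if sketch.num == 0 or num > sketch.num:
            raise ValueError(f"cannot downsample to num={num}")
        return Sketch(sketch.hashes[:num], num=num)

    # A scaled sketch can only move to a larger scaled value (fewer hashes).
    if sketch.scaled == 0 or scaled < sketch.scaled:
        raise ValueError(f"cannot downsample to scaled={scaled}")
    max_hash = HASH_SPACE // scaled
    kept = tuple(h for h in sketch.hashes if h < max_hash)
    return Sketch(kept, scaled=scaled)


# Toy usage: a scaled=1 sketch retains every hash it was given.
full = Sketch(hashes=(10, 2**60, 2**62, 2**63), scaled=1)
print(downsample(full, scaled=4).hashes)  # keeps hashes below 2**62

try:
    downsample(full, num=2)  # num requested from a scaled sketch
except ValueError as err:
    print("refused:", err)
```

Refusing requests that would need hashes the sketch never stored (a larger `num`, or a smaller `scaled`) keeps a downsampled sketch identical to one computed directly at the target setting, which is what makes comparisons between them valid.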

@ctb
Contributor

ctb commented Dec 27, 2018

Also, in #436 I added some example Python code :)

@ctb
Contributor

ctb commented Jul 3, 2020

#1072 would resolve this, I think.

@ctb
Contributor

ctb commented Sep 23, 2021

The core functionality is available in sourmash sig downsample, and the selector framework discussed in #1524 is probably the right place to implement something generic in the future.

Closing until the specific CLI functionality is requested again :)

@ctb ctb closed this as completed Sep 23, 2021