Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DNA moltype output in manifest is dna not DNA #49

Closed
ctb opened this issue Jun 7, 2024 · 1 comment · Fixed by sourmash-bio/sourmash#3193 or #52
Closed

DNA moltype output in manifest is dna not DNA #49

ctb opened this issue Jun 7, 2024 · 1 comment · Fixed by sourmash-bio/sourmash#3193 or #52

Comments

@ctb
Copy link
Contributor

ctb commented Jun 7, 2024

When I run the quickstart and then unzip the zip file and look at the manifest,

unzip -p test-gbsketch.zip SOURMASH-MANIFEST.csv

I see that the moltype is dna, not DNA.

# SOURMASH-MANIFEST-VERSION: 1.0
internal_location,md5,md5short,ksize,moltype,num,scaled,n_hashes,with_abundance,name,filename
signatures/8196cc69b814cab9fc3c7e9eed861bf6.sig.gz,8196cc69b814cab9fc3c7e9eed861bf6,8196cc69,31,dna,0,1000,2558,1,GCA_000175555.1 ACUK01000506.1 Saccharolobus solfataricus 98/2,GCA_000175555.1_genomic.fna.gz

It should be DNA, all caps.

This appears to be one of the things confusing sourmash over in sourmash-bio/sourmash#3191.

I'll discuss the sourmash fails here over in that issue, but the moltype fix maybe belongs here? I can't quite figure out where it's happening - maybe the Rust manifest code in sourmash needs to be fixed?

ctb added a commit to sourmash-bio/sourmash that referenced this issue Jun 9, 2024
…d by plugins (#3193)

This PR fixes a bug in `disk_revindex.rs::RevIndexOps::gather` where
RocksDB-based `gather` was
incorrectly subtracting hashes multiple times from the counter in
situations of high redundancy.

For example, consider this Venn diagram of the 3-way intersection
between three sketches:


![image](https://github.com/sourmash-bio/sourmash/assets/51016/f98bf6a0-acc4-415f-bf39-1c48a599996a)

When a metagenome contains the union of all three of these sketches, the
broken implementation would subtract the `10` at the center multiple
times. This was caused by removing hashes from the matches, rather than
the intersection, each pass through the counter.

Of note, this made RocksDB-based `fastmultigather` return incorrect
results, ref
sourmash-bio/sourmash_plugin_branchwater#322;
first discovered in
#3138 (comment).

The PR fixes this, and adds a more complete pair of tests, based on
`test_gather_metagenome_num_results` in the Python tests for sourmash.

This PR also adjusts the hash function string for DNA sketches in Rust
to be uppercase `DNA` rather than lowercase `dna`, ref
sourmash-bio/sourmash_plugin_directsketch#49

And remember, it's not *just* the destination - it's the friends you
make along the way, like `env_logger`.

* Fixes #3139
* Fixes
sourmash-bio/sourmash_plugin_directsketch#49

For consideration:

Right now we are calculating the intersection twice, once in
`disk_revindex.rs` and once in `calculate_gather_stats` in
`revindex/mod.rs`. This is unnecessary, of course. But the function
signature for `calculate_gather_stats` would need to change to take the
intersection as an argument. We could:
* keep calculating it twice, just for simplicity;
* change `calculate_gather_stats` to take an _optional_ intersection,
and calculate it if not provided;
* change `calculate_gather_stats` to require an intersection.

TODO:
- [x] add at least one explicit test for the moltype fix

Other notes:

* there is a discrepancy between the Python (`sourmash gather`) and Rust
(`sourmash_plugin_branchwater` results) calculations for `remaining_bp`.
It seems to me like the Python one is definitely wrong; not yet sure
about Rust. Viz #3194.
@ctb
Copy link
Contributor Author

ctb commented Jun 9, 2024

This was fixed in sourmash-bio/sourmash#3193. We now need to wait for a new release of sourmash-rs core, r0.14.0, and then update Cargo.toml here and cut a new release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
1 participant