sourmash lca index eliminates identical signatures before recording their identifiers #1573

Open
ctb opened this issue Jun 6, 2021 · 1 comment

ctb commented Jun 6, 2021

Over in #1511 (comment), I've been building new LCA databases for GTDB rs202. After resolving the duplicates problem in #1568, I was still getting some missing identifiers. For k=31, with updated identifier code based on #1542, the following command:

sourmash lca index ../../gtdb-rs202.taxonomy.v2.csv FOO-gtdb-rs202.genomic.k31.lca.json.gz \
       gtdb-rs202.genomic.k31.sbt.zip --scaled=10000 --require-taxonomy \
      --fail-on --split-identifier -k 31 --report REPORT.31.txt

yielded:

WARNING: 7520 duplicate signatures.
WARNING: no signatures for 7520 spreadsheet rows.
WARNING: 7520 unused identifiers.

In tracking this down, I realized that signatures that were identical at the given scaled/ksize were being discarded before their identifiers were recorded.

I'm not sure there's anything to be done about this but I wanted to track it, anyway.
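A minimal sketch of the ordering problem described above. It does not use sourmash itself: `ToySig`, `index_skip_first`, and the identifiers and md5 strings are all made up for illustration, and the loop only imitates the "check for duplicates first, record the identifier afterwards" order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySig:
    # Hypothetical stand-in for a signature; NOT the sourmash class.
    name: str  # identifier that should match a taxonomy spreadsheet row
    md5: str   # content hash at a given ksize/scaled


def index_skip_first(sigs):
    """Check for duplicate content first, record the identifier afterwards."""
    md5_to_name = {}
    seen_idents = set()
    duplicates = set()
    for sig in sigs:
        if sig.md5 in md5_to_name:
            duplicates.add(sig.name)
            continue  # skipped before the identifier is recorded
        md5_to_name[sig.md5] = sig.name
        seen_idents.add(sig.name)
    return seen_idents, duplicates


# Two genomes with identical content at this resolution, plus one distinct one.
sigs = [ToySig("GCA_1", "aaa"), ToySig("GCA_2", "aaa"), ToySig("GCA_3", "bbb")]
taxonomy_idents = {"GCA_1", "GCA_2", "GCA_3"}

seen, dups = index_skip_first(sigs)
missing = taxonomy_idents - seen  # taxonomy rows that appear to have no signature
```

In this toy model every skipped duplicate also becomes a taxonomy row with no matching signature, which would be consistent with the three warnings above reporting the same count.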

ctb commented Mar 31, 2022

For some reason I came across this issue again and decided to check: yep, it still happens.

The problem is here: https://github.com/sourmash-bio/sourmash/blob/latest/src/sourmash/lca/command_index.py#L209

            # block off duplicates.
            if sig.md5sum() in md5_to_name:
                debug('WARNING: in file {}, duplicate md5sum: {}; skipping', filename, sig.md5sum())
                record_duplicates.add(sig.name)
                continue

and I'm wondering if our thinking has evolved due to #1501 and some of the stuff with sourmash tax? We handle duplicate signatures just fine in .zip files now, for example... I'm tempted to remove this check and just say that we're ok with duplicate md5s.
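One way to picture the "ok with duplicate md5s" option is a loop that records every identifier before looking at the md5 at all. This is only a sketch under assumptions: `index_record_first` is a hypothetical helper, the input is a list of made-up `(name, md5)` pairs rather than real signatures, and it is not the actual `command_index.py` code or a proposed patch.

```python
def index_record_first(sigs):
    """Record every identifier, and keep all names that share an md5."""
    md5_to_names = {}
    seen_idents = set()
    for name, md5 in sigs:
        seen_idents.add(name)  # identifier is recorded unconditionally
        md5_to_names.setdefault(md5, []).append(name)
    return seen_idents, md5_to_names


# Hypothetical (name, md5) pairs; the first two have identical content.
pairs = [("GCA_1", "aaa"), ("GCA_2", "aaa"), ("GCA_3", "bbb")]
all_seen, by_md5 = index_record_first(pairs)

# Names sharing an md5 can still be reported, but none are dropped.
shared = {md5: names for md5, names in by_md5.items() if len(names) > 1}
```

With this ordering, identical signatures no longer produce missing identifiers, and the duplicates remain available for a warning or report if one is still wanted.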
