Over in #1511 (comment), I've been building new LCA databases for GTDB rs202, and (after resolving the duplicates problem in #1568 :) I was still getting some missing identifiers; for k=31, with updated identifier code based on #1542, the command yielded:
```
WARNING: 7520 duplicate signatures.
WARNING: no signatures for 7520 spreadsheet rows.
WARNING: 7520 unused identifiers.
```
In tracking this down, I realized that what was happening was that signatures that were identical at the given scaled/ksize were being discarded before their identifiers were recorded.
I'm not sure there's anything to be done about this, but I wanted to track it anyway.
```python
# block off duplicates.
if sig.md5sum() in md5_to_name:
    debug('WARNING: in file {}, duplicate md5sum: {}; skipping', filename, sig.md5sum())
    record_duplicates.add(sig.name)
    continue
```
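To illustrate the failure mode described above, here is a runnable sketch of how md5-based deduplication can drop identifiers: two records whose sketches are identical at the chosen scaled/ksize produce the same md5, so the second record is skipped before its identifier is ever recorded. The identifiers and the `md5_of_hashes` helper are hypothetical stand-ins for `sig.md5sum()`, not sourmash code.

```python
import hashlib

def md5_of_hashes(hashes):
    # Stand-in for sig.md5sum(): digest of the sorted sketch contents.
    m = hashlib.md5()
    for h in sorted(hashes):
        m.update(str(h).encode())
    return m.hexdigest()

# Hypothetical records: two genomes whose sketches are identical
# at this scaled/ksize, so their md5sums collide.
records = [
    ("GCF_000001.1", frozenset({101, 202, 303})),
    ("GCF_000002.1", frozenset({101, 202, 303})),  # identical sketch
]

md5_to_name = {}
recorded_idents = set()
for ident, hashes in records:
    md5 = md5_of_hashes(hashes)
    if md5 in md5_to_name:
        # duplicate md5: skipped *before* its identifier is recorded
        continue
    md5_to_name[md5] = ident
    recorded_idents.add(ident)

# The second identifier never makes it into the database, and later
# shows up as an "unused identifier" / missing spreadsheet row.
missing = {ident for ident, _ in records} - recorded_idents
print(missing)
```

Running this prints only the second identifier, mirroring the matching counts in the three warnings above (duplicate signatures, missing spreadsheet rows, unused identifiers).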
and I'm wondering if our thinking has evolved due to #1501 and some of the stuff with sourmash tax? We handle duplicate signatures just fine in .zip files now, for example... I'm tempted to remove this check and just say that we're ok with duplicate md5s.