can we just use the hash function to flag incompatible signatures, instead of DNA/protein/etc? #751

Open
ctb opened this issue Oct 21, 2019 · 4 comments
Labels
5.0 issues to address for a 5.0 release

@ctb
Contributor

ctb commented Oct 21, 2019

here's a random thought for @olgabot @luizirber in particular --

right now our signatures contain entries for

"hash_function": "0.murmur64",
"molecule": "DNA",

where signatures can be flagged as incompatible due to hash function OR molecule type (or other things, like ksize). When @olgabot added dayhoff encoding, it added a whole bunch more possible incompatibilities (I'm not sure this is saved in the signature JSON currently, though). And we're hoping to add more such things in the future, with skip-mers and other approaches.

So... I was thinking we could proliferate hash functions instead of molecule types and similar flags.

the idea would be to create hash_function enum values like

0.murmur64.DNA
0.murmur64.protein
0.murmur64.dayhoff

that would encode these features.

Then we could get rid of specific molecule type/other flag checks in the signatures and MinHash objects.
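
A minimal sketch of what such an enum could look like, assuming a plain Python `Enum`. The class name `HashFunction`, its member names, and the `is_compatible` helper are hypothetical and not part of sourmash:

```python
from enum import Enum


class HashFunction(Enum):
    """Hypothetical enum: one member per (hash, encoding) combination."""

    MURMUR64_DNA = "0.murmur64.DNA"
    MURMUR64_PROTEIN = "0.murmur64.protein"
    MURMUR64_DAYHOFF = "0.murmur64.dayhoff"


def is_compatible(a: HashFunction, b: HashFunction) -> bool:
    # One identity check stands in for separate hash-function and
    # molecule-type checks.
    return a is b


# Enum lookup by value, as it might happen when reading signature JSON.
dna = HashFunction("0.murmur64.DNA")
protein = HashFunction("0.murmur64.protein")

print(is_compatible(dna, HashFunction.MURMUR64_DNA))  # True
print(is_compatible(dna, protein))  # False
```

An unknown string such as `HashFunction("0.murmur64.rna")` raises `ValueError`, so unsupported combinations would be rejected at load time rather than at comparison time.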

Thoughts?

@luizirber
Member

I like this idea, especially if we do an actual enum type instead of string-typing new hash functions. Are you planning to do a RFC branch, or should I?

@luizirber
Member

luizirber commented Oct 28, 2019

A complication: molecule and hash_function exist in different places in a signature.

[
  {
    "class": "sourmash_signature",
    "email": "",
    "hash_function": "0.murmur64",
    "license": "CC0",
    "signatures": [
      {
        "ksize": 20,
        "max_hash": 0,
        "md5sum": "98f13708210194c475687be6106a3b84",
        "mins": [],
        "molecule": "DNA",
        "num": 1,
        "seed": 42
      }
    ],
    "version": 0.4
  }
]

This complicates having DNA and protein sigs in the same file. Ideally both fields should be at the same level:

[
  {
    ...
    "signatures": [
      {
        "hash_function": "0.murmur64",
        "molecule": "DNA",
        ...
      }
    ],
    "version": 0.4
  }
]

... and be only one field:

[
  {
    ...
    "signatures": [
      {
        ...
        "hash_function": "0.murmur64_DNA"
      }
    ],
    "version": 0.4
  }
]
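
As a rough illustration of the two steps above (move `hash_function` into each signature, then fold `molecule` into it), here is a sketch in Python. The helper name `merge_hash_function` is hypothetical, and the `<hash>_<molecule>` naming simply follows the example above:

```python
import copy


def merge_hash_function(records):
    """Return a copy of `records` with one combined field per signature.

    Hypothetical helper, not part of sourmash. Assumes every record has a
    top-level "hash_function" and every signature has a "molecule".
    """
    migrated = copy.deepcopy(records)  # leave the caller's data untouched
    for record in migrated:
        base = record.pop("hash_function")
        for sig in record["signatures"]:
            molecule = sig.pop("molecule")
            sig["hash_function"] = f"{base}_{molecule}"
    return migrated


old = [
    {
        "class": "sourmash_signature",
        "hash_function": "0.murmur64",
        "signatures": [
            {"ksize": 20, "molecule": "DNA", "num": 1, "seed": 42},
        ],
        "version": 0.4,
    }
]

new = merge_hash_function(old)
print(new[0]["signatures"][0]["hash_function"])  # 0.murmur64_DNA
```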

Going even further, I would like to see sourmash signatures become an object at the top level (instead of a list), with different sketches added to the old signatures field (which becomes sketches). The idea is that you have one signature per original dataset/file/URL, but multiple sketches representing different aspects of the original data. Something like:

  {
    "class": "sourmash_signature",
    "license": "CC0",
    "filename": "path-to-original-file",
    "ipfs_hash": "optional, but would be cool, huh?",
    "sketches": [
      {
        ...
        "hash_function": "0.murmur64_DNA",
        "sketch_type": "minhash"
      },
      {
        ...
        "hash_function": "0.nthash_DNA",
        "sketch_type": "draff"
      },
      {
        ...
        "hash_function": "0.murmur64_DNA",
        "sketch_type": "hll"
      }
    ],
    "version": 1
  }
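
One way to model this "one signature, many sketches" layout in code is with dataclasses. This is only a sketch: the class names `Signature` and `Sketch` and the `select` method are assumptions for illustration, and the JSON `class` key is left out because `class` is a reserved word in Python:

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Sketch:
    hash_function: str
    sketch_type: str


@dataclass
class Signature:
    """One signature per original dataset, holding any number of sketches."""

    filename: str
    license: str = "CC0"
    sketches: list = field(default_factory=list)
    version: int = 1

    def select(self, sketch_type):
        # Hypothetical convenience: pick the sketches of one type.
        return [s for s in self.sketches if s.sketch_type == sketch_type]


sig = Signature(filename="path-to-original-file")
sig.sketches.append(Sketch("0.murmur64_DNA", "minhash"))
sig.sketches.append(Sketch("0.nthash_DNA", "draff"))
sig.sketches.append(Sketch("0.murmur64_DNA", "hll"))

print(len(sig.select("minhash")))  # 1
print(json.dumps(asdict(sig), indent=2))
```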

Relevant issues:

Relevant PRs:

@luizirber
Member

Also:

  • move signature saving/loading code to something closer to what SBT does, where you can still load old versions but always save in the newest one. Especially since almost all current sigs are version 0.4...
  • stop with the float/string version, let's stick with integers =]
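
A sketch of the "load any old version, save the newest" pattern from the first bullet, assuming a table of per-version upgrade functions. All names here (`CURRENT_VERSION`, `UPGRADERS`, `load_record`, `_upgrade_0_4`) are hypothetical, and the mapping from version 0.4 to the object layout proposed earlier in this thread is a guess at what such an upgrade might do:

```python
CURRENT_VERSION = 1  # integer versions from here on


def _upgrade_0_4(record):
    # Assumed mapping: fold the top-level hash_function and the
    # per-signature molecule into one field on each sketch.
    base = record["hash_function"]
    sketches = []
    for sig in record["signatures"]:
        sketch = {k: v for k, v in sig.items() if k != "molecule"}
        sketch["hash_function"] = f"{base}_{sig['molecule']}"
        sketches.append(sketch)
    return {
        "class": record["class"],
        "license": record["license"],
        "sketches": sketches,
        "version": 1,
    }


# Keys are strings so that 0.4 (float) and "0.4" (string) both match.
UPGRADERS = {"0.4": _upgrade_0_4}


def load_record(record):
    """Accept any known version and return the newest layout."""
    while record["version"] != CURRENT_VERSION:
        key = str(record["version"])
        if key not in UPGRADERS:
            raise ValueError(f"unsupported version: {record['version']!r}")
        record = UPGRADERS[key](record)
    return record


old = {
    "class": "sourmash_signature",
    "hash_function": "0.murmur64",
    "license": "CC0",
    "signatures": [{"ksize": 20, "molecule": "DNA", "num": 1, "seed": 42}],
    "version": 0.4,
}

new = load_record(old)
print(new["version"])  # 1
print(new["sketches"][0]["hash_function"])  # 0.murmur64_DNA
```

Because saving only ever writes `CURRENT_VERSION`, each future format change would need just one new entry in the upgrade table.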

@olgabot
Collaborator

olgabot commented Nov 2, 2019

Sorry I missed this earlier! I like @luizirber's suggestion of keeping the hash function in each individual signature. Would the k-mer size and scale/num be included in this as well, since those also influence the compatibility of comparing signatures?
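
To make the question concrete, one option is a single "compatibility key" built from every field that has to match. The function name `compatibility_key` is hypothetical, the field names are taken from the JSON earlier in this thread, and requiring exact equality on `num` and `max_hash` is an assumption; a looser rule for those two may be preferable:

```python
def compatibility_key(sketch):
    """Hypothetical: collect the fields that must agree before comparing."""
    return (
        sketch["hash_function"],
        sketch["ksize"],
        sketch["seed"],
        sketch["num"],
        sketch["max_hash"],
    )


a = {
    "hash_function": "0.murmur64_DNA",
    "ksize": 21,
    "seed": 42,
    "num": 500,
    "max_hash": 0,
}
b = dict(a, ksize=31)  # same sketch parameters, different k-mer size

print(compatibility_key(a) == compatibility_key(dict(a)))  # True
print(compatibility_key(a) == compatibility_key(b))  # False
```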

@luizirber luizirber added the 4.0 issues to address for a 4.0 release label Jun 8, 2020
@luizirber luizirber mentioned this issue Aug 19, 2020
@ctb ctb added 5.0 issues to address for a 5.0 release and removed 4.0 issues to address for a 4.0 release labels Feb 4, 2021
@luizirber luizirber added this to the 5.0 milestone Dec 1, 2023