Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ga4gh_identify adds digest properties in-place regardless of in_place argument #440

Open
jsstevenson opened this issue Jul 25, 2024 · 1 comment · May be fixed by #461
Open

ga4gh_identify adds digest properties in-place regardless of in_place argument #440

jsstevenson opened this issue Jul 25, 2024 · 1 comment · May be fixed by #461
Labels
bug Something isn't working question Further information is requested

Comments

@jsstevenson
Copy link
Contributor

I have a basic allele. If I manually add the VA ID, the compacted JSON dump is 323 chars:

>>> allele = Allele(location={"end": 44908822, "start": 44908821, "sequenceReference": {"id": "NC_0000019.10", "type": "SequenceReference", "refgetAccession": "SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl"}, "type": "SequenceLocation"}, state={"sequence": "T", "type":  "LiteralSequenceExpression"}, type="Allele")
>>> allele.id = "ga4gh:VA.0AePZIWZUNsUlQTamyLrjm2HWUw2opLt"
>>> len(allele.model_dump_json(exclude_none=True))
323

If I use ga4gh_identify, though, even if I set in_place to "never", digest fields are added that makes the final JSON over 400 characters:

>>> allele = Allele(location={"end": 44908822, "start": 44908821, "sequenceReference": {"id": "NC_0000019.10", "type": "SequenceReference", "refgetAccession": "SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl"}, "type": "SequenceLocation"}, state={"sequence": "T", "type":  "LiteralSequenceExpression"}, type="Allele")
>>> allele.id = ga4gh_identify(allele, "never")
>>> len(allele.model_dump_json(exclude_none=True))
411
>>> print(allele.model_dump_json(exclude_none=True, indent=2))
{
  "id": "ga4gh:VA.0AePZIWZUNsUlQTamyLrjm2HWUw2opLt",
  "type": "Allele",
  "digest": "0AePZIWZUNsUlQTamyLrjm2HWUw2opLt",
  "location": {
    "type": "SequenceLocation",
    "digest": "wIlaGykfwHIpPY2Fcxtbx4TINbbODFVz",
    "sequenceReference": {
      "id": "NC_0000019.10",
      "type": "SequenceReference",
      "refgetAccession": "SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl"
    },
    "start": 44908821,
    "end": 44908822
  },
  "state": {
    "type": "LiteralSequenceExpression",
    "sequence": "T"
  }
}

This isn't the end of the world, but an increase in size of ~25% ends up being pretty hefty in extreme cases. is it intentional/necessary for ga4gh_identify to add these digest fields in-place, or can that be taken out? Or alternatively, maybe a ga4gh_compact method could strip unset/redundant fields?

@jsstevenson jsstevenson added bug Something isn't working question Further information is requested labels Jul 25, 2024
@korikuzma
Copy link
Contributor

Since digest is optional, I think it's a bug. I think in_place should apply to both id and digest fields. @ga4gh/vrs-python-maintainers thoughts?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working question Further information is requested
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants