computed identifiers using pydantic model serializers #342

ahwagner · 2024-02-07T13:40:44Z

WIP: addresses #341, #335, #334. Uses the pydantic .model_dump_json() call to serialize for computed digests, and adds some logic to cache digests and computed identifiers in appropriate object fields when calculated.

Adds tests to validate pydantic models match data class and field names from VRS Schema.

Disables Genotypes.

Remaining work:

Address translator tests
Refactor enref / deref extras
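For readers unfamiliar with the digest caching described above, here is a minimal standard-library sketch of the idea. It is not the vrs-python implementation: `ToyLocation`, `canonical_json`, and `get_or_create_digest` are hypothetical names, `json.dumps` stands in for pydantic's `.model_dump_json()`, and the payload is not the real VRS serialization, so the resulting digests will not match real VRS identifiers. Only `sha512t24u` (the first 24 bytes of SHA-512, base64url-encoded) follows the GA4GH truncated digest definition.

```python
import base64
import hashlib
import json
from dataclasses import dataclass
from typing import Optional


def sha512t24u(blob: bytes) -> str:
    """GA4GH truncated digest: first 24 bytes of SHA-512, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(blob).digest()[:24]).decode("ascii")


@dataclass
class ToyLocation:
    """Hypothetical stand-in for an identifiable object; not a VRS model."""

    refget_accession: str
    start: int
    end: int
    digest: Optional[str] = None  # filled in the first time it is computed

    def canonical_json(self) -> bytes:
        # Assumption: sorted keys and compact separators approximate a
        # canonical form. The digest field itself is excluded from the input.
        payload = {
            "type": "SequenceLocation",
            "sequenceReference": {
                "type": "SequenceReference",
                "refgetAccession": self.refget_accession,
            },
            "start": self.start,
            "end": self.end,
        }
        return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")

    def get_or_create_digest(self) -> str:
        # Compute once, then reuse the value cached on the object.
        if self.digest is None:
            self.digest = sha512t24u(self.canonical_json())
        return self.digest


loc = ToyLocation("SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl", 289464, 289466)
print(loc.get_or_create_digest())
```

Excluding the cached `digest` from the serialized payload keeps the digest stable no matter whether it has already been computed.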

ahwagner · 2024-02-10T18:14:22Z

@theferrit32 @korikuzma @larrybabb @andreasprlic @ehclark please note this in-progress refactor of the serialization code. It is close to ready for review. The main thing I still want to do is check the translator extras test suite, after which I plan to hand this off for review.

I don't plan to address the enref/deref code refactor as part of this work, so I have disabled the tests for those methods. @theferrit32 I tagged you under "Assigned" here in case those are important to you and you want to address them as part of this PR.

larrybabb · 2024-02-14T15:31:09Z

@ahwagner have you and @ehclark discussed this and the timing to deliver it? I'm just trying to sync up on whether this is gating @ehclark's ability to move forward with VRS 2.0 on his project. If this is going to alter the digests, I believe it will.

ehclark · 2024-02-15T14:54:58Z

While updating the VCF unit tests, I ran across this. It looks problematic to me, but I wanted to run it by others before digging in too deep.

>>> from itertools import cycle
>>> def get_seq_from_rle(allele, data_proxy):
...   seqId = f"ga4gh:{allele.location.sequenceReference.refgetAccession}"
...   start = allele.location.start
...   end = start + allele.state.repeatSubunitLength
...   subseq = data_proxy.get_sequence(seqId, start, end) # sequence retrieval function, e.g. from SeqRepo
...   c = cycle(subseq)
...   derivedseq = ''
...   for i in range(allele.state.length):
...     derivedseq += next(c)
...   return derivedseq
... 
>>> allele = tlr._from_hgvs('NC_000019.10:g.289464_289465insCACGCCTGTAATCC')
>>> allele.model_dump(exclude_none=True)
{'id': 'ga4gh:VA.LqwjK2sadi1_E3bedaZJxatGrCGK8qV3', 'type': 'Allele', 'digest': 'LqwjK2sadi1_E3bedaZJxatGrCGK8qV3', 'location': {'id': 'ga4gh:SL.L145KFLJeJ334YnOVm59pPlbdqfHhgXZ', 'type': 'SequenceLocation', 'digest': 'L145KFLJeJ334YnOVm59pPlbdqfHhgXZ', 'sequenceReference': {'type': 'SequenceReference', 'refgetAccession': 'SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl'}, 'start': 289464, 'end': 289466}, 'state': {'type': 'ReferenceLengthExpression', 'length': 16, 'sequence': 'CACGCCTGTAATCCCA', 'repeatSubunitLength': 14}}
>>> get_seq_from_rle(allele, data_proxy)
'CAGCACTTTGGGAGCA'

It seems to me that the sequence returned by get_seq_from_rle should contain the inserted bases, but it does not.

This is the reference sequence starting at position 289450:

289450 cccggcgtggtggctcagcactttgggaggccgaggcgggcagatcacga

This is the sequence with the insertion as defined by the HGVS expression:

289450 cccggcgtggtggctCACGCCTGTAATCCcagcactttgggaggccgaggcgggcagatcacga

This is the sequence derived from the RLE:

289450 cccggcgtggtggctCAGCACTTTGGGAGCAgcactttgggaggccgaggcgggcagatcacga

Another example:

>>> allele = tlr._from_hgvs('NC_000019.10:g.289464_289465insTTTTTT')
>>> allele.model_dump(exclude_none=True)
{'id': 'ga4gh:VA.JAv2mBwljFih5BYOikHyRpqQER1rGzet', 'type': 'Allele', 'digest': 'JAv2mBwljFih5BYOikHyRpqQER1rGzet', 'location': {'id': 'ga4gh:SL.qwpto8M7ZkWFmY_-8LpUuihrIwG-VqsJ', 'type': 'SequenceLocation', 'digest': 'qwpto8M7ZkWFmY_-8LpUuihrIwG-VqsJ', 'sequenceReference': {'type': 'SequenceReference', 'refgetAccession': 'SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl'}, 'start': 289463, 'end': 289464}, 'state': {'type': 'ReferenceLengthExpression', 'length': 7, 'sequence': 'TTTTTTT', 'repeatSubunitLength': 6}}
>>> get_seq_from_rle(allele, srp)
'TCAGCAT'
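Both results can be reproduced without SeqRepo, using only the reference window quoted above. The sketch below is an illustration, not library code: `fetch` and `derive_rle_sequence` are hypothetical helpers, and it assumes the window's first base is the 1-based position 289450 while the allele coordinates are inter-residue, as in the dumps above.

```python
from itertools import cycle, islice

# Reference bases quoted in this comment; the first base is 1-based position 289450.
WINDOW = "cccggcgtggtggctcagcactttgggaggccgaggcgggcagatcacga".upper()
WINDOW_START = 289450


def fetch(start: int, end: int) -> str:
    """Return reference bases for the inter-residue interval [start, end)."""
    offset = WINDOW_START - 1  # inter-residue coordinate of the window's left edge
    return WINDOW[start - offset:end - offset]


def derive_rle_sequence(start: int, length: int, repeat_subunit_length: int) -> str:
    """Cycle the reference subunit at `start` until `length` bases are produced."""
    subunit = fetch(start, start + repeat_subunit_length)
    return "".join(islice(cycle(subunit), length))


# insCACGCCTGTAATCC: start=289464, length=16, repeatSubunitLength=14
print(derive_rle_sequence(289464, 16, 14))  # CAGCACTTTGGGAGCA
# insTTTTTT: start=289463, length=7, repeatSubunitLength=6
print(derive_rle_sequence(289463, 7, 6))  # TCAGCAT
```

Neither derived string equals the `sequence` stored on the corresponding allele (`CACGCCTGTAATCCCA` and `TTTTTTT`), which is the mismatch reported here.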

ahwagner · 2024-02-15T15:35:34Z

Great catch, agreed that this needs to be addressed. I don't think we accounted for this case (ambiguous insertion of non-reference-derived sequence) in our RLE normalization logic. I'll take a look at this tomorrow.


ehclark · 2024-02-15T20:57:48Z

> Great catch, agreed that this needs to be addressed. I don't think we accounted for this case (ambiguous insertion of non-reference-derived sequence) in our RLE normalization logic. I'll take a look at this tomorrow.

I went ahead and implemented a change so that LSE is now used for ambiguous insertions of bases that are not derived from the reference. I could not think of a better method than running the derivation logic and comparing the result. It's not great from a performance standpoint, so if you have a better idea, @ahwagner, please go ahead and improve on it.
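A rough sketch of the "derive and compare" check described above, under stated assumptions: `derive_from_reference` and `choose_state` are hypothetical names, the reference is a plain string indexed with 0-based offsets, and the returned dicts only mirror the field names seen in the dumps earlier in this thread. It is not the code from the commit.

```python
from itertools import cycle, islice


def derive_from_reference(reference: str, start: int, length: int, subunit_length: int) -> str:
    """Expand the reference subunit at `start` to `length` bases."""
    subunit = reference[start:start + subunit_length]
    return "".join(islice(cycle(subunit), length))


def choose_state(reference: str, start: int, sequence: str, subunit_length: int) -> dict:
    """Use an RLE only if cycling the reference subunit reproduces `sequence`."""
    derived = derive_from_reference(reference, start, len(sequence), subunit_length)
    if derived == sequence:
        return {
            "type": "ReferenceLengthExpression",
            "length": len(sequence),
            "sequence": sequence,
            "repeatSubunitLength": subunit_length,
        }
    # Novel bases cannot be recovered from the reference, so keep them literally.
    return {"type": "LiteralSequenceExpression", "sequence": sequence}


# Toy reference with a CA repeat at offsets 2..8; gaining one CA unit is reference-derived.
toy = "GGCACACATT"
print(choose_state(toy, 2, "CACACACA", 2)["type"])

# The insCACGCCTGTAATCC case from this thread; offset 15 is inter-residue 289464.
window = "cccggcgtggtggctcagcactttgggaggccgaggcgggcagatcacga".upper()
print(choose_state(window, 15, "CACGCCTGTAATCCCA", 14)["type"])
```

The comparison costs one extra reference fetch and one string expansion per ambiguous insertion, which is the performance concern raised above.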


ahwagner · 2024-02-19T07:42:30Z

This is ready for hand-off. I updated the VRS normalization algorithm (ga4gh/vrs@7871872) to account for ambiguous novel sequence insertions. Most tests are passing, though there are some failures in the extras/test_vcf_annotation.py module that should be looked at.

@theferrit32 and @korikuzma please discuss and assign investigation of those test failures, after which this PR is ready for review.

korikuzma · 2024-02-19T12:36:49Z

@theferrit32 I'm fine wrapping this up if you don't have time / want to. Just let me know!

ahwagner · 2024-02-19T13:52:54Z

I was overthinking this and want to make another edit. Converting to draft for a bit while I work on it.

ahwagner · 2024-02-19T16:17:20Z

Alright, ball is back in your court @theferrit32 and @korikuzma. Implemented, up-to-date algo is here. Still need someone to look at the test_vcf_annotation failures.

larrybabb

+1 @ahwagner Should we wait until the test_vcf_annotation failures are corrected before merging? If not, I'll merge this PR.

korikuzma · 2024-02-19T22:50:30Z

> +1 @ahwagner Should we wait until the test_vcf_annotation failures are corrected before merging? If not, I'll merge this PR.

@larrybabb Correct. Wait to merge until tests are passing. We should add this to our branch protection rules.

ahwagner · 2024-02-20T13:14:48Z

@korikuzma I just added branch protections that require passing status checks for all tests.

korikuzma · 2024-02-20T19:55:22Z

@ahwagner can you remind me why we are disabling genotype?

submodules/vrs


korikuzma · 2024-02-21T13:00:48Z

@ahwagner tests are passing. In the test files I changed, `VRS_Error=Expected reference sequence A on GRCh38:chr19 at positions (54220998%2C 54220999) but found C` was changed to `VRS_Error=Expected reference sequence A on GRCh38:chr19 at positions (54220998%2C 54220999) but found T`.

>>> from biocommons.seqrepo import SeqRepo
>>> sr = SeqRepo(root_dir="/usr/local/share/seqrepo/latest")
>>> sr["NC_000019.10"][54220999-1]
'T'
>>> sr["NC_000019.10"][54220999]
'C'
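The two lookups above differ only by the 1-based to 0-based conversion. As a hedged illustration of that check, the sketch below uses a toy reference string; `check_vcf_ref` is a hypothetical helper, and its message only imitates the `VRS_Error` text quoted above.

```python
from typing import Optional


def check_vcf_ref(reference: str, pos: int, ref: str) -> Optional[str]:
    """Compare a VCF REF allele to the reference; `pos` is the 1-based VCF POS."""
    start = pos - 1  # 0-based offset of the first REF base
    end = start + len(ref)
    found = reference[start:end]
    if found == ref:
        return None
    return f"Expected reference sequence {ref} at positions ({start}, {end}) but found {found}"


toy_reference = "GATTACA"
print(check_vcf_ref(toy_reference, 4, "T"))  # matches, so None
print(check_vcf_ref(toy_reference, 4, "A"))  # mismatch message reporting T
print(toy_reference[4])  # indexing with POS itself reads the next base, A
```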

larrybabb

+1 Thanks @korikuzma

larrybabb · 2024-02-21T13:44:19Z

@ehclark sorry for the delays. This is the version of vrs-python we will be moving forward with. We assume there will be no more foreseeable changes to the digest for 2.0 (but there are no guarantees of course).

ahwagner · 2024-02-21T13:56:25Z

@larrybabb @ehclark we don't foresee any changes to Allele or SequenceLocation digests. However, Haplotype, Adjacency, and other structures under discussion for the SV work are still likely to change.

ahwagner added 11 commits February 6, 2024 19:14

move digest from VO to GA4GH identifiable

8d23685

add ga4gh serialize to models

c63af59

remove computed field and add get_or_create

c2436f8

attribute ordering fixes

89e256c

add tests for schema to pydantic matching

6b789a2

update is_identifiable

a173cf4

update model validations

5e38b61

refactor identifier code

9ccef4b

fix IRI behavior when serialized alone

3674042

restore haplotype as unordered List with Serializer override

9ab2026

add context control support for in-place edits

f0402ea

ahwagner changed the title ~~Issue 341~~ Embed computed digests and serialization code in pydantic models Feb 10, 2024

ahwagner changed the title ~~Embed computed digests and serialization code in pydantic models~~ computed identifiers as pydantic model serializers Feb 10, 2024

ahwagner changed the title ~~computed identifiers as pydantic model serializers~~ computed identifiers using pydantic model serializers Feb 10, 2024

ahwagner assigned theferrit32 Feb 10, 2024

ehclark added 2 commits February 15, 2024 15:52

Use LiteralSequenceExpression for ambiguous insertions that cannot be…

f407fd0

… derived from the reference sequence

Update VCF unit tests to match new digest logic

694ce86

ahwagner added 5 commits February 16, 2024 15:30

update digests and message structure for trx test

13a45ac

Revert "Use LiteralSequenceExpression for ambiguous insertions that c…

e15837a

…annot be derived from the reference sequence" This reverts commit f407fd0.

remove unnecessary try/except

1ff0695

check insertions for ambiguous novel sequence

96d98cd

update test cassettes

9806784

ahwagner marked this pull request as ready for review February 19, 2024 07:38

ahwagner requested review from a team as code owners February 19, 2024 07:38

Merge branch 'main' into issue-341

ae02cc5

ahwagner marked this pull request as draft February 19, 2024 13:51

ahwagner added 2 commits February 19, 2024 10:58

restore use of VOCA seed for RSL

b51fb6c

add TODO

8faaf00

ahwagner marked this pull request as ready for review February 19, 2024 16:10

larrybabb approved these changes Feb 19, 2024


korikuzma reviewed Feb 20, 2024


submodules/vrs

ahwagner mentioned this pull request Feb 20, 2024

Disabling Genotype #349

Open

test: fix vcf annotation test

7cbae0d

chr19-54220999-A-A had that `C` was the actual ref seq, but it should have been `T`

larrybabb approved these changes Feb 21, 2024


larrybabb merged commit 64fee4c into main Feb 21, 2024
8 checks passed

larrybabb deleted the issue-341 branch February 21, 2024 13:42

This was referenced Feb 22, 2024

Fix Enref behavior for haplotypes #335

Open

Standardize serialize behavior for SequenceReference #341

Closed

Normalization extensions for VRS 2.x #334

Open

ehclark mentioned this pull request Feb 22, 2024

Repeat subunit length not always correct #351

Closed

This was referenced Mar 1, 2024

Update unit tests to match latest serialization model for VRS 2.0.0a5 biocommons/anyvar#85

Closed

update enref / deref #356

Closed

korikuzma mentioned this pull request Mar 18, 2024

Include digest field in response #251

Closed
