Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

computed identifiers using pydantic model serializers #342

Merged
merged 22 commits into from
Feb 21, 2024
Merged

Conversation

ahwagner
Copy link
Member

@ahwagner ahwagner commented Feb 7, 2024

WIP: addresses #341, #335, #334. Uses the pydantic .model_dump_json() call to serialize for computed digests, and adds some logic to cache digests and computed identifiers in appropriate object fields when calculated.

Adds tests to validate pydantic models match data class and field names from VRS Schema.

Disables Genotypes.

Remaining work:

  • Address translator tests
  • Refactor enref / deref extras

@ahwagner ahwagner changed the title Issue 341 Embed computed digests and serialization code in pydantic models Feb 10, 2024
@ahwagner ahwagner changed the title Embed computed digests and serialization code in pydantic models computed identifiers as pydantic model serializers Feb 10, 2024
@ahwagner ahwagner changed the title computed identifiers as pydantic model serializers computed identifiers using pydantic model serializers Feb 10, 2024
@ahwagner
Copy link
Member Author

@theferrit32 @korikuzma @larrybabb @andreasprlic @ehclark please note this in-progress refactor of the serialization code. It is close to ready for review, the primary thing I want to address is checking the translator extras test suite, after which I plan to hand this off for review.

I don't plan to address the enref/deref code refactor as part of this work so have disabled the tests for those methods. @theferrit32 I tagged you under "Assigned" here in case those are important to you and you want to address them as part of this PR.

@larrybabb
Copy link
Contributor

@ahwagner have you and @ehclark discussed this and the timing to deliver it? I'm just trying to sync up on whether this is gating @ehclark's ability to move forward with VRS 2.0 on his project. If this is going to alter the digests I believe it will.

@ehclark
Copy link
Contributor

ehclark commented Feb 15, 2024

Working on updating VCF unit tests I ran across this. This looks problematic to me, but wanted to run it by others before I dig in too deep.

>>> def get_seq_from_rle(allele, data_proxy):
...   seqId = f"ga4gh:{allele.location.sequenceReference.refgetAccession}"
...   start = allele.location.start
...   end = start + allele.state.repeatSubunitLength
...   subseq = data_proxy.get_sequence(seqId, start, end) # sequence retrieval function, e.g. from SeqRepo
...   c = cycle(subseq)
...   derivedseq = ''
...   for i in range(allele.state.length):
...     derivedseq += next(c)
...   return derivedseq
... 
>>> allele = tlr._from_hgvs('NC_000019.10:g.289464_289465insCACGCCTGTAATCC')
>>> allele.model_dump(exclude_none=True)
{'id': 'ga4gh:VA.LqwjK2sadi1_E3bedaZJxatGrCGK8qV3', 'type': 'Allele', 'digest': 'LqwjK2sadi1_E3bedaZJxatGrCGK8qV3', 'location': {'id': 'ga4gh:SL.L145KFLJeJ334YnOVm59pPlbdqfHhgXZ', 'type': 'SequenceLocation', 'digest': 'L145KFLJeJ334YnOVm59pPlbdqfHhgXZ', 'sequenceReference': {'type': 'SequenceReference', 'refgetAccession': 'SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl'}, 'start': 289464, 'end': 289466}, 'state': {'type': 'ReferenceLengthExpression', 'length': 16, 'sequence': 'CACGCCTGTAATCCCA', 'repeatSubunitLength': 14}}
>>> get_seq_from_rle(allele, data_proxy)
'CAGCACTTTGGGAGCA'

It seems to me that the sequence returned by get_seq_from_rle should contain the inserted bases, but it does not.

This is the reference sequence starting at position 289450

289450 cccggcgtggtggctcagcactttgggaggccgaggcgggcagatcacga

This is the sequence with the insertion as defined by the HGVS expression

289450 cccggcgtggtggctCACGCCTGTAATCCcagcactttgggaggccgaggcgggcagatcacga

This is the sequence derived from the RLE

289450 cccggcgtggtggctCAGCACTTTGGGAGCAgcactttgggaggccgaggcgggcagatcacga

Another example:

>>> allele = tlr._from_hgvs('NC_000019.10:g.289464_289465insTTTTTT')
>>> allele.model_dump(exclude_none=True)
{'id': 'ga4gh:VA.JAv2mBwljFih5BYOikHyRpqQER1rGzet', 'type': 'Allele', 'digest': 'JAv2mBwljFih5BYOikHyRpqQER1rGzet', 'location': {'id': 'ga4gh:SL.qwpto8M7ZkWFmY_-8LpUuihrIwG-VqsJ', 'type': 'SequenceLocation', 'digest': 'qwpto8M7ZkWFmY_-8LpUuihrIwG-VqsJ', 'sequenceReference': {'type': 'SequenceReference', 'refgetAccession': 'SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl'}, 'start': 289463, 'end': 289464}, 'state': {'type': 'ReferenceLengthExpression', 'length': 7, 'sequence': 'TTTTTTT', 'repeatSubunitLength': 6}}
>>> get_seq_from_rle(allele, srp)
'TCAGCAT'

@ahwagner
Copy link
Member Author

Great catch, agreed that this needs to be addressed. I don't think we accounted for this case (ambiguous insertion of non-referenced-derived sequence) in our RLE normalization logic. I'll take a look at this tomorrow.

@ehclark
Copy link
Contributor

ehclark commented Feb 15, 2024

Great catch, agreed that this needs to be addressed. I don't think we accounted for this case (ambiguous insertion of non-referenced-derived sequence) in our RLE normalization logic. I'll take a look at this tomorrow.

I went ahead and implemented a change so that LSE is now used for ambiguous insertion of non-reference derived bases. I could not think of a better method than just running the derivation logic and comparing. Its not great from a performance standpoint, so if you have a better idea @ahwagner please go ahead and improve on things.

@ahwagner ahwagner marked this pull request as ready for review February 19, 2024 07:38
@ahwagner ahwagner requested review from a team as code owners February 19, 2024 07:38
@ahwagner
Copy link
Member Author

This is ready for hand-off. I updated the VRS normalization algorithm (ga4gh/vrs@7871872) to account for ambiguous novel sequence insertions. Most tests are passing, though there are some failures in the extras/test_vcf_annotation.py module that should be looked at.

@theferrit32 and @korikuzma please discuss and assign investigation of those test failures, after which this PR is ready for review.

@korikuzma
Copy link
Contributor

@theferrit32 I'm fine wrapping this up if you don't have time / want to. Just let me know!

@ahwagner ahwagner marked this pull request as draft February 19, 2024 13:51
@ahwagner
Copy link
Member Author

I was overthinking this and want to make another edit. Converting to draft for a bit while I work on it.

@ahwagner ahwagner marked this pull request as ready for review February 19, 2024 16:10
@ahwagner
Copy link
Member Author

Alright, ball is back in your court @theferrit32 and @korikuzma. Implemented, up-to-date algo is here. Still need someone to look at the test_vcf_annotation failures.

Copy link
Contributor

@larrybabb larrybabb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 @ahwagner Should wait until test_vcf_annotation failures are corrected before merging? If not. I'll merge this PR.

@korikuzma
Copy link
Contributor

korikuzma commented Feb 19, 2024

+1 @ahwagner Should wait until test_vcf_annotation failures are corrected before merging? If not. I'll merge this PR.

@larrybabb Correct. Wait to merge until tests are passing. We should add this in our branch protection rules.

@ahwagner
Copy link
Member Author

@korikuzma I just added branch protections to require status pass for all tests.

@korikuzma
Copy link
Contributor

@ahwagner can you remind me why we are disabling genotype?

@ahwagner ahwagner mentioned this pull request Feb 20, 2024
chr19-54220999-A-A had that `C` was the actual ref seq, but it should
have been `T`
@korikuzma
Copy link
Contributor

@ahwagner tests are passing. In the tests files I changed, VRS_Error=Expected reference sequence A on GRCh38:chr19 at positions (54220998%2C 54220999) but found C was changed to VRS_Error=Expected reference sequence A on GRCh38:chr19 at positions (54220998%2C 54220999) but found T

>>> from biocommons.seqrepo import SeqRepo
>>> sr = SeqRepo(root_dir="/usr/local/share/seqrepo/latest")
>>> sr["NC_000019.10"][54220999-1]
'T'
>>> sr["NC_000019.10"][54220999]
'C'

Copy link
Contributor

@larrybabb larrybabb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 Thanks @korikuzma

@larrybabb larrybabb merged commit 64fee4c into main Feb 21, 2024
8 checks passed
@larrybabb larrybabb deleted the issue-341 branch February 21, 2024 13:42
@larrybabb
Copy link
Contributor

@ehclark sorry for the delays. This is the version of vrs-python we will be moving forward with. We assume there will be no more foreseeable changes to the digest for 2.0 (but there are no guarantees of course).

@ahwagner
Copy link
Member Author

@larrybabb @ehclark we don't foresee any changes to Allele or SequenceLocation digests. However, Haplotype, Adjacency, and other structures under discussion for the SV work are still likely to change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants