Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Repeat subunit length not always correct #351

Closed
ehclark opened this issue Feb 22, 2024 · 1 comment · Fixed by #352
Closed

Repeat subunit length not always correct #351

ehclark opened this issue Feb 22, 2024 · 1 comment · Fixed by #352
Assignees
Labels
2.0-alpha Issues related to VRS 2.0-alpha branch bug Something isn't working priority:high High priority

Comments

@ehclark
Copy link
Contributor

ehclark commented Feb 22, 2024

It would appear that in the course of completing #342, an important line of code got removed related to correct handling for ambiguous insertions.

The line of code repeat_subunit_length = math.gcd(len_extended_ref, len_extended_alt) in src/ga4gh/vrs/normalize.py was added in 96d98cd, but then dropped in b51fb6c.

Without this line, the repeatSubunitLength is not always correct.

Example:

>>> def get_seq_from_rle(allele, data_proxy):
...     seqId = f"ga4gh:{allele.location.sequenceReference.refgetAccession}"
...     start = allele.location.start
...     end = start + allele.state.repeatSubunitLength
...     subseq = data_proxy.get_sequence(seqId, start, end) # sequence retrieval function, e.g. from SeqRepo
...     c = cycle(subseq)
...     derivedseq = ''
...     for i in range(allele.state.length):
...         derivedseq += next(c)
...     return derivedseq
... 
>>> 
>>> tlr = AlleleTranslator(data_proxy=dp, default_assembly_name="GRCh38", identify=True, normalize=True)
>>> allele = tlr.translate_from('NC_000019.10:g.289464_289465insCACA', 'hgvs')
>>> allele.model_dump(exclude_none=True)
{'id': 'ga4gh:VA.6WEakVj5V1rTYtbNX8Xqx4bMN4pMA6fh', 'type': 'Allele', 'digest': '6WEakVj5V1rTYtbNX8Xqx4bMN4pMA6fh', 'location': {'id': 'ga4gh:SL.L145KFLJeJ334YnOVm59pPlbdqfHhgXZ', 'type': 'SequenceLocation', 'digest': 'L145KFLJeJ334YnOVm59pPlbdqfHhgXZ', 'sequenceReference': {'type': 'SequenceReference', 'refgetAccession': 'SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl'}, 'start': 289464, 'end': 289466}, 'state': {'type': 'ReferenceLengthExpression', 'length': 6, 'sequence': 'CACACA', 'repeatSubunitLength': 4}}
>>> get_seq_from_rle(allele, dp)
'CAGCCA'
@ehclark ehclark self-assigned this Feb 22, 2024
@ehclark ehclark added bug Something isn't working 2.0-alpha Issues related to VRS 2.0-alpha branch priority:high High priority labels Feb 22, 2024
@ehclark
Copy link
Contributor Author

ehclark commented Feb 22, 2024

To identify this problem I wrote a script that generates random indels for a region and then validates the RLE if the translated allele contains one. Here is the script:

from ga4gh.core import ga4gh_serialize
from itertools import cycle

from biocommons.seqrepo import SeqRepo
from ga4gh.vrs.dataproxy import SeqRepoDataProxy
from ga4gh.vrs.extras.translator import AlleleTranslator
import random
import sys

def get_seq_from_rle(allele, data_proxy):
    seqId = f"ga4gh:{allele.location.sequenceReference.refgetAccession}"
    start = allele.location.start
    end = start + allele.state.repeatSubunitLength
    subseq = data_proxy.get_sequence(seqId, start, end) # sequence retrieval function, e.g. from SeqRepo
    c = cycle(subseq)
    derivedseq = ''
    for i in range(allele.state.length):
        derivedseq += next(c)
    return derivedseq


# script arguments: seqrepo_data_dir seqid regionstart regionend maxinslen
# For example: validate_rle.py /usr/local/share/seqrepo/latest NC_000019.10 289400 289500 10

sr = SeqRepo(root_dir=sys.argv[1])
srp = SeqRepoDataProxy(sr)
tlr = AlleleTranslator(data_proxy=srp, default_assembly_name="GRCh38", identify=True, normalize=True)
seqid = sys.argv[2]
from_pos = int(sys.argv[3])
to_pos = int(sys.argv[4])
max_seq_len = int(sys.argv[5])

alpha = [ 'A', 'C', 'T', 'G' ]
cnt = 0
tests = set()
while (True):
    start = random.randint(from_pos, to_pos - 10)
    has_del = (random.randint(1, 2) == 2)
    end = random.randint(start + 1, to_pos) if has_del else start + 1
    seq_len = random.randint(0 if has_del else 1, max_seq_len)
    seq = ''
    for i in range(seq_len):
        idx = random.randint(0, 3)
        seq = seq + alpha[idx]
    key = f'{start}-{end}-{has_del}-{seq}'
    if key not in tests:
        tests.add(key)
        flank5 = srp.get_sequence(seqid, from_pos - 10, start - 1 if has_del else start)
        flank3 = srp.get_sequence(seqid, end if has_del else end - 1, to_pos + 10)
        hgvs = f'{seqid}:g.{start}_{end}' + ('del' if has_del else '') + (f'ins{seq}' if seq else '')
        allele = tlr.translate_from(hgvs, 'hgvs')
        if (allele.state.type == "ReferenceLengthExpression"):
            expseq = (flank5 + seq + flank3)
            derseq = get_seq_from_rle(allele, srp)
            if derseq not in expseq:
                print(hgvs)
                print(flank5)
                print(seq)
                print(flank3)
                print(ga4gh_serialize(allele.location))
                print(ga4gh_serialize(allele))
                print(expseq)
                print(derseq)
                print('')
        if (len(tests) % 1000) == 0:
            print(f'Completed {len(tests)} tests')
            print('')

ahwagner added a commit that referenced this issue Feb 23, 2024
@ahwagner ahwagner mentioned this issue Feb 23, 2024
larrybabb pushed a commit that referenced this issue Feb 23, 2024
* Restore line that recomputes repeat subunit length

* Added additional HGVS test case

* Only recompute subunit length for ambiguous insertions or deletion/insertions

* Added test case with ambiguous deletion

* addresses #351

---------

Co-authored-by: Eugene Clark <ehclark@partners.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.0-alpha Issues related to VRS 2.0-alpha branch bug Something isn't working priority:high High priority
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant