How to normalize unphased variants? #534

d-cameron · 2024-06-28T02:40:07Z

d-cameron
Jun 28, 2024
Collaborator

If I have two nearby unphased indel variants it is unclear how they are supposed to be normalised.

Following https://vrs.ga4gh.org/en/stable/impl-guide/normalization.html, I get to the step 3a:
is equal to the base preceding

How is this defined for the Alternate Allele Sequence?

Defining it as preceding reference sequence is problematic as if If I have phased A-T and an A insertion in a poly-A region, normalisation is lossy as it removes the information about whether the T occurs before or after the A ins.

Defining it as the preceding alt haplotype base is also problematic as 1) if there's a nearby unphased variant then the alt haplotype sequence is not known at that position and 2) this forces nearby variants to be merged into a single variant or (if the flanking variant causes the extension to stop) 3) changes the isolated normalised representation of the variant (e.g. the A INS no longer covers the full poly-A sequence, but if you renormalise it in isolation it gets expanded to the full poly-A).

How should an implementation normalise such variants?

andreasprlic · 2024-06-28T04:47:08Z

andreasprlic
Jun 28, 2024

Hi Daniel, I believe the intention behind normalization is to avoid being overly-precise when referring to variants. While it is not exactly stated, we should also not become lossy and drop information. Would it be possible to provide the coordinates and ref/alt for the example you are describing? Then it is easier to follow / dig into the details... Thanks!

0 replies

d-cameron · 2024-07-03T23:44:59Z

d-cameron
Jul 3, 2024
Collaborator Author

REF: TAAAAAAAT
Variant 1: A>T @ position 3
Variant 2: A>AA @ position 5

How should Variant 2 be normalised? The bounds would be pos1 but Variant 1 is in the way and may or may not change the alt sequence (depends on cis/trans phasing)

0 replies

d-cameron · 2024-07-03T23:47:28Z

d-cameron
Jul 3, 2024
Collaborator Author

If the variants are cis, the normalisation algorithm results in loss of information as it can be reconstructed as;
TATAAAAAAT (correct) or
TAATAAAAAT (wrong)

0 replies

andreasprlic · 2024-07-05T05:13:43Z

andreasprlic
Jul 5, 2024

The way I would normalize this is to create the full ref and full alt, and then run through normalize. The problem with your example is that the two variants both go away and get replaced with a single insertion of a T (at 2 in interbase coords).

Would a variant-caller detect this as a single insT allele? The two alleles in cis might not be detectable? I assume there is a background story why you would like to report this as a haplotype of the two composite alleles?

Here a unit test that reproduces this. This is basically what vrs-python is doing at its core during normalization:

from bioutils.normalize import normalize

def test_shuffling():

    ref = 'TAAAAAAAT'
    alt = 'TATAAAAAAT'
    
    chrom_seq = ref
    start = 0
    end = len(ref)
    shuffle_direction = 'EXPAND'

    shuffled_interval, shuffled_alleles = normalize(
            chrom_seq, interval=(start, end), alleles=(None, alt), mode=shuffle_direction
        )
    
    assert shuffled_interval == (2,2)
    assert shuffled_alleles[0] == ''
    assert shuffled_alleles[1] == 'T'

0 replies

d-cameron · 2024-07-08T23:21:19Z

d-cameron
Jul 8, 2024
Collaborator Author

If the variants are unphased the two haplotypes are:

TAAAAAAAT/TATAAAAAAT (ref/both)
or
TATAAAAAT/TAAAAAAAAT (first/second)

create the full ref and full alt

In this context, what do you mean by 'full ref' and 'full alt'? If the variants aren't phased, we don't know what the alt haplotype actually is.

0 replies

d-cameron · 2024-07-08T23:51:36Z

d-cameron
Jul 8, 2024
Collaborator Author

My point is that the normalisation procedure doesn't state whether the variant should be extended with the ref or the alt. If it's the ref then normalisation loses information about the relative position of variants. If it's the alt then there's an implicit requirement to both report and merge all variants in potentially-normalising variant positions - something that many data sources do not do. There's also the problem of what to do with unphased variants.

Would a variant-caller detect this as a single insT allele? The two alleles in cis might not be detectable? I assume there is a background story why you would like to report this as a haplotype of the two composite alleles?

The typical use case I encounter this issue is is with VNTRs/STRs with imperfect repeats in the reference (e.g. ACACACATACAC) but an expanded/contracted repeat in sample (e.g. T>C, insAC), or vice versa. These get reported in VCF as separate variants. The imperfect reference version of this issue is interesting because making the repeat perfect (i.e. the T>C) actually widens interval over which the other variant could have biologically occurred (i.e. the AC could be have been inserted anywhere in the repeat) - a behaviour that complicates evolutionary tree reconstruction.

0 replies

andreasprlic · 2024-07-10T06:04:53Z

andreasprlic
Jul 10, 2024

Reading your comment about repeats reminds me of the challenges with over-precision in the representation of variants in repeats. By left or right shuffling them, we pick one of several possible alternative alignments of how reads can map to the reference. However it is not possible to identify what specific nucleotide was inserted or deleted and there are multiple alternative solutions. Referring to the whole region of ambiguity can help with several applications and I do wonder if a fully-justified representation of the STRs would help with the evolutionary tree reconstruction as well.

In this context, what do you mean by 'full ref' and 'full alt'?

I meant to represent the haplotype of the two variants, in a fully justified representation.

You are doing that by this:

TAAAAAAAT/TATAAAAAAT (ref/both)
or
TATAAAAAT/TAAAAAAAAT (first/second)

Now the problem is that a variant caller (and our normalization) would call the (both) haplotype as an insT, so it is no longer a haplotype and we are dealing with a single allele. That might make this representation less useful for evolutionary tree reconstruction. In this case, "expanding" the region around the alt again (going back to the the TATAAAAAAT description), and using the 4 alleles as you described them ^^^ might actually be what is most useful? (I note the that start/end coordinates for the ref representation would be identical for all 4 variants and could perhaps be an anchor for aligning the sequences? just a wild guess from my side here).

0 replies

ahwagner · 2024-08-22T19:02:43Z

ahwagner
Aug 22, 2024
Maintainer

Hi @d-cameron. In general, when we represent variants in VRS, we focus on the observed state. If two variants are unphased, by definition the variant caller has not observed this in-cis, and these are each reported independently as unphased variants. Even when variants are in-cis, VRS recommendations are to treat each variant in the haplotype / cis-phased block independently with respect to location. Often, variants are reported on GRC human genome assemblies, though in specific application domains (such as evolutionary tree reconstruction) this might be inappropriate, and different sequence references should be considered. VRS intentionally offers no specific guidance on use cases such as this.

More specific to your point, VRS has historically shied away from absolutely requiring variant normalization as defined in the spec; the VRS recommendation is that variants SHOULD be normalized following recommended conventions, though there may be applications where different strategies are necessary. If you have an alternate normalization strategy, or a clarification of the existing strategy, that meets your needs please do share so we may consider adoption!

In VRS 1.x, we have shied away from unphasing except for explicit declarations of in-cis phasing (the haplotype, now cis-phased block). I think it would be worthwhile to reopen Genotype and phasing in the VRS 2.x series if there are community implementations that are seeking this.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to normalize unphased variants? #534

{{title}}

Replies: 8 comments

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

How to normalize unphased variants? #534

d-cameron Jun 28, 2024 Collaborator

Replies: 8 comments

andreasprlic Jun 28, 2024

d-cameron Jul 3, 2024 Collaborator Author

d-cameron Jul 3, 2024 Collaborator Author

andreasprlic Jul 5, 2024

d-cameron Jul 8, 2024 Collaborator Author

d-cameron Jul 8, 2024 Collaborator Author

andreasprlic Jul 10, 2024

ahwagner Aug 22, 2024 Maintainer

d-cameron
Jun 28, 2024
Collaborator

andreasprlic
Jun 28, 2024

d-cameron
Jul 3, 2024
Collaborator Author

d-cameron
Jul 3, 2024
Collaborator Author

andreasprlic
Jul 5, 2024

d-cameron
Jul 8, 2024
Collaborator Author

d-cameron
Jul 8, 2024
Collaborator Author

andreasprlic
Jul 10, 2024

ahwagner
Aug 22, 2024
Maintainer