Replies: 8 comments
-
Hi Daniel, I believe the intention behind normalization is to avoid being overly precise when referring to variants. While it is not exactly stated, we should also not become lossy and drop information. Would it be possible to provide the coordinates and ref/alt for the example you are describing? That would make it easier to follow and dig into the details. Thanks!
-
REF: TAAAAAAAT
How should Variant 2 be normalised? The bounds would be pos1 but Variant 1 is in the way and may or may not change the alt sequence (depends on cis/trans phasing)
-
If the variants are cis, the normalisation algorithm results in loss of information as it can be reconstructed as:
-
The way I would normalize this is to create the full ref and full alt, and then run them through normalize. The problem with your example is that the two variants both go away and get replaced with a single insertion of a T (at 2 in interbase coords). Would a variant caller detect this as a single insT allele? The two alleles in cis might not be detectable? I assume there is a background story why you would like to report this as a haplotype of the two composite alleles?
Here is a unit test that reproduces this. This is basically what vrs-python is doing at its core during normalization:
from bioutils.normalize import normalize

def test_shuffling():
    # Full reference and full alt haplotype (both variants applied in cis)
    ref = 'TAAAAAAAT'
    alt = 'TATAAAAAAT'
    chrom_seq = ref
    # Interbase interval covering the whole reference sequence
    start = 0
    end = len(ref)
    shuffle_direction = 'EXPAND'
    # The first allele is None so that it is taken from chrom_seq over the interval
    shuffled_interval, shuffled_alleles = normalize(
        chrom_seq, interval=(start, end), alleles=(None, alt), mode=shuffle_direction
    )
    # The two variants collapse into a single T insertion at interbase position 2
    assert shuffled_interval == (2, 2)
    assert shuffled_alleles[0] == ''
    assert shuffled_alleles[1] == 'T'
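The "full ref / full alt" approach can also be sketched without bioutils. This is a minimal illustration, not vrs-python's implementation. The two cis variants used here (an A>T substitution and an A insertion, in interbase coordinates) are assumptions made for the example, because the variant definitions from the thread are not shown, and `apply_cis` and `trim_common` are hypothetical helper names.

```python
def apply_cis(ref, edits):
    """Build the alt haplotype from non-overlapping interbase edits (start, end, seq)."""
    alt = ref
    # Apply edits right to left so that earlier coordinates stay valid.
    for start, end, seq in sorted(edits, reverse=True):
        alt = alt[:start] + seq + alt[end:]
    return alt


def trim_common(ref, alt):
    """Trim the shared prefix and suffix; return ((start, end), ref_allele, alt_allele)."""
    start = 0
    while start < min(len(ref), len(alt)) and ref[start] == alt[start]:
        start += 1
    ref_rest, alt_rest = ref[start:], alt[start:]
    while ref_rest and alt_rest and ref_rest[-1] == alt_rest[-1]:
        ref_rest, alt_rest = ref_rest[:-1], alt_rest[:-1]
    return (start, start + len(ref_rest)), ref_rest, alt_rest


REF = "TAAAAAAAT"
# Assumed cis pair: the substitution sits before the inserted A ...
t_then_a = apply_cis(REF, [(2, 3, "T"), (3, 3, "A")])
# ... and the opposite arrangement: the inserted A sits before the substitution.
a_then_t = apply_cis(REF, [(1, 1, "A"), (1, 2, "T")])

print(t_then_a, a_then_t)
print(trim_common(REF, t_then_a))
```

Both arrangements produce the same haplotype, which trims to a single T insertion at interbase position 2, so the relative order of the two component variants cannot be recovered from the merged allele.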
-
If the variants are unphased the two haplotypes are:
In this context, what do you mean by 'full ref' and 'full alt'? If the variants aren't phased, we don't know what the alt haplotype actually is.
-
My point is that the normalisation procedure doesn't state whether the variant should be extended with the ref or the alt. If it's the ref then normalisation loses information about the relative position of variants. If it's the alt then there's an implicit requirement to both report and merge all variants in potentially-normalising variant positions - something that many data sources do not do. There's also the problem of what to do with unphased variants.
The typical use case where I encounter this issue is VNTRs/STRs with imperfect repeats in the reference (e.g. ACACACATACAC) but an expanded/contracted repeat in the sample (e.g. T>C, insAC), or vice versa. These get reported in VCF as separate variants. The imperfect-reference version of this issue is interesting because making the repeat perfect (i.e. the T>C) actually widens the interval over which the other variant could have biologically occurred (i.e. the AC could have been inserted anywhere in the repeat), a behaviour that complicates evolutionary tree reconstruction.
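The widening effect on the imperfect repeat can be shown with a small sketch. It is a simplified roll of a pure insertion left and right through its sequence context, not the full VRS normalization procedure, and `ambiguity_interval` is a hypothetical helper name. The insertion positions passed in are assumptions chosen for illustration.

```python
def ambiguity_interval(seq, pos, ins):
    """Return the (leftmost, rightmost) interbase positions an insertion can roll to."""
    # Roll left: the insertion can shift one base left while its last base
    # matches the base immediately before it.
    left, allele = pos, ins
    while left > 0 and seq[left - 1] == allele[-1]:
        allele = allele[-1] + allele[:-1]
        left -= 1
    # Roll right: the insertion can shift one base right while its first base
    # matches the base immediately after it.
    right, allele = pos, ins
    while right < len(seq) and seq[right] == allele[0]:
        allele = allele[1:] + allele[0]
        right += 1
    return left, right


imperfect = "ACACACATACAC"  # reference with the imperfect repeat
perfect = "ACACACACACAC"    # same region once the T>C change is applied

print(ambiguity_interval(imperfect, 0, "AC"))  # insertion placed before the T
print(ambiguity_interval(imperfect, 8, "AC"))  # insertion placed after the T
print(ambiguity_interval(perfect, 0, "AC"))    # insertion on the perfected repeat
```

On the imperfect reference the T blocks the roll, so the insertion stays on one side of it. On the perfected repeat the same insertion can roll across the whole region.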
-
Reading your comment about repeats reminds me of the challenges with over-precision in the representation of variants in repeats. By left or right shuffling them, we pick one of several possible alternative alignments of how reads can map to the reference. However, it is not possible to identify which specific nucleotide was inserted or deleted, and there are multiple alternative solutions. Referring to the whole region of ambiguity can help with several applications, and I do wonder if a fully-justified representation of the STRs would help with the evolutionary tree reconstruction as well.
I meant to represent the haplotype of the two variants, in a fully justified representation. You are doing that by this:
Now the problem is that a variant caller (and our normalization) would call the
-
Hi @d-cameron. In general, when we represent variants in VRS, we focus on the observed state. If two variants are unphased, by definition the variant caller has not observed this in-cis, and these are each reported independently as unphased variants. Even when variants are in-cis, VRS recommendations are to treat each variant in the haplotype / cis-phased block independently with respect to location.
Often, variants are reported on GRC human genome assemblies, though in specific application domains (such as evolutionary tree reconstruction) this might be inappropriate, and different sequence references should be considered. VRS intentionally offers no specific guidance on use cases such as this.
More specific to your point, VRS has historically shied away from absolutely requiring variant normalization as defined in the spec; the VRS recommendation is that variants SHOULD be normalized following recommended conventions, though there may be applications where different strategies are necessary. If you have an alternate normalization strategy, or a clarification of the existing strategy, that meets your needs please do share so we may consider adoption!
In VRS 1.x, we have shied away from unphasing except for explicit declarations of in-cis phasing (the haplotype, now cis-phased block). I think it would be worthwhile to reopen Genotype and phasing in the VRS 2.x series if there are community implementations that are seeking this.
-
If I have two nearby unphased indel variants it is unclear how they are supposed to be normalised.
Following https://vrs.ga4gh.org/en/stable/impl-guide/normalization.html, I get to step 3a:
is equal to the base preceding
How is this defined for the Alternate Allele Sequence?
Defining it as the preceding reference sequence is problematic: if I have a phased A>T substitution and an A insertion in a poly-A region, normalisation is lossy because it removes the information about whether the T occurs before or after the A insertion.
Defining it as the preceding alt haplotype base is also problematic: 1) if there is a nearby unphased variant, the alt haplotype sequence is not known at that position; 2) it forces nearby variants to be merged into a single variant; or 3) if the flanking variant causes the extension to stop, it changes the isolated normalised representation of the variant (e.g. the A insertion no longer covers the full poly-A sequence, but if you renormalise it in isolation it gets expanded to the full poly-A).
How should an implementation normalise such variants?
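The two readings of "the base preceding" can be compared on a small poly-A example. This is a rough sketch, not the algorithm from the VRS implementation guide: `extend_insertion` is a hypothetical helper, and the sequence, the A>T substitution at index 2, and the insertion position are all assumed for illustration.

```python
def extend_insertion(context, pos, ins):
    """Roll a pure insertion left and right through `context`; return (left, right)."""
    left, allele = pos, ins
    while left > 0 and context[left - 1] == allele[-1]:
        allele = allele[-1] + allele[:-1]
        left -= 1
    right, allele = pos, ins
    while right < len(context) and context[right] == allele[0]:
        allele = allele[1:] + allele[0]
        right += 1
    return left, right


ref = "TAAAAAAAT"
# Assumed cis substitution A>T at index 2; it does not change the sequence length,
# so interbase coordinates are the same in both contexts.
alt_context = ref[:2] + "T" + ref[3:]

# Extending against the reference: the insertion spans the whole poly-A run.
against_ref = extend_insertion(ref, 3, "A")
# Extending against the alt haplotype: the substituted T stops the leftward roll.
against_alt = extend_insertion(alt_context, 3, "A")

print(against_ref, against_alt)
```

The same insertion gets two different intervals depending on which context is used, which is the inconsistency between the isolated and the in-context normalised forms described above.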