-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improved Structural Variant Support #465
Conversation
commit ecd40f0 Author: Tim Hefferon <theffero@nih.gov> Date: Mon Dec 18 13:43:12 2017 -0500 editorial improvements commit a1b6ad4 Merge: 986e835 7aa24be Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Fri Dec 15 12:41:29 2017 -0500 Merge pull request samtools#5 from samtools/master Fix RFC3986 encoding for "%" commit 986e835 Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Thu Dec 14 10:48:21 2017 -0500 Update VCFv4.3.tex commit c598717 Merge: 46042a8 4c18a2c Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Thu Dec 14 10:41:20 2017 -0500 Merge pull request samtools#4 from thefferon/revert-whitespace Update VCFv4.3.tex commit 4c18a2c Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Thu Dec 14 10:39:55 2017 -0500 Update VCFv4.3.tex commit 46042a8 Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Thu Dec 14 10:31:48 2017 -0500 Add files via upload commit 1d3c079 Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Wed Dec 13 14:35:38 2017 -0500 cleanup samtools#1 commit 6d30b09 Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Wed Dec 13 14:33:39 2017 -0500 incorporated @d-cameron's changes from PR samtools#266 commit 1de4ac4 Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Wed Dec 13 14:25:41 2017 -0500 backed out whitespace changes commit 406e9da Merge: 489b696 ce1e750 Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Wed Dec 6 09:11:52 2017 -0500 Merge pull request samtools#3 from thefferon/thefferon-patch-1 Add files via upload commit ce1e750 Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Wed Dec 6 09:11:20 2017 -0500 Add files via upload commit 489b696 Merge: 527e04b 2f915a8 Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Tue Dec 5 15:54:37 2017 -0500 Merge pull request samtools#2 from samtools/master bringing my fork up to date commit 527e04b Merge: 7c5259c 85a0fef Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Thu Nov 16 16:21:31 2017 -0500 Merge pull request samtools#1 from samtools/master bringing my fork up to date commit 7c5259c Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Wed Aug 9 16:25:24 2017 -0400 First go at changes requested in pull request samtools#231 Still evaluating effect of these changes on pdf layout... commit cf2ffa9 Author: Tim Hefferon <theffero@ncbi.nlm.nih.gov> Date: Tue Aug 8 11:22:22 2017 -0400 Made significant updates to Section 3, INFO keys used for structural variants # Conflicts: # VCFv4.3.tex
See #448 for an example of how real-world tools are using fields in a non-compliant manner to work around the lack of proper subclonal support. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor typos, but overall I agree with merging. I think these changes are a good improvement on their own, so I'm fine with making other PRs to complete everything that was agreed.
Co-Authored-By: jmmut <jomutlo@gmail.com>
…mber semantics that do not match that root type.
I've now stripped the bit where DUP subtypes defined different breakpoints than the root types. This badly breaks backwards compatability and can be better handled as part of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be correct now, but can you clarify these points to me, please?
VCFv4.3.tex
Outdated
\item BND: Breakend | ||
\item DEL: Deletion relative to the reference | ||
\item INS: Insertion relative to the reference | ||
\item DUP: Region of elevated copy number relative to the reference |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it intentional that the description for DUP is different here in SVTYPE than the one in symbolic ALTs? I would change it here too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now this seems to indicate that if you know there's a duplication but aren't sure if it's tandem or not, you should use SVTYPE=DUP
and have an ALT
of <CNV>
. Is that the intent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, there are VCFs out there that use <DUP>
to indicate regions of elevated copy number rather than tandem duplications. Perhaps we should call out a warning about interpreting legacy VCFs given this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@d-cameron I must say that I really appreciate your decision to approach this with smaller PRs, they're so much easier to discuss than huge ones (which inevitably get stuck after a while)
@d-cameron Added a few comments on places that I think could use clarifications -- sorry that I didn't make them sooner. Again, thanks for spearheading this change. |
Ok, there's quite a bit of discussion around Tools based on micro-array or copy number evidence report In conclusion, the VCF specs don't actualy specify what symbolic structural variants actually mean, so different tools. Our options are:
My preference is option 4. Thoughts? |
Option 5. grandfather v4.2 or earlier, as ambiguous. Make an unambigous choice if the header is v4.3 |
Most integrated WGS/NGS germline SV calling pipelines that I've seen being developed for large scale studies include both PE/SR/ASM based calls that would have breakend support as well as depth based calls that don't. You just can't capture the breakpoints of a lot of germline CNVs with short reads, and I think we'll be getting lots of mixed VCFs in the future as integrated pipelines are more widely deployed.
Just to clarify, are you suggesting something like a SVBKPT INFO field, which, if present, clarifies that there is a breakend claim? |
Just chiming in here w/ opinion that tandem duplications & repeat expansions are different from CNVs & should be considered (a) separate entit(y|ies) (e.g. having annotations from sequence, times). Otherwise liking the move towards resolving ambiguous annotations... But IMO a column for a second reference (chro) is needed, when keeping the columnar format :-) |
Doesn't PRECISE/IMPRECISE achieve what INFO.SVBKPT would? (Although I've seen examples that are both PRECISE and have CIPOS and CIEND, which I find confusing). And on a tangent, what is the likelihood of a Tabix index optionally supporting CIPOS/CIEND? |
|
|
Usually IMPRECISE is only used by callers that are using paired-end mapping signal to identify variants that show a read mapping signature but for which the exact breakpoint cannot be determined, either because there are no split reads or assembly-based evidence at the site, or the caller didn't look for it. This is different from a depth-based CNV caller, which is not making a claim about the breakpoint structure -- it's just reporting that an extra copy of a reference segment exists in the sample. The breakpoint structure of a CNV call could be much more complex than a simple tandem duplication -- imagine that segments from two chromosomes are duplicated, joined together, and inserted on a third chromosome. If the spec said something like, "if an SVTYPE=DUP variant is marked IMPRECISE, it is making a claim of a tandem duplication", it might have the same effect as the SVBKPT INFO field I was talking about above (or some other similar solution), but I think it would be a bit backwards from its original intent and hard to comprehend. |
@d-cameron I think my top preference from your list is your option 4: Keep DEL / DUP ambiguous but add an INFO field indicating whether there's a breakpoint claim of a simple deletion or tandem duplication (which can then be used with IMPRECISE and related fields). I would also consider option 5 (as I understand it this is making DUP and DEL into breakpoint claims and requiring non-tandem DUPs to be written with SVTYPE=CNV) as long as the new version of the spec contained strong, clear language warning about the difference in the interpretation, particularly of DUP, from older VCFs. Do you know if any SV tools are currently writing VCFs marked 4.3? I would hate to change the interpretation of variants from tools that are in active use without changing the VCF version number. Just to be clear, I support the goal of better codifying symbolic SV alleles so that it's easier to produce an less ambiguous sequence interpretation. |
One more comment on CNVs: cohort-based CNV callers can disentangle different copy number variable alleles even without breakpoint evidence. For example, in the case of overlapping duplications with clearly different boundaries (but for which we don't know the breakpoint structure):
There are three "copy number variable regions" here, but you can sometimes use phasing and parsimonious modeling of allele frequencies to distinguish each sample's genotype for the DUP1 and DUP2 alleles. If we restrict non-breakpoint based claims to use the |
I suggest we adopt something like @cwhelan 's working definition of Option #5, perhaps: “DUP and DEL are breakpoint claims; non-tandem DUPs (i.e. strictly copy number based claims) should be written as SVTYPE=CNV.” However, there is a consequence: moving dispersed DUPs to CNV requires broadening the definition of THAT category to include single-copy-gain dispersed dups. The August 22, 2019 pdf of v4.3 uses the following definition for CNV: “Copy number variable region (may be both deletion and duplication)” or, as pointed out earlier in the current discussion (#465 (review)): “Copy number variable region (multiallelic)” A freshly-redefined ‘CNV’ would include all of the following variants: #3, as defined above, can reasonably be called ‘biallelic’. In contrast, the very broad perception is that “SVTYPE=CNV” implies multiple alleles” (in fact, gnomAD's SV VCF uses 'MCNV' instead of 'CNV' - maybe this is a change we want to consider adopting in the spec). So the ‘CNV’ definition will need to be re-written to include the new kid on the block. |
I think it may also be important to recognize, if only conceptually, that any called insertion can represent (depending on the nature and extent of the analysis involved) a dispersed duplication or a tandem duplication. Distinguishing among these possibilities requires an analysis of the insertion’s sequence content and immediate genomic context. |
Hi, all, Chris Whelan brought this discussion to my attention. From my perspective, it would be ideal if the VCF specification allowed these kinds of representations.
Older versions of the code used
I would also like to say that while I don't expect such representations to be standardized to the point of interoperability or diagnostic use, I do think it is useful to allow the VCF format to have some flexiblity, enough to permit experimentation and innovation within the standard through the use of certain conventions, such as allele naming conventions like I describe above, which might only be understood by certain tools. At the same time, I think it is good if it is easy tools encountering an unrecognized allele format to not make any special assumptions about it. If VCF does not have this flexibility, then I think the alternative will to use other non-standard file formats, which I think leads to more file conversion, more friction trying to get tools to work together, more potential for errors, etc. |
@cwhelan I can see your point. How about this then? Consensus proposal, version 2
It's pretty much already how all of this is supposed to work, but my point is that it needs to be explicitly and very carefully worded in the specification. |
(Sorry for the confusion, I misclicked and sent the message before it was ready, so I deleted it. This is the complete message.) After reading multiple times this thread and the related ones, and also the current spec, I think the purpose of the current spec writing (before this PR) was: symbolic SV ALTs:
With that, I could see that a read depth claim DUP, can use ALT=DUP and any SVTYPE (possibly DUP for simplicity), but SVTYPE becomes useful for a breakpoint claim where you use the breakend notation in ALT (to make a breakpoint claim) and put the SVTYPE=DUP. I'm not saying we should keep this as it is, I'm just trying to understand the history of the current writing and whether it can be clarified or needs a breaking rewrite. This is kind of similar to the last proposal by @tskir, where a read depth claim is specified with ALT=CNV and anything else is breakpoint claim. If we go with the meaning I just explained, from tskir's comment I can see how it may not apply to INS and INV if those can not be identified by a read depth analysis (I'm no expert on that field), but I wonder if the main problem @d-cameron explained offline (about other callers misusing the ALT and SVTYPE fields) still applies? the callers express the evidence in ALT (BND is breakpoint claim, anything else is read depth claim; or we can change to a similar combination as tskir's one), and the interpretation is expressed in SVTYPE. Please let me know if you have seen callers that do not comply with this split. One concern with that approach is if breakpoint claims are not a superset of read depth claims, and if it would make sense to be able to state both at the same time. Also, tskir, how do you classify a non-tandem DUP with known location in your suggestion? With a breakend in ALT? is DUP:TANDEM unnecessary then? |
@jmmut You raised some very good points. I also had to re-read parts of the specification to address them.
You were quite right to notice that BND is different from other symbolic allele types, and that the other types do not make breakpoint claims. However, those other types are not read depth claims either in the specification. “Read depth” refers to a specific set of methods of (imprecisely) detecting increase and decrease in segment copy numbers. Rather, the specification makes the distinction between “precise” and “imprecise” calls, regardless of the method of detection. In section 1.4.5 “Alternative allele field format”, subsection “Structural Variants” starts:
That means that if you have an imprecise structural variant (meaning it has not been detected up to base pair resolution, using whatever method), you specify it using:
The current wording of section 1.4.5 leaves it ambiguous what to do with precise structural variants — e. g. when you know coordinates of a huge deletion up to single nucleotide resolution. Some examples use the
I'm not necessarily against doing it this way, but:
In light of the points you raised, I think I have an idea for a better proposal which would be much more consistent and also mostly compatible with the current specification version. I will post it shortly. |
Based on feedback from @cwhelan and @jmmut, I present to you: Consensus proposal, version 3Retain the same SV types for symbolic alleles; expand & clarify their definitionsThe types currently present in the specification are just fine, but poorly defined (it took me and @jmmut a couple of days to understand their true indended meaning). Let's define them very explicitly:
Explicitly make SV calls of all types imprecise by default. Mark precise calls using CIPOS and CIENDAgain, this already looks like the way the current specification is intended to work, it's just not clear. Let's explicitly say that by default all structural symbolic alleles denote an approximate variant location with the start/end position as best estimates. To make a “breakpoint claim”—that is, to specify that the start and/or end of a variant are known to single base resolution—existing CIPOS and/or CIEND fields must be set to (0, 0) values. (Alternatively, if people prefer, we could set up a special value, for example Exactly synchronise the lists of top level structural symbolic alleles and SVTYPE valuesIn this verison of the proposal, I'm back with my suggestion to completely synchronise the allowable values of the two lists. Since there will be already a way to discern exact (breakpoint) claims from inexact (e. g. read depth) claims, there is no need to mix the terms together between the symbolic alleles and the SVTYPE. As far as I can see, this proposal addresses all concerns raised by @cwhelan:
And the ones by @jmmut:
@d-cameron @cwhelan @jmmut Please let me know what you think. |
Your proposal is internally consistent but I think it misunderstands a couple of historical things and it moves away from the original intent of this PR, which is trying to figure out a way to distinguish between copy number and breakend-based claims so that downstream tools can figure out how to (at least partially) reconstruct the haplotype altered by the event, while not requiring every tool to write BND-formatted records (which, while complete, are extremely non-human readable and resistant to quick analyses). Historically and at the moment, both copy number-based calling methods and those based on read pair and split read mappings have used the same SVTYPEs and ALTs, leading to difficulty in interpretation. For example, the former class of tool might detect elevated copy number of a region and create an IMPRECISE and CIPOS are used when the event is detected by read pairs or similar evidence that gives an indication of breakpoint structure, but the exact breakpoint coordinate is not known. Copy number based tools typically don't use those fields historically, since they are usually based on segmenting the genome arbitrarily into bins in which read depth is measured without any claim as to the breakend structure. Your comment yesterday:
Was a pretty good summary of the change in semantics that @d-cameron is trying to introduce into to the spec (with the backing of a group of us listed above who participated in calls organized by @thefferon several years ago). Rather than "In case breakpoints are unknown or not reported" I would say, "if the call makes no claim about breakpoint structure", however. In this case SVTYPE is providing the interpretation of the variant and the ALT can be used to derive breakpoint claims, as you stated earlier. I think that the changes in Daniel's PR are good and just need, as you said before, some careful and comprehensive wording and examples to flesh them out. |
Thank you for sharing your thoughts. I drafted proposal v3 following comments by @jmmut (while still taking into account your feedback on v1), and it seems to me that it better captures the original intent of the specification, introduces fewer potentially breaking changes, and is overall better and more consistent than v2. I am not pushing for it, and if we have to go with v2 I will reluctantly agree, but let me try and defend v3 a bit. I totally get the original intent of this PR and the necessity to separate different types of calls. Here's how v2 and v3 approach this using different means: Duplications (not necessarily tandem)
★ You mentioned that tools based on read depth don't usually use CIPOS and CIEND because of how their algorithms work. However, it is still possible to estimate uncerntainties in those kinds of approaches. For instance, if a copy number increase is detected as starting in bin #N and ending in bin #M, I think it quite makes sense to use start and end coordinates of those boundary bins to infer CIPOS and CIEND. It makes a difference whether the bins were 500 nt sized or 1,000,000 nt. For comparison: CNV region (both deletions and duplications present)
From what I see, v3 has the following advantages over v2:
If I understand correctly, your primary concern with v3 is that case 1 is indistinguishable from case 2, as well as case 3 from case 4. But I think that this is exactly the point, as they represent the same events inferred by different methods. By making CNV represent read depth claims in v2, we are tying that symbolic allele type to a specific detection method. I think it is much better when the calls themselves are completely agnostic to the detection technology, as implemented in v3 (and again, as appears to have been intended by the existing specification wording). Having said that, I realise that in many cases there is a need to know how the call was produced—I 100% support that. But I think this is much better done by e. g. introducing a specific (possibly even non-optional or highly encouraged) INFO field with a list of standardised detection types, e. g.:
I believe someone mentioned this possibility during one of the recent calls (can't remember who unfortunately), but did not press this further. I think this is the best approach, as it encodes useful additional information in an INFO field rather than cramming it into the symbolic allele type. @d-cameron @cwhelan @jmmut Please do let me know what you think. |
My concerns with proposal 3 are:
P3 requires 3 fields when we can disambiguate with 2. It still does not clarify what alt=, svtype= actually means.
This problem still exists in p3 - it just requires yet another field to determine what their meaning is.
It sounds very much like the difference between p2 and p3 is whether we reuse SVTYPE for disambiguation or deprecate SVTYPE and use a new field. Less complex is always better than more complex when we're dealing with specifications. I don't think the utility of this additional field justifies the additional complexity.
The vocabulary must be either open or close. Both have issues. An open vocabulary would be required as new technologies will get introduced in the future but to unambigously interpret variants then the vocab needs to be closed. Realistically, new technology callers will just lie about their evidence and choose whatever pre-defined value corresponds to the claim being made. E.g. using your closed vocab, a microarray caller will just write READ_DEPTH, and a read-pair based caller will just write SPLIT_READ. This entirely defeats the purpopse of having tech type.
I agree with this. VCF is currently technology agnotic and it should remain so. It shouldn't matter whether read depth or microarray probe intensity is used to define a CNV, merely that a CNV claim is being made. My issue with p3 is that it's not tech agnostic. You need to look up the tech source field to determine what claim is being made when a VCF has a DUP record.
I always interpreted an unspecificied confidence interval implicitly indicates an interval of [0,0] and my caller only writes CIPOS when the confidence interval is non-zero. |
Case 1 and 2 being indistinguishable is exactly my primary concern. Yes a simple tandem duplication looks the same for both case 1, and case 2, but case 1 and 2 are making different claims about the structure of the genome. For a genome with segments ABC:
These are not claiming the same thing. An actual simple tandem duplication requires a) both of these claims to be true and b) there are no other claims interfere with B, c) there is reference allele support for an AB and/or AC transition. Case 1 allows for ABBC or ABC / circular B And that's before we get into the possibility of missing or related events. case 1 could be part of a complex rearrangement and there may well be no copy number change of B at all. To reiterate: case 1 and 2 are making very different claims about the structure of the genome. VCF needs to be able to unambigiously specify the actual claims being made. |
@d-cameron After your comments I now understand your issues with proposal v3 much better, and I see the way to fix them. Let's continue the discussion on this: 1. Technology type vocabularyI agree that a new “detection technology” field with either an open or a closed vocabulary is a bad idea, for the reasons you described. Let's ditch it. The claim type can be specified without it—see below. 2. Position uncertaintiesSince this PR deals with claim types, not uncertainty specification, we can clarify and standardise CIPOS handling in a future separate PR. It has no direct effect on proposal v2 vs. v3 discussion. 3. Specifying claim typesYour example with the A/B/C genome was very informative. I now see what you mean by different “claim types”. It looks to me that we actually have the same view of the situation, we've just been using different terms for the same thing.
The idea in my example was that cases 1 and 2 do represent and claim exactly the same thing. Note that the header for cases 1–5 is “Duplications (not necessarily tandem)”. The idea is that For tandem duplications, in v3 you don't use 4. Redundancy and complexity of proposal v3
Without the technology vocabulary, the (updated) proposal v3 will also use only the two existing fields, ALT and SVTYPE. It will clarify the meaning of specific types and subtypes of structural variants in the way I described above: for example, with Also, if we are allowed to be bold and deprecate SVTYPE, since it does become redundant in this updated version of proposal v3, we can specify all information about the claim type using just one field, the symbolic allele type/subtype. I think it would the most elegant solution of all, utilising only the minimal number of existing fields, and having decent backwards compatibility. @d-cameron @jmmut @cwhelan Please let me know if you have any additional feedback. |
Design goals for this PR are:
Unfortunately, 1 and 2 are incompatible so we have to break something. Realistically, we're not going to have a large number of pre-4.4 VCFs around for a very long time. As much as I'd like to say that all breakpoint-based SV caller should report everything in BND notation, that's not going to happen.
If DUP is a CN claim, DUP:TANDEM is a breakpoint claim, then how do you make a claim of an actual tandem duplication (ie CN + breakpoint)? I know it's very late in this discussion, but I'll just put this out as an option for dicussion: Proposal 3a: explicit specification of support type Proposal 3 has the fundamental problem that it fails design goal 2 (eg the DUP example above). This can be resolved with an additional Practically, this approach has a few advantages. One that I particularly like is that downstream v4.4-aware SV event classifiers tools will already be dealing with the current ALT/SVTYPE mess, and it'd relatively simple to specify the pre-4.4 VCF out of band. E.g. |
For me this boils down to my initial confusion, where I thought that a breakpoint claim included the CN claim. After some of your explanations I understood how they are independent, and the class of CN+breakpoint is different than only breakpoint. For this, I see how v3 as we have discussed is inherently flawed as you could only specify CN or CN+breakpoint, and if we change v3 to allow the 3 claims, it will be basically the same as the changes in this PR. The alternative (partially) done in this PR is still not clear to me about how to identify a CN, BP or CN+BP using only ALT+SVTYPE, but we can keep trying to make it clearer. At this point I'm a bit skeptical we can make it very clear. The clearest to me would be something similar to the PR in the sense that ALT=[INS,DEL,INV,DUP] is CN+BP; ALT=CNV is CN and ALT=BND is BP; and SVTYPE is used to explain the common type (INS,DEL,INV,DUP) if ALT=CNV and ALT=BND. SVTYPE=[BND,CNV] could be used for types that are more complex than the other 4 simple types. But this is problematic if someone wants to state CN+BP using the bracket notation (I don't know if this is a real problem or if there are other possible complications). For this last reason I think I prefer the newest option: The third idea of "SVCLAIM with allowable values of CN, BP, and CNBP" has the clarity I was hoping to get from reorganising ALT and SVTYPE, but the biggest problem I see with that is that people will stick to only ALT+SVTYPE unless we make SVCLAIM required. It looks a bit of a pain, but I think it would be justified to make SVCLAIM required for SVs. It would still make sense to clarify (and simplify?) ALT+SVTYPE, but this becomes non-critical. |
A sane option is to just have There is IMO no good use case, in a VCF, to have something tagged as Tandems, precisely defined indels should be just that. Users can decide themselves from which size on they parse them as CNVs, and how to do that. Please keep (rather, make) it simple. |
@mbaudis Events need to have type CNV if they are measuring total copy number of a segment and you can't confidently say that there is only one type of alt allele (DEL or DUP) at the locus.. unless you have another proposal for how to represent that use case. @d-cameron The SVCLAIM idea seems good to me at first glance. Our integrated WGS pipeline (descended from the gnomAD SV pipeline) includes events called both by depth signal only, breakpoint signal, or both, in one final output vcf. We already track a value roughly equivalent to SVCLAIM; standardizing it in the spec would make sense to me. |
How about versus Might this rule change to centre-positioning in the next version of VCF? Discuss. |
Good catch, I didn't notice that bit of the specs.Left-aligning is the wrong choice for SVs because a) it can't be done for +/+ or -/- oriented breakpoints (left alignment of one side causes right alignment of the other), and b) for RP-based caller, left-alignment frequently results in a called position that is the least likely of the positions in the CIPOS interval. SVs should be either centre-aligned or aligned to the most likely position. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@d-cameron Here's the fresh round of reviews, as promised. (Sorry it comes at the latest possible moment before the deadline!)
Once the remaining comments are addressed, I think we can go ahead and merge this, and then tweak as necessary to align with subsequent PRs (such as the SVCLAIM one etc.)
I've also resolved several remaining discussion and extracted them into separate issues to keep the scope of this PR reasonable.
@@ -173,30 +173,49 @@ \subsubsection{Individual format field format} | |||
Possible Types for FORMAT fields are: Integer, Float, Character, and String (this field is otherwise defined precisely as the INFO field). | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the scope for this is going to be VCF 4.4, could you please rebase these changes against VCFv4.4.draft.tex
?
\begin{tabular}{l l} | ||
DEL & Deletion relative to the reference \\ | ||
INS & Insertion relative to the reference \\ | ||
DUP & Tandem duplication relative to the reference \\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DUP & Tandem duplication relative to the reference \\ | |
DUP & Duplication relative to the reference. This refers to any quantitative increase of the number of alleles compared to the reference genome, without indication about the physical location of the additional copies \\ |
As discussed, DUP is going to be just a generic “duplication” and the details will be provided in a separate SVCLAIM field #517.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I miss that "duplication" refers to a "any quantitative increase of the number of alleles compared to the reference genome, without indication about the physical location of the additional copies". I.e. tandems can be a subset, as can be e.g. extrachromosomal amplifications.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. That's an excellent phrasing as well, I've added it to the suggested change now
\noindent Variants should be written using the most precise type that can be determined by the variant caller. For example, if the insertion site of a new copy of a LINE1 element cannot be determined, it would be specified as a DUP of the originating LINE1 element. However if the new insertion site can be identified, the variant should be specified as INS:ME:LINE1 at the insertion site.\newline | ||
|
||
\noindent Note that the DUP type is restricted to simple tandem duplications. More complex duplications should be specified using BND notation.\newline |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
\noindent Variants should be written using the most precise type that can be determined by the variant caller. For example, if the insertion site of a new copy of a LINE1 element cannot be determined, it would be specified as a DUP of the originating LINE1 element. However if the new insertion site can be identified, the variant should be specified as INS:ME:LINE1 at the insertion site.\newline | |
\noindent Note that the DUP type is restricted to simple tandem duplications. More complex duplications should be specified using BND notation.\newline | |
\noindent Variants should be written using the most precise type that can be determined by the variant caller.\newline |
Looks like these examples contradict the approaches discussed in #517, so I suggest we remove them in this PR and then add more up-to-date examples as applicable.
\item BND: Breakend | ||
\item DEL: Deletion relative to the reference | ||
\item INS: Insertion relative to the reference | ||
\item DUP: Tandem duplication relative to the reference |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
\item DUP: Tandem duplication relative to the reference | |
\item DUP: Duplication relative to the reference |
Again, looks like this no longer applies and is superseded by #517.
#553 can we just kill SVTYPE for v4.4? We have to redefine it as As it currently stands, I think the burden of having SVTYPE outweights the value it adds to the specifications and it should be removed. It can be entirely inferred from ALT (implementation could even expose a read-only SVTYPE in their APIs for backwards compatibiilty), and it's not that much greater a burden on the quick and dirty parsing |
@d-cameron Personally, I lean mostly in favour of killing SVTYPE, the reason being precisely its complete reduncancy. Implementations exposing a read-only attribute sounds like a great idea as well. However, I really think we should separate the changes to avoid the PRs blocking each other. Do you think that, for the time being, we could rebaseline and merge this PR as it is, and then get back to SVTYPE discussions in #553? |
Current issues:
There's not really much left in this PR except for clarifications of reserved SV symbolic ALT alleles, and some additional examples. Should I move these into their own PRs and address there? |
@d-cameron Regarding this:
Yes, I think moving each of those changes into a separate PR would work best. |
Various drafts of improved Structural Variant support have been floating around for 2.5 years (see #231, #266) but never merged. I'm attempting to collate everything together in this PR but it is unclear as to what is in scope.
The following changes were agreed backed in 2017 by Cristina Yenyxe Gonzalez, Steve Huang, Daniel Cameron, Xuefang Zhao, Tobias Rausch, Tim Hefferon, John Lopez, Chris Whelan:
Additional changes that I'd like to see are:
@lbergelson @yfarjoun @pd3 What's the best way forward from here that minimises the chances of us sitting on PRs for another 2 years?