Improved Structural Variant Support #465

Closed
wants to merge 6 commits into from

@d-cameron d-cameron commented Dec 18, 2019

Various drafts of improved Structural Variant support have been floating around for 2.5 years (see #231, #266) but never merged. I'm attempting to collate everything together in this PR but it is unclear as to what is in scope.

The following changes were agreed back in 2017 by Cristina Yenyxe Gonzalez, Steve Huang, Daniel Cameron, Xuefang Zhao, Tobias Rausch, Tim Hefferon, John Lopez, Chris Whelan:

  • Restrict SVTYPE to the 6 primitives
    • SVTYPE will be re-cast as a "basic primitive"; its distinction from EVENTTYPE will be made clear in the spec
    • SVTYPE will have the following closed controlled vocabulary: DEL, INS, DUP, INV, CNV, BND
    • No colons or subtypes are allowed in SVTYPE value
    • All SV VCF records must include a value for SVTYPE
  • EVENTTYPE will be added to the spec. It will contain the "biological interpretation" of the variant
    • EVENTTYPE will have an open controlled vocabulary, to include: INS, MEI, ALU, L1, SVA, HERV, DEL, -MEI-, -ALU-, -L1-, -SVA-, -HERV-, INV, DUP, DELINS, CNV
  • EVENT will continue to be used to specify identifiers to link together multiple VCF records
    • EVENT is currently discussed primarily in the graphical BND-notation section of the spec; text will be added elsewhere, in relevant context area(s) of the spec.
  • CIPOS and CIEND will continue to be used as they have in the past: as the primary means of representing an interval within which a breakpoint is likely to fall.
    • The spec will be amended with examples, as needed, to illustrate the preferred usage of CIPOS and CIEND.
  • New tags CIPOSPROB and CIENDPROB will be introduced to the spec as a means to indicate the level of confidence implied by CIPOS and CIEND, respectively.
    • Each CI*PROB value shall contain two (2) values separated by a comma. These values shall represent the proportion of a normal distribution expected to fall outside the values recorded in CIPOS. For example, CIPOS=(-50,100) and CIPOSPROB=(0,0.95) indicate that the probability that the breakpoint lies more than 50 bp to the left of POS is zero, and the probability that it lies more than 100 bp to the right of POS is 0.95.
  • New tag HOMPOS will be introduced to the spec. Its value shall represent the coordinate on the given contig of the first basepair of the microhomology indicated by HOMSEQ.
    • The definition of HOMPOS will be: Position (relative to POS) of base pair identical micro-homology around event breakpoint. Note that length(HOMSEQ)=HOMLEN=HOMPOS[1]-HOMPOS[0]
  • BND will be officially added to the SVTYPE controlled vocabulary, and it will be referenced consistently throughout the spec (unlike now). It will NOT be replaced by other terms that were discussed, such as TRA (for translocation) or ADJ (for adjacency).
  • Examples of each type of event shall be drawn up and added to the spec, including one in "standard" notation and another in the equivalent "BND" notation. Tim and Daniel will draft these examples, respectively.
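
As an illustration of the first and fourth points in the list above, here is a minimal Python sketch of how a tool might enforce the closed SVTYPE vocabulary and turn CIPOS into an absolute interval. The helper names (`parse_info`, `check_svtype`, `breakpoint_interval`) are hypothetical and not part of any existing library, and the INFO parsing is deliberately simplified.

```python
# Closed SVTYPE vocabulary as listed above; everything else is illustrative.
SVTYPE_VALUES = {"DEL", "INS", "DUP", "INV", "CNV", "BND"}


def parse_info(info):
    """Split a VCF INFO string into a dict; flag keys map to True."""
    fields = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def check_svtype(info):
    """Return the SVTYPE value, or raise ValueError if missing or not allowed."""
    svtype = parse_info(info).get("SVTYPE")
    if not isinstance(svtype, str) or svtype == "":
        raise ValueError("SV record has no SVTYPE value")
    if svtype not in SVTYPE_VALUES:
        # Rejects subtypes such as DUP:TANDEM, since colons are not allowed.
        raise ValueError(f"SVTYPE {svtype!r} is not one of the six primitives")
    return svtype


def breakpoint_interval(pos, info):
    """Apply the CIPOS offsets to POS; return None when CIPOS is absent."""
    cipos = parse_info(info).get("CIPOS")
    if not isinstance(cipos, str):
        return None
    low, high = (int(x) for x in cipos.split(","))
    return pos + low, pos + high
```

With POS=1000 and CIPOS=-50,100, `breakpoint_interval` returns `(950, 1100)`, and `check_svtype("SVTYPE=DUP:TANDEM")` raises because the value contains a subtype.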

Additional changes that I'd like to see are:

  • Either 1 SV per record, or SV fields counts standardised to handle multiple SV records
    • the latter was blocked by VCF not supporting list of lists
  • A field to resolve STR expansion ambiguity (e.g. HOMSTRIDE)
  • Sub-clonality support (for all variants)
  • genotyping support for somatic SVs. There are two issues with the current specs:
    • copy number != ploidy
    • maternal/paternal haplotypes are still meaningful for somatic SV, it's just that there can be many copies of each.
  • Karyotype reconstruction
    • a 'next SV' field is sufficient
    • PSL could be adapted to handle this
  • Explicit clarification around SVTYPE about what the claim is
    • currently both CNV and SV callers write DELs so it's unclear if a DEL is claim of a breakpoint adjacency, a segmental loss, or both

@lbergelson @yfarjoun @pd3 What's the best way forward from here that minimises the chances of us sitting on PRs for another 2 years?

commit ecd40f0
Author: Tim Hefferon <theffero@nih.gov>
Date:   Mon Dec 18 13:43:12 2017 -0500

    editorial improvements

commit a1b6ad4
Merge: 986e835 7aa24be
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Fri Dec 15 12:41:29 2017 -0500

    Merge pull request samtools#5 from samtools/master

    Fix RFC3986 encoding for "%"

commit 986e835
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Thu Dec 14 10:48:21 2017 -0500

    Update VCFv4.3.tex

commit c598717
Merge: 46042a8 4c18a2c
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Thu Dec 14 10:41:20 2017 -0500

    Merge pull request samtools#4 from thefferon/revert-whitespace

    Update VCFv4.3.tex

commit 4c18a2c
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Thu Dec 14 10:39:55 2017 -0500

    Update VCFv4.3.tex

commit 46042a8
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Thu Dec 14 10:31:48 2017 -0500

    Add files via upload

commit 1d3c079
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Wed Dec 13 14:35:38 2017 -0500

    cleanup samtools#1

commit 6d30b09
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Wed Dec 13 14:33:39 2017 -0500

    incorporated @d-cameron's changes from PR samtools#266

commit 1de4ac4
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Wed Dec 13 14:25:41 2017 -0500

    backed out whitespace changes

commit 406e9da
Merge: 489b696 ce1e750
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Wed Dec 6 09:11:52 2017 -0500

    Merge pull request samtools#3 from thefferon/thefferon-patch-1

    Add files via upload

commit ce1e750
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Wed Dec 6 09:11:20 2017 -0500

    Add files via upload

commit 489b696
Merge: 527e04b 2f915a8
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Tue Dec 5 15:54:37 2017 -0500

    Merge pull request samtools#2 from samtools/master

    bringing my fork up to date

commit 527e04b
Merge: 7c5259c 85a0fef
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Thu Nov 16 16:21:31 2017 -0500

    Merge pull request samtools#1 from samtools/master

    bringing my fork up to date

commit 7c5259c
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Wed Aug 9 16:25:24 2017 -0400

    First go at changes requested in pull request samtools#231

    Still evaluating effect of these changes on pdf layout...

commit cf2ffa9
Author: Tim Hefferon <theffero@ncbi.nlm.nih.gov>
Date:   Tue Aug 8 11:22:22 2017 -0400

    Made significant updates to Section 3, INFO keys used for structural variants

# Conflicts:
#	VCFv4.3.tex
@d-cameron
Contributor Author

See #448 for an example of how real-world tools are using fields in a non-compliant manner to work around the lack of proper subclonal support.

@jmarshall jmarshall added the vcf label Dec 18, 2019
Contributor

@jmmut jmmut left a comment

Minor typos, but overall I agree with merging. I think these changes are a good improvement on their own, so I'm fine with making other PRs to complete everything that was agreed.

@hts-specs-bot

Changed PDFs as of e2beb9c: VCFv4.3 (diff).

@d-cameron
Contributor Author

I've now stripped the bit where DUP subtypes defined different breakpoints than the root types. This badly breaks backwards compatibility and can be better handled as part of the EVENT/EVENTTYPE PR.

Contributor

@jmmut jmmut left a comment

It might be correct now, but can you clarify these points to me, please?

VCFv4.3.tex Outdated
\item BND: Breakend
\item DEL: Deletion relative to the reference
\item INS: Insertion relative to the reference
\item DUP: Region of elevated copy number relative to the reference
Contributor

Is it intentional that the description for DUP is different here in SVTYPE than the one in symbolic ALTs? I would change it here too.

Right now this seems to indicate that if you know there's a duplication but aren't sure if it's tandem or not, you should use SVTYPE=DUP and have an ALT of <CNV>. Is that the intent?

Also, there are VCFs out there that use <DUP> to indicate regions of elevated copy number rather than tandem duplications. Perhaps we should call out a warning about interpreting legacy VCFs given this change.

Member

@tskir tskir left a comment

@d-cameron I must say that I really appreciate your decision to approach this with smaller PRs, they're so much easier to discuss than huge ones (which inevitably get stuck after a while)

@cwhelan

cwhelan commented Jan 24, 2020

@d-cameron Added a few comments on places that I think could use clarifications -- sorry that I didn't make them sooner. Again, thanks for spearheading this change.

@d-cameron
Contributor Author

Ok, there's quite a bit of discussion around DUP events. I think the real underlying issue is that we don't all agree on what a claim of a DUP is, and we already have an ecosystem in which different tools are making different claims.

Tools based on micro-array or copy number evidence report DUP as a region of elevated copy number (typically +1 copy). They are making a claim about the number of copies of the duplicated region. Tools based on NGS report DUP when they find a breakpoint in an orientation consistent with a tandem duplication.

In conclusion, the VCF specs don't actually specify what symbolic structural variants mean, so different tools make different claims. Our options are:

  1. Define DEL/DUP as breakpoint claims
  2. Define DEL/DUP as CN claim
  3. Define DEL/DUP as both breakpoint and CN claim
  4. Add an additional field which is used to clarify structural symbolic allele claims, while existing calls remain ambiguous.

My preference is option 4.

Thoughts?

@d-cameron
Contributor Author

Option 5: grandfather v4.2 or earlier as ambiguous, and make an unambiguous choice if the header is v4.3.

@cwhelan

cwhelan commented Jan 27, 2020

Tools based on micro-array or copy number evidence report DUP as a region of elevated copy number (typically +1 copy). They are making a claim about the number of copies of the duplicated region. Tools based on NGS report DUP when they find a breakpoint in an orientation consistent with a tandem duplication.

Most integrated WGS/NGS germline SV calling pipelines that I've seen being developed for large scale studies include both PE/SR/ASM based calls that would have breakend support as well as depth based calls that don't. You just can't capture the breakpoints of a lot of germline CNVs with short reads, and I think we'll be getting lots of mixed VCFs in the future as integrated pipelines are more widely deployed.

  4. Add an additional field which is used to clarify structural symbolic allele claims, while existing calls remain ambiguous.

Just to clarify, are you suggesting something like a SVBKPT INFO field, which, if present, clarifies that there is a breakend claim?

@mbaudis

mbaudis commented Jan 27, 2020

Just chiming in here w/ opinion that tandem duplications & repeat expansions are different from CNVs & should be considered (a) separate entit(y|ies) (e.g. having annotations from sequence, times).

Otherwise liking the move towards resolving ambiguous annotations... But IMO a column for a second reference (chro) is needed, when keeping the columnar format :-)

@rhdolin

rhdolin commented Jan 27, 2020

Doesn't PRECISE/IMPRECISE achieve what INFO.SVBKPT would? (Although I've seen examples that are both PRECISE and have CIPOS and CIEND, which I find confusing).

And on a tangent, what is the likelihood of a Tabix index optionally supporting CIPOS/CIEND?

@thefferon

But IMO a column for a second reference (chro) is needed, when keeping the columnar format :-)
@mbaudis, I am not clear on what you mean here. Please elaborate.

@thefferon

Doesn't PRECISE/IMPRECISE achieve what INFO.SVBKPT would? (Although I've seen examples that are both PRECISE and have CIPOS and CIEND, which I find confusing).
@rhdolin, by "PRECISE" do you mean the examples had a PRECISE flag, or just that they did not have the IMPRECISE flag? If the former, I admit I find that confusing as well.

@cwhelan

cwhelan commented Jan 28, 2020

@rhdolin

Doesn't PRECISE/IMPRECISE achieve what INFO.SVBKPT would? (Although I've seen examples that are both PRECISE and have CIPOS and CIEND, which I find confusing).

Usually IMPRECISE is only used by callers that are using paired-end mapping signal to identify variants that show a read mapping signature but for which the exact breakpoint cannot be determined, either because there are no split reads or assembly-based evidence at the site, or the caller didn't look for it. This is different from a depth-based CNV caller, which is not making a claim about the breakpoint structure -- it's just reporting that an extra copy of a reference segment exists in the sample. The breakpoint structure of a CNV call could be much more complex than a simple tandem duplication -- imagine that segments from two chromosomes are duplicated, joined together, and inserted on a third chromosome. If the spec said something like, "if an SVTYPE=DUP variant is marked IMPRECISE, it is making a claim of a tandem duplication", it might have the same effect as the SVBKPT INFO field I was talking about above (or some other similar solution), but I think it would be a bit backwards from its original intent and hard to comprehend.

@cwhelan

cwhelan commented Jan 28, 2020

@d-cameron I think my top preference from your list is your option 4: Keep DEL / DUP ambiguous but add an INFO field indicating whether there's a breakpoint claim of a simple deletion or tandem duplication (which can then be used with IMPRECISE and related fields).

I would also consider option 5 (as I understand it this is making DUP and DEL into breakpoint claims and requiring non-tandem DUPs to be written with SVTYPE=CNV) as long as the new version of the spec contained strong, clear language warning about the difference in the interpretation, particularly of DUP, from older VCFs. Do you know if any SV tools are currently writing VCFs marked 4.3? I would hate to change the interpretation of variants from tools that are in active use without changing the VCF version number.

Just to be clear, I support the goal of better codifying symbolic SV alleles so that it's easier to produce a less ambiguous sequence interpretation.

@cwhelan

cwhelan commented Jan 28, 2020

One more comment on CNVs: cohort-based CNV callers can disentangle different copy number variable alleles even without breakpoint evidence. For example, in the case of overlapping duplications with clearly different boundaries (but for which we don't know the breakpoint structure):

DUP1:          |-------------|
DUP2:                       |------------|

There are three "copy number variable regions" here, but you can sometimes use phasing and parsimonious modeling of allele frequencies to distinguish each sample's genotype for the DUP1 and DUP2 alleles. If we restrict non-breakpoint based claims to use the <CNV> alt allele, I think we'd have to change the description of the alt allele in the spec away from Copy number variable region to something more like Allele that changes the copy number of the reference segment. We'd probably also then need an additional INFO field to say in what direction and by how much the copy number changes for that particular allele. To me this lends support for my preference for your option 4 (allowing the use of <DUP> as the alts for these variants, and adding INFO fields for breakpoint support if they are tandem/simple deletions).
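
To make the "three copy number variable regions" concrete, here is a small Python sketch that splits two overlapping duplication intervals into segments, each covered by a constant set of alleles. The coordinates and the function name `copy_number_regions` are made up for illustration; real callers would also model phasing and allele frequencies, which this does not attempt.

```python
def copy_number_regions(intervals):
    """Split overlapping half-open intervals into segments, each covered by
    a constant set of alleles. `intervals` holds (start, end, name) tuples."""
    points = sorted({p for start, end, _ in intervals for p in (start, end)})
    regions = []
    for left, right in zip(points, points[1:]):
        names = sorted(
            name for start, end, name in intervals
            if start <= left and right <= end
        )
        if names:  # skip gaps that no allele covers
            regions.append((left, right, names))
    return regions


# Hypothetical coordinates mirroring the diagram above.
dups = [(100, 250, "DUP1"), (230, 360, "DUP2")]
for region in copy_number_regions(dups):
    print(region)
```

For these two intervals the function yields three segments: one covered only by DUP1, one covered by both alleles, and one covered only by DUP2.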

@thefferon

I suggest we adopt something like @cwhelan 's working definition of Option #5, perhaps: “DUP and DEL are breakpoint claims; non-tandem DUPs (i.e. strictly copy number based claims) should be written as SVTYPE=CNV.” However, there is a consequence: moving dispersed DUPs to CNV requires broadening the definition of THAT category to include single-copy-gain dispersed dups.

The August 22, 2019 pdf of v4.3 uses the following definition for CNV:

“Copy number variable region (may be both deletion and duplication)”

or, as pointed out earlier in the current discussion (#465 (review)):

“Copy number variable region (multiallelic)”

A freshly-redefined ‘CNV’ would include all of the following variants:
  1. regions in which both a duplication and a deletion have been observed
  2. regions one would call “multiallelic” – that is, in which three or more alleles (reflecting discrete copy number increments) have been reported
  3. the “new kid on the block”: dispersed duplications (for now let's limit these to increases of just one copy number relative to reference)

#3, as defined above, can reasonably be called ‘biallelic’. In contrast, the very broad perception is that “SVTYPE=CNV” implies multiple alleles (in fact, gnomAD's SV VCF uses 'MCNV' instead of 'CNV' - maybe this is a change we want to consider adopting in the spec). So the ‘CNV’ definition will need to be re-written to include the new kid on the block.

@thefferon

I think it may also be important to recognize, if only conceptually, that any called insertion can represent (depending on the nature and extent of the analysis involved) a dispersed duplication or a tandem duplication. Distinguishing among these possibilities requires an analysis of the insertion’s sequence content and immediate genomic context.

@bhandsaker

bhandsaker commented Jan 28, 2020

Hi, all,

Chris Whelan brought this discussion to my attention.
I want to share some of the use cases I currently implement in Genome STRiP and other software we develop/use in our lab.

From my perspective, it would be ideal if the VCF specification allowed these kinds of representations.
In general, I also think it is useful to allow the VCF format to be flexible enough to permit tool-specific extensions in a spec-compliant VCF.

  1. SVTYPE. Although we could do this with a different tag, we currently use SVTYPE to default certain genotyping behavior.
    As one example, if SVTYPE=DEL, then we will never assign INFO:CN > ploidy and all genotype likelihood calculations assume a bi-allelic variant where one allele is the reference and the other a deletion allele. This is in contrast to SVTYPE=CNV, which allows INFO:CN to be greater or less than ploidy.

  2. For unphased copy number variants, we use <CNV> as the ALT allele. The code is also not picky about the ref allele, and in particular N is allowed (regardless of the actual reference base at POS). We use the representation, in particular, when there is uncertainty about the true breakpoint (for example, if there are segmental duplications) or when the code does not want to make any assertion about the breakpoint location.

  3. For phased copy number variants (or for "partitioned" but not phased variants), we use the notation
    <CN:n> to represent the allelic copy number.
    For example, something like this (tabs changed to spaces to condense):

#CHROM POS  ID REF    ALT           QUAL FILTER INFO FORMAT SAMP1 SAMP2 SAMP3 SAMP4
chr1   1000 .  <CN:1> <CN:0>,<CN:2> .    .      .    GT:CN  0|0:2 0|1:1 2|2:4 1|2:2

Older versions of the code used <CNn> (with no colon) as the allele representation, but this caused difficulties because strict interpretation of the specification said that all ALT alleles had to be defined in the header and this in turn required knowing in advance all of the alleles that might be encountered or writing the output file in two passes. The idea behind <CN:n> is that the allele type/template is defined once in the header, but n is variable.

  4. For complex structures, such as the C4 locus (Sekar, 2016; Kamitaki, 2020 (under review)) and other examples, we represent the structural haplotypes as particular alleles with encodings of the structure representing the key biology. For example, at C4, we previously defined ALT alleles like <H_n_n_n_n_x> for example <H_2_1_1_1_B>. The encoding is described in context (or in the description field).
    For C4, the encoding is the (allelic) copy number of total C4, C4A, C4B, the HERV element and an optional character suffix identifying a particular haplotype (among structurally equivalent, recurrent haplotypes that appear to have arisen independently in humans).
    As one additional example, we might label an allele as <H_3_1_2_3_insCT> representing a haplotype that carries 1 copy of C4AL, two copies of C4BL, and a common frame shift variant of potential phenotypic importance.

    While there are other representations one could choose, these were convenient, reasonably human readable (at least for us) and worked well with other tools, such as beagle, without requiring modifications to the downstream tools.
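
The `<CN:n>` notation from item 3 above can be read mechanically. The following Python sketch shows one way to recover a sample's total copy number from REF, ALT and GT. The function names are hypothetical, and the code assumes every allele in the record uses the `<CN:n>` form and that no genotype allele is missing (`.`).

```python
import re

# Matches symbolic alleles of the form <CN:n> described in item 3.
CN_ALLELE = re.compile(r"^<CN:(\d+)>$")


def allele_copy_number(allele):
    """Return n for an allele written as <CN:n>."""
    match = CN_ALLELE.match(allele)
    if match is None:
        raise ValueError(f"not a <CN:n> allele: {allele!r}")
    return int(match.group(1))


def total_copy_number(ref, alt, gt):
    """Sum the allelic copy numbers selected by a GT string such as '1|2'."""
    alleles = [ref] + alt.split(",")
    indices = re.split(r"[|/]", gt)
    return sum(allele_copy_number(alleles[int(i)]) for i in indices)


# The four samples from the example record above.
for gt in ("0|0", "0|1", "2|2", "1|2"):
    print(gt, total_copy_number("<CN:1>", "<CN:0>,<CN:2>", gt))
```

For the example record this reproduces the FORMAT/CN values shown for SAMP1 to SAMP4 (2, 1, 4 and 2).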

I would also like to say that while I don't expect such representations to be standardized to the point of interoperability or diagnostic use, I do think it is useful to allow the VCF format to have some flexibility, enough to permit experimentation and innovation within the standard through the use of certain conventions, such as allele naming conventions like I describe above, which might only be understood by certain tools.

At the same time, I think it is good if it is easy for tools encountering an unrecognized allele format to avoid making any special assumptions about it.

If VCF does not have this flexibility, then I think the alternative will be to use other non-standard file formats, which I think leads to more file conversion, more friction trying to get tools to work together, more potential for errors, etc.

@tskir
Member

tskir commented Feb 12, 2020

@cwhelan I can see your point. How about this then?

Consensus proposal, version 2

  • For breakpoint claims, top level of the symbolic structural allele type must be one of {<DEL>, <INS>, <DUP> (only tandem), <INV>, <BND>}. In this case, SVTYPE must match the top level of the symbolic structural allele exactly.
  • In case breakpoints are unknown or not reported, symbolic allele must be <CNV>. In this case, SVTYPE must be one of exactly three types:
    • SVTYPE=DEL — copy number decrease compared to the reference;
    • SVTYPE=DUP — copy number increase compared to the reference (in this case the implied duplication is not necessarily tandem);
    • SVTYPE=CNV — multiallelic copy number region of both increase and decrease of copy number compared to the reference.

It's pretty much already how all of this is supposed to work, but my point is that it needs to be explicitly and very carefully worded in the specification.
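
As a rough illustration of the version 2 rules above, the check could look like the Python sketch below. The function names are hypothetical, only symbolic ALT alleles such as `<DUP:TANDEM>` are handled (not breakend-notation ALT strings), and the sketch does not try to verify that a `<DUP>` is actually tandem.

```python
# Top-level symbolic alleles that make a breakpoint claim in proposal v2.
BREAKPOINT_ALLELES = {"DEL", "INS", "DUP", "INV", "BND"}
# SVTYPE values allowed when the ALT allele is <CNV>.
CNV_SVTYPES = {"DEL", "DUP", "CNV"}


def top_level(alt):
    """Return the top level of a symbolic allele: '<DUP:TANDEM>' gives 'DUP'."""
    return alt.strip("<>").split(":")[0]


def is_consistent_v2(alt, svtype):
    """Check an ALT/SVTYPE pair against the version 2 proposal."""
    top = top_level(alt)
    if top == "CNV":
        # Breakpoints unknown or not reported: SVTYPE gives the CN direction.
        return svtype in CNV_SVTYPES
    if top in BREAKPOINT_ALLELES:
        # Breakpoint claim: SVTYPE must match the top level exactly.
        return svtype == top
    return False
```

Under these rules `<CNV>` with SVTYPE=DUP is accepted as a copy number gain without a breakpoint claim, while `<DEL>` with SVTYPE=CNV is rejected.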

@jmmut
Contributor

jmmut commented Feb 12, 2020

(Sorry for the confusion, I misclicked and sent the message before it was ready, so I deleted it. This is the complete message.)

After reading this thread and the related ones multiple times, and also the current spec, I think the purpose of the current spec writing (before this PR) was:

symbolic SV ALTs:

  • DEL INS DUP INV CNV: read-depth claim. E.g. see the current wording for DUP: "Region of elevated copy number relative to the reference". CNV is the general category: DEL is equivalent to CNV/CN0, INS is CNV/CN1+ of new sequence, DUP is CNV/CN2+, etc. and the specific should be preferred over the general CNV.
  • BND: breakpoint claim. Different claim than the other ALTs.

With that, I could see that a read depth claim DUP can use ALT=DUP and any SVTYPE (possibly DUP for simplicity), but SVTYPE becomes useful for a breakpoint claim where you use the breakend notation in ALT (to make a breakpoint claim) and put the SVTYPE=DUP. I'm not saying we should keep this as it is, I'm just trying to understand the history of the current writing and whether it can be clarified or needs a breaking rewrite.

This is kind of similar to the last proposal by @tskir, where a read depth claim is specified with ALT=CNV and anything else is breakpoint claim.

If we go with the meaning I just explained, from tskir's comment I can see how it may not apply to INS and INV if those cannot be identified by a read depth analysis (I'm no expert on that field), but I wonder if the main problem @d-cameron explained offline (about other callers misusing the ALT and SVTYPE fields) still applies? The callers express the evidence in ALT (BND is breakpoint claim, anything else is read depth claim; or we can change to a similar combination as tskir's one), and the interpretation is expressed in SVTYPE. Please let me know if you have seen callers that do not comply with this split.

One concern with that approach is if breakpoint claims are not a superset of read depth claims, and if it would make sense to be able to state both at the same time.

Also, tskir, how do you classify a non-tandem DUP with known location in your suggestion? With a breakend in ALT? is DUP:TANDEM unnecessary then?

@tskir
Member

tskir commented Feb 13, 2020

@jmmut You raised some very good points. I also had to re-read parts of the specification to address them.

I think the purpose of the current spec writing (before this PR) was:
symbolic SV ALTs:

  • DEL INS DUP INV CNV: read-depth claim. [...]
  • BND: breakpoint claim. Different claim than the other ALTs.

You were quite right to notice that BND is different from other symbolic allele types, and that the other types do not make breakpoint claims. However, those other types are not read depth claims either in the specification. “Read depth” refers to a specific set of methods of (imprecisely) detecting increase and decrease in segment copy numbers. Rather, the specification makes the distinction between “precise” and “imprecise” calls, regardless of the method of detection. In section 1.4.5 “Alternative allele field format”, subsection “Structural Variants” starts:

In symbolic alternate alleles for imprecise structural variants, the ID field indicates the type of structural variant...

That means that if you have an imprecise structural variant (meaning it has not been detected up to base pair resolution, using whatever method), you specify it using:

  • DEL — for any decreased copy number (for example, CN2 → CN0 and/or CN1)
  • DUP — for any increased copy number (for example, CN2 → CN3, CN4 and so on). Since the specification reserves the additional DUP:TANDEM subtype, this implies that the “regular” DUP can be any type of duplication, including dispersed, more than one additional copy inserted, different orientations to the reference, etc.
  • CNV — only to be used when there are both DELs and DUPs in a single call
  • INS — when a novel sequence is inserted (this is not necessarily related to copy number change, since the sequence is marked as novel)
  • INV — when a portion of the reference sequence is inverted (this is also not related to copy number change)

The current wording of section 1.4.5 leaves it ambiguous what to do with precise structural variants — e. g. when you know coordinates of a huge deletion up to single nucleotide resolution. Some examples use the INFO/IMPRECISE key to indicate this, although this key is not officially reserved for this purpose.

but SVTYPE becomes useful for a breakpoint claim where you use the breakend notation in ALT (to make a breakpoint claim) and put the SVTYPE=DUP

I'm not necessarily against doing it this way, but:

  • BND notation is in general used for complex rearrangements which do not necessarily fit into the five simple SV types discussed above. Hence, in most cases it would be impossible to assign a standard SVTYPE to a BND.
  • The specification doesn't currently say a word about this, and actually all breakend examples in the current specification are using SVTYPE=BND. If we decide to allow specifying other SVTYPEs for BNDs, this needs to be explicitly clarified in the specification.
  • Currently, filtering on SVTYPE=BND is the only simple way to find all BND records in the VCF. You can't filter by symbolic allele, because BND are using a specific format for the ALT allele.

Also, tskir, how do you classify a non-tandem DUP with known location in your suggestion? With a breakend in ALT? is DUP:TANDEM unnecessary then?

In light of the points you raised, I think I have an idea for a better proposal which would be much more consistent and also mostly compatible with the current specification version. I will post it shortly.

@tskir
Member

tskir commented Feb 13, 2020

Based on feedback from @cwhelan and @jmmut, I present to you:

Consensus proposal, version 3

Retain the same SV types for symbolic alleles; expand & clarify their definitions

The types currently present in the specification are just fine, but poorly defined (it took me and @jmmut a couple of days to understand their true intended meaning). Let's define them very explicitly:

  • DEL: any copy number decrease compared to the reference (for example, CN2 → CN1 and/or CN0).
  • DUP: any copy number increase compared to the reference (for example, CN2 → CN3 and/or CN4 and so on). By default this can refer to any type of duplication, including tandem or dispersed, adding one or several copies, same or different orientations to the reference.
    • DUP:TANDEM subtype: tandem duplication in the same orientation to the reference. This can also include more than one additional copy.
  • CNV — only to be used when there are both DELs and DUPs in a single call (region where increased and decreased copy number is observed).
    • For DEL, DUP and CNV specific copy number changes can be specified using INFO/CN and FORMAT/CN tags.
  • INS — when a novel sequence is inserted (this is not necessarily related to copy number change, since the sequence is marked as novel).
  • INV — when a portion of the reference sequence is inverted (this is also not related to copy number change).
  • BND — breakend notation, no changes to the current spec.
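
For the three copy-number-related types, the definitions above amount to a simple decision on the direction of change. A minimal Python sketch, assuming the reference copy number and the copy numbers observed in a call are already known (the function name `symbolic_type` is hypothetical):

```python
def symbolic_type(reference_cn, observed_cns):
    """Pick DEL, DUP or CNV from the direction of the copy number changes.

    Returns None when no observed copy number differs from the reference.
    INS, INV and BND are not copy number claims and are not handled here.
    """
    lower = any(cn < reference_cn for cn in observed_cns)
    higher = any(cn > reference_cn for cn in observed_cns)
    if lower and higher:
        return "CNV"  # both decreases and increases in a single call
    if lower:
        return "DEL"  # any decrease, e.g. CN2 to CN1 and/or CN0
    if higher:
        return "DUP"  # any increase, e.g. CN2 to CN3 and/or CN4
    return None
```

For a diploid reference (CN2), observed copy numbers of 1 and 0 give DEL, 3 and 4 give DUP, and a mix of 1 and 3 gives CNV.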

Explicitly make SV calls of all types imprecise by default. Mark precise calls using CIPOS and CIEND

Again, this already looks like the way the current specification is intended to work, it's just not clear. Let's explicitly say that by default all structural symbolic alleles denote an approximate variant location with the start/end position as best estimates.

To make a “breakpoint claim”—that is, to specify that the start and/or end of a variant are known to single base resolution—existing CIPOS and/or CIEND fields must be set to (0, 0) values. (Alternatively, if people prefer, we could set up a special value, for example CIPOS=PRECISE, for this purpose, but I think double zero works just fine.) If CIPOS and CIEND are not specified, it must be assumed that the call coordinates are not precise, but uncertainties are not available or not reported.
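
A minimal sketch of how a consumer might read this rule, assuming the (0, 0) convention proposed here rather than any behaviour mandated by the current specification. The helper names `parse_info` and `breakpoint_precision` and the three labels are invented for illustration.

```python
def parse_info(info):
    """Parse a VCF INFO string into a dict (flags map to True)."""
    fields = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def breakpoint_precision(info, key):
    """Classify CIPOS or CIEND as 'exact', 'estimated' or 'unreported'.

    Assumes the proposal's reading: a missing interval means the
    coordinates are imprecise with unknown uncertainty.
    """
    fields = parse_info(info)
    if key not in fields:
        return "unreported"
    low, high = (int(x) for x in fields[key].split(","))
    return "exact" if (low, high) == (0, 0) else "estimated"


print(breakpoint_precision("SVTYPE=DUP;CIPOS=0,0", "CIPOS"))     # exact
print(breakpoint_precision("SVTYPE=DUP;CIPOS=-56,72", "CIPOS"))  # estimated
print(breakpoint_precision("SVTYPE=DUP", "CIPOS"))               # unreported
```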

Exactly synchronise the lists of top level structural symbolic alleles and SVTYPE values

In this version of the proposal, I'm back with my suggestion to completely synchronise the allowable values of the two lists. Since there will be already a way to discern exact (breakpoint) claims from inexact (e. g. read depth) claims, there is no need to mix the terms together between the symbolic alleles and the SVTYPE.


As far as I can see, this proposal addresses all concerns raised by @cwhelan:

  • There will be a simple way to query the VCF for all duplications, whether they were discovered using breakpoints or read depth, because they will all have ALT=DUP (possibly with subtypes) and SVTYPE=DUP.
  • There is a way to filter by precise and imprecise claims, using CIPOS/CIEND, without the need to mix different types in symbolic alleles and SVTYPEs.
  • The spec will not be left in a broken state.

And the ones by @jmmut:

  • This proposal retains as much backwards compatibility with the existing specification as possible; it only uses existing fields and is not introducing any new ones.
  • It decouples breakpoint (“preciseness”) information from the actual variant type, allowing the users to specify a broad range of both precise and imprecise structural variant types.
  • As for your question about specifying a non-tandem DUP with known insertion location: this will have to be done using BNDs; however, the current specification version also does not provide a way to do this, so there is no regression here. This new variant type can be possibly introduced in the future.
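
To illustrate the querying point listed for @cwhelan above, here is a small Python sketch that selects all duplications by the top-level type of the symbolic allele, so that subtypes such as DUP:TANDEM are included. Records are modelled as plain `(alt, info)` tuples and both function names are hypothetical; a real pipeline would use a proper VCF parser.

```python
def top_level_type(alt):
    """Return the top-level type of a symbolic allele, e.g. '<DUP:TANDEM>' -> 'DUP'."""
    if alt.startswith("<") and alt.endswith(">"):
        return alt[1:-1].split(":")[0]
    return None  # not a symbolic allele


def select_by_type(records, wanted):
    """Yield the (alt, info) records whose ALT top-level type equals `wanted`."""
    for alt, info in records:
        if top_level_type(alt) == wanted:
            yield alt, info


records = [
    ("<DUP>", "SVTYPE=DUP"),                       # e.g. read depth call
    ("<DUP:TANDEM>", "SVTYPE=DUP;CIPOS=-56,72"),   # e.g. split read call
    ("<DEL>", "SVTYPE=DEL"),
]
for alt, info in select_by_type(records, "DUP"):
    print(alt, info)
```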

@d-cameron @cwhelan @jmmut Please let me know what you think.

@cwhelan

cwhelan commented Feb 13, 2020

@tskir

Your proposal is internally consistent but I think it misunderstands a couple of historical things and it moves away from the original intent of this PR, which is trying to figure out a way to distinguish between copy number and breakend-based claims so that downstream tools can figure out how to (at least partially) reconstruct the haplotype altered by the event, while not requiring every tool to write BND-formatted records (which, while complete, are extremely non-human readable and resistant to quick analyses).

Historically and at the moment, both copy number-based calling methods and those based on read pair and split read mappings have used the same SVTYPEs and ALTs, leading to difficulty in interpretation. For example, the former class of tool might detect elevated copy number of a region and create an SVTYPE=DUP, ALT=<DUP> record. Read-pair based methods would detect read pairs mapping with a signature indicative of a tandem duplication, and use the same SVTYPE and ALT.

IMPRECISE and CIPOS are used when the event is detected by read pairs or similar evidence that gives an indication of breakpoint structure, but the exact breakpoint coordinate is not known. Copy number based tools typically don't use those fields historically, since they are usually based on segmenting the genome arbitrarily into bins in which read depth is measured without any claim as to the breakend structure.

Your comment yesterday:

For breakpoint claims, top level of the symbolic structural allele type must be one of {DEL,INS, DUP (only tandem), INV, BND}. In this case, SVTYPE must match the top level of the symbolic structural allele exactly.
In case breakpoints are unknown or not reported, symbolic allele must be . In this case, SVTYPE must be one of exactly three types:
SVTYPE=DEL — copy number decrease compared to the reference;
SVTYPE=DUP — copy number increase compared to the reference (in this case the implied duplication is not necessarily tandem);
SVTYPE=CNV — multiallelic copy number region of both increase and decrease of copy number compared to the reference.

Was a pretty good summary of the change in semantics that @d-cameron is trying to introduce into the spec (with the backing of a group of us listed above who participated in calls organized by @thefferon several years ago). Rather than "In case breakpoints are unknown or not reported" I would say, "if the call makes no claim about breakpoint structure", however. In this case SVTYPE is providing the interpretation of the variant and the ALT can be used to derive breakpoint claims, as you stated earlier. I think that the changes in Daniel's PR are good and just need, as you said before, some careful and comprehensive wording and examples to flesh them out.

@tskir
Member

tskir commented Feb 14, 2020

@cwhelan

Thank you for sharing your thoughts. I drafted proposal v3 following comments by @jmmut (while still taking into account your feedback on v1), and it seems to me that it better captures the original intent of the specification, introduces fewer potentially breaking changes, and is overall better and more consistent than v2. I am not pushing for it, and if we have to go with v2 I will reluctantly agree, but let me try and defend v3 a bit.

I totally get the original intent of this PR and the necessity to separate different types of calls. Here's how v2 and v3 approach this using different means:

Duplications (not necessarily tandem)

| Detection method | Breakpoint uncertainty | Proposal v2 | Proposal v3 |
|---|---|---|---|
| 1. Inferred from read depth | ✖ Unknown or not reported | ALT=<CNV>; SVTYPE=DUP | ALT=<DUP>; SVTYPE=DUP |
| 2. Inferred from split read mapping | ✖ Unknown or not reported | ALT=<DUP>; SVTYPE=DUP; CIPOS=???; CIEND=??? | ALT=<DUP>; SVTYPE=DUP |
| 3. Inferred from read depth | Estimated e. g. using bin coordinates ★ | ALT=<CNV>; SVTYPE=DUP; CIPOS=(-1042,5173); CIEND=(-8324,2775) | ALT=<DUP>; SVTYPE=DUP; CIPOS=(-1042,5173); CIEND=(-8324,2775) |
| 4. Inferred from split read mapping | Estimated using analysis of adjacent reads | ALT=<DUP>; SVTYPE=DUP; CIPOS=(-56,72); CIEND=(-114,83) | ALT=<DUP>; SVTYPE=DUP; CIPOS=(-56,72); CIEND=(-114,83) |
| 5. Observed directly by long read sequencing | ✔ Known exactly | ALT=<DUP>; SVTYPE=DUP; CIPOS=???; CIEND=??? | ALT=<DUP>; SVTYPE=DUP; CIPOS=(0,0); CIEND=(0,0) |

You mentioned that tools based on read depth don't usually use CIPOS and CIEND because of how their algorithms work. However, it is still possible to estimate uncertainties in those kinds of approaches. For instance, if a copy number increase is detected as starting in bin #N and ending in bin #M, I think it makes sense to use the start and end coordinates of those boundary bins to infer CIPOS and CIEND. It makes a difference whether the bins were 500 nt or 1,000,000 nt in size.
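
The bin-based estimate described above could be sketched as follows. This is only one possible reading: it assumes 1-based inclusive bin coordinates and that the reported position lies inside its boundary bin, and the function name `confidence_interval` is made up for the example.

```python
def confidence_interval(position, bin_start, bin_end):
    """Return (lower, upper) offsets of a bin's boundaries relative to `position`.

    Assumption: the true breakpoint lies somewhere in the boundary bin,
    so the bin edges are used as the limits of CIPOS or CIEND.
    """
    if not bin_start <= position <= bin_end:
        raise ValueError("position must lie inside the boundary bin")
    return (bin_start - position, bin_end - position)


# Start bin #N spans 10001..11000, end bin #M spans 50001..51000;
# the midpoints used as best estimates here are illustrative values.
cipos = confidence_interval(10500, 10001, 11000)
ciend = confidence_interval(50500, 50001, 51000)
print("CIPOS=%d,%d" % cipos)  # CIPOS=-499,500
print("CIEND=%d,%d" % ciend)  # CIEND=-499,500
```

With 1,000,000 nt bins the same calculation would give intervals roughly a thousand times wider, which is exactly the information a downstream user would want to see.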

For comparison: CNV region (both deletions and duplications present)

| Detection method | Breakpoint uncertainty | Proposal v2 | Proposal v3 |
|---|---|---|---|
| 6. Region of CN increases and decreases inferred from read depth | ✖ Unknown or not reported | <CNV>; SVTYPE=CNV | <CNV>; SVTYPE=CNV |

From what I see, v3 has the following advantages over v2:

  • Since v2 makes DUPs (and other SV types except for CNV) “breakpoint claims” by default, what does it mean if we specify a DUP at a given position and do not specify CIPOS and CIEND? Does it mean a claim for exact breakpoint coordinates (case 5) or claim for inexact coordinates with unknown uncertainty (case 2)? In contrast, v3 creates a clear hierarchy for all SV calls:
    • No CIPOS/CIEND → imprecise, uncertainties not estimated
    • Nonzero CIPOS/CIEND → imprecise, uncertainties given
    • Zero CIPOS/CIEND → exact call
  • Symbolic allele types and SVTYPEs are perfectly synchronised in v3. No problems with DUP having different meanings in two different contexts.
  • CNV only appears in situations where increase and decrease of copy number are mixed together in a single call (case 6), as intended by the current specification wording.

If I understand correctly, your primary concern with v3 is that case 1 is indistinguishable from case 2, as well as case 3 from case 4. But I think that this is exactly the point, as they represent the same events inferred by different methods. By making CNV represent read depth claims in v2, we are tying that symbolic allele type to a specific detection method. I think it is much better when the calls themselves are completely agnostic to the detection technology, as implemented in v3 (and again, as appears to have been intended by the existing specification wording).

Having said that, I realise that in many cases there is a need to know how the call was produced—I 100% support that. But I think this is much better done by e. g. introducing a specific (possibly even non-optional or highly encouraged) INFO field with a list of standardised detection types, e. g.:

  • READ_DEPTH
  • SPLIT_READ
  • LONG_READ_SEQUENCING
  • (The list can be periodically amended—new CNV detection types do not come up all that often)

I believe someone mentioned this possibility during one of the recent calls (can't remember who unfortunately), but did not press this further. I think this is the best approach, as it encodes useful additional information in an INFO field rather than cramming it into the symbolic allele type.

@d-cameron @cwhelan @jmmut Please do let me know what you think.

@d-cameron
Contributor Author

Please do let me know what you think.

My concerns with proposal 3 are:

  1. Redundancy and complexity

P3 requires 3 fields when we can disambiguate with 2. It still does not clarify what alt=, svtype= actually means.

Symbolic allele types and SVTYPEs are perfectly synchronised in v3. No problems with DUP having different meanings in two different contexts.

This problem still exists in p3 - it just requires yet another field to determine what their meaning is.

Exactly synchronise the lists of top level structural symbolic alleles and SVTYPE values

It sounds very much like the difference between p2 and p3 is whether we reuse SVTYPE for disambiguation or deprecate SVTYPE and use a new field.

Less complex is always better than more complex when we're dealing with specifications. I don't think the utility of this additional field justifies the additional complexity.

  2. New field vocabulary

The vocabulary must be either open or closed, and both have issues. An open vocabulary would be required because new technologies will be introduced in the future, but to interpret variants unambiguously the vocabulary needs to be closed. Realistically, new technology callers will just lie about their evidence and choose whatever pre-defined value corresponds to the claim being made. E.g. using your closed vocab, a microarray caller will just write READ_DEPTH, and a read-pair based caller will just write SPLIT_READ. This entirely defeats the purpose of having a tech type.

  3. Technology agnostic

I think it is much better when the calls themselves are completely agnostic to the detection technology.

I agree with this. VCF is currently technology agnostic and it should remain so. It shouldn't matter whether read depth or microarray probe intensity is used to define a CNV, merely that a CNV claim is being made. My issue with p3 is that it's not tech agnostic. You need to look up the tech source field to determine what claim is being made when a VCF has a DUP record.


what does it mean if we specify a DUP at a given position and do not specify CIPOS and CIEND?

I have always interpreted an unspecified confidence interval as implicitly indicating an interval of [0,0], and my caller only writes CIPOS when the confidence interval is non-zero.

@d-cameron
Contributor Author

d-cameron commented Feb 25, 2020

If I understand correctly, your primary concern with v3 is that case 1 is indistinguishable from case 2, as well as case 3 from case 4.

Case 1 and 2 being indistinguishable is exactly my primary concern. Yes, a simple tandem duplication looks the same in both case 1 and case 2, but case 1 and 2 are making different claims about the structure of the genome.

For a genome with segments ABC:

  • case 1 is claiming the end of B is connected to the start of B.
  • case 2 is claiming there are 2 copies of B

But I think that this is exactly the point, as they represent the same events inferred by different methods.

These are not claiming the same thing. An actual simple tandem duplication requires that a) both of these claims are true, b) no other claims interfere with B, and c) there is reference allele support for an AB and/or AC transition.

Case 1 allows for ABBC or ABC / circular B
Case 2 allows for BABC, ABBC, ABCB, or ABC/circularB, and a whole host of more extreme interpretations such as AC/circularBB, any/all of which could be in inverted orientation.

And that's before we get into the possibility of missing or related events. case 1 could be part of a complex rearrangement and there may well be no copy number change of B at all.

To reiterate: case 1 and 2 are making very different claims about the structure of the genome. VCF needs to be able to unambiguously specify the actual claims being made.
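
The difference between the two claims can be checked mechanically on the ABC example. The sketch below is a toy model, not anything from the spec: each candidate genome is a list of linear or circular molecules written as strings of segment letters, orientation is ignored, and only four of the arrangements named above are included. All names in the code are invented for illustration.

```python
from collections import Counter

# Each candidate genome is a list of (segments, is_circular) molecules.
# Orientation is ignored to keep the sketch small.
CANDIDATES = {
    "ABBC": [("ABBC", False)],
    "ABC + circular B": [("ABC", False), ("B", True)],
    "BABC": [("BABC", False)],
    "ABCB": [("ABCB", False)],
}


def junctions(genome):
    """Return the set of (left, right) segment adjacencies in a genome."""
    found = set()
    for segments, is_circular in genome:
        found.update(zip(segments, segments[1:]))
        if is_circular:
            # Closing a circle joins the last segment back to the first.
            found.add((segments[-1], segments[0]))
    return found


def copy_number(genome, segment):
    """Count how many times a segment occurs across all molecules."""
    return sum(Counter(segments)[segment] for segments, _ in genome)


def satisfies_breakpoint_claim(genome):
    """Breakpoint claim: the end of B is connected to the start of B."""
    return ("B", "B") in junctions(genome)


def satisfies_copy_number_claim(genome):
    """Copy number claim: there are 2 copies of B."""
    return copy_number(genome, "B") == 2


for name, genome in CANDIDATES.items():
    print(name, satisfies_breakpoint_claim(genome), satisfies_copy_number_claim(genome))
```

In this toy model all four candidates satisfy the copy number claim, while only ABBC and ABC plus a circular B contain the B-to-B junction, which is the gap between the two claims that the comment describes.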

@tskir
Member

tskir commented Apr 27, 2020

@d-cameron After your comments I now understand your issues with proposal v3 much better, and I see the way to fix them. Let's continue the discussion on this:

1. Technology type vocabulary

I agree that a new “detection technology” field with either an open or a closed vocabulary is a bad idea, for the reasons you described. Let's ditch it. The claim type can be specified without it—see below.

2. Position uncertainties

Since this PR deals with claim types, not uncertainty specification, we can clarify and standardise CIPOS handling in a future separate PR. It has no direct effect on proposal v2 vs. v3 discussion.

3. Specifying claim types

Your example with the A/B/C genome was very informative. I now see what you mean by different “claim types”. It looks to me that we actually have the same view of the situation, we've just been using different terms for the same thing.

Case 1 and 2 being indistinguishable is exactly my primary concern. [...] These are not claiming the same thing.

The idea in my example was that cases 1 and 2 do represent and claim exactly the same thing. Note that the header for cases 1–5 is “Duplications (not necessarily tandem)”. The idea is that DUP in proposal v3 represents a general duplication, that is, increase in copy number in the genome, in whatever location and in whatever orientation. So in your terms, this corresponds to a copy number claim.

For tandem duplications, in v3 you don't use DUP, you use the subtype DUP:TANDEM (see it defined here as a sub-bullet point in proposal v3). So that would represent a breakpoint claim.

4. Redundancy and complexity of proposal v3

P3 requires 3 fields when we can disambiguate with 2. It still does not clarify what alt=, svtype= actually means.

Without the technology vocabulary, the (updated) proposal v3 will also use only the two existing fields, ALT and SVTYPE.

It will clarify the meaning of specific types and subtypes of structural variants in the way I described above: for example, with DUP making a general copy number claim, a DUP:TANDEM a specific version of a breakpoint claim, etc.

Also, if we are allowed to be bold and deprecate SVTYPE, since it does become redundant in this updated version of proposal v3, we can specify all information about the claim type using just one field, the symbolic allele type/subtype. I think it would be the most elegant solution of all, utilising only the minimal number of existing fields, and having decent backwards compatibility.

@d-cameron @jmmut @cwhelan Please let me know if you have any additional feedback.

@yfarjoun yfarjoun added this to the VCF v4.4 milestone May 18, 2020
@d-cameron
Contributor Author

d-cameron commented May 18, 2020

Design goals for this PR are:

  1. Retain as much backwards compatibility with VCFv4.2 as possible

  2. Be able to unambiguously determine if a call is making a CN claim, a breakpoint claim, or both.

Unfortunately, 1 and 2 are incompatible so we have to break something. Realistically, we're not going to have a large number of pre-4.4 VCFs around for a very long time. As much as I'd like to say that all breakpoint-based SV callers should report everything in BND notation, that's not going to happen.

For tandem duplications, in v3 you don't use DUP, you use the subtype DUP:TANDEM (see it defined here as a sub-bullet point in proposal v3). So that would represent a breakpoint claim.

If DUP is a CN claim, DUP:TANDEM is a breakpoint claim, then how do you make a claim of an actual tandem duplication (ie CN + breakpoint)?

I know it's very late in this discussion, but I'll just put this out as an option for discussion:

Proposal 3a: explicit specification of support type

Proposal 3 has the fundamental problem that it fails design goal 2 (e.g. the DUP example above). This can be resolved with an additional SVCLAIM field with allowable values of CN, BP, and CNBP (not wedded to these names, feel free to propose something different) that explicitly specifies the type of claim this record is making. This approach fully satisfies the design goals but it does result in additional specification complexity (although in some senses, it's the simplest solution). It doesn't resolve the many ways an event can be inconsistent (e.g. SVTYPE=DEL ALT=<DUP>) and even creates more (e.g. ALT=N[chr:pos[, SVCLAIM=CN) but this just requires verbiage in the specs and more validator rules. It's a trade-off worth considering.
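
The kind of validator rule mentioned above might look like the following Python sketch. It covers only the two inconsistencies named in the comment, uses the provisional CN/BP/CNBP names, and assumes that SVTYPE must equal the top level of a symbolic ALT; none of this is settled spec text, and `check_record` is a hypothetical function. Single breakends are not handled.

```python
import re

ALLOWED_CLAIMS = {"CN", "BP", "CNBP"}  # provisional names from the proposal
BREAKEND_ALT = re.compile(r"[\[\]]")   # paired breakend ALTs contain '[' or ']'


def check_record(alt, svtype, svclaim):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    if svclaim not in ALLOWED_CLAIMS:
        problems.append(f"unknown SVCLAIM {svclaim!r}")
    if BREAKEND_ALT.search(alt):
        # A breakend describes an adjacency, so a copy-number-only claim
        # contradicts it (e.g. ALT=N[chr:pos[ with SVCLAIM=CN).
        if svclaim == "CN":
            problems.append("breakend ALT cannot make a copy-number-only claim")
    elif alt.startswith("<") and alt.endswith(">"):
        # Assumption: SVTYPE has to match the top level of the symbolic ALT.
        top_level = alt[1:-1].split(":")[0]
        if svtype is not None and svtype != top_level:
            problems.append(f"SVTYPE={svtype} does not match ALT={alt}")
    return problems


print(check_record("N[chr1:1000[", "BND", "CN"))
print(check_record("<DUP>", "DEL", "BP"))
print(check_record("<DUP:TANDEM>", "DUP", "CNBP"))  # []
```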

Practically, this approach has a few advantages. One that I particularly like is that downstream v4.4-aware SV event classifier tools will already be dealing with the current ALT/SVTYPE mess, and it'd be relatively simple to specify the claim type of a pre-4.4 VCF out of band. E.g. classify_events --svinput manta.vcf --cninput ascat.vcf, or classify_events --input manta.vcf --svclaim BP. Most existing VCFs contain either all CN or all BP claims. It's only DBs such as dbVar that have a mix of both.
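
A sketch of the out-of-band idea, modelled on the second command line above. `classify_events` is the hypothetical tool from the comment, not an existing program, and the fallback logic in `effective_claim` is an assumption about how such a tool could treat pre-4.4 records that carry no SVCLAIM.

```python
import argparse


def build_parser():
    """Build a parser mirroring the hypothetical `classify_events` options."""
    parser = argparse.ArgumentParser(prog="classify_events")
    parser.add_argument("--input", required=True, help="VCF file to classify")
    parser.add_argument("--svclaim", choices=["CN", "BP", "CNBP"], default=None,
                        help="claim assumed for records without SVCLAIM")
    return parser


def effective_claim(info, fallback):
    """Return the record's own SVCLAIM, else the out-of-band fallback."""
    claim = info.get("SVCLAIM", fallback)
    if claim is None:
        raise ValueError("no SVCLAIM in record and no --svclaim fallback given")
    return claim


args = build_parser().parse_args(["--input", "manta.vcf", "--svclaim", "BP"])
print(effective_claim({"SVTYPE": "DUP"}, args.svclaim))                    # BP
print(effective_claim({"SVTYPE": "DUP", "SVCLAIM": "CN"}, args.svclaim))   # CN
```

A record that states its own claim wins over the command-line default, so mixed files would still be interpreted per record.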

@jmmut
Contributor

jmmut commented May 18, 2020

If DUP is a CN claim, DUP:TANDEM is a breakpoint claim, then how do you make a claim of an actual tandem duplication (ie CN + breakpoint)?

For me this boils down to my initial confusion, where I thought that a breakpoint claim included the CN claim. After some of your explanations I understood how they are independent, and the class of CN+breakpoint is different than only breakpoint. For this, I see how v3 as we have discussed is inherently flawed as you could only specify CN or CN+breakpoint, and if we change v3 to allow the 3 claims, it will be basically the same as the changes in this PR.

The alternative (partially) done in this PR is still not clear to me about how to identify a CN, BP or CN+BP using only ALT+SVTYPE, but we can keep trying to make it clearer. At this point I'm a bit skeptical we can make it very clear. The clearest to me would be something similar to the PR in the sense that ALT=[INS,DEL,INV,DUP] is CN+BP; ALT=CNV is CN and ALT=BND is BP; and SVTYPE is used to explain the common type (INS,DEL,INV,DUP) if ALT=CNV and ALT=BND. SVTYPE=[BND,CNV] could be used for types that are more complex than the other 4 simple types. But this is problematic if someone wants to state CN+BP using the bracket notation (I don't know if this is a real problem or if there are other possible complications). For this last reason I think I prefer the newest option:

The third idea of "SVCLAIM with allowable values of CN, BP, and CNBP" has the clarity I was hoping to get from reorganising ALT and SVTYPE, but the biggest problem I see with that is that people will stick to only ALT+SVTYPE unless we make SVCLAIM required. It looks a bit of a pain, but I think it would be justified to make SVCLAIM required for SVs. It would still make sense to clarify (and simplify?) ALT+SVTYPE, but this becomes non-critical.
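
For comparison, the ALT-only mapping considered earlier in this comment (simple types imply CN+BP, CNV implies CN, BND implies BP) can be written as a short lookup. This is just a restatement of that idea in Python under a hypothetical function name; it also shows why the bracket notation is a problem, since a BND record can never come out as CN+BP.

```python
SIMPLE_TYPES = {"INS", "DEL", "INV", "DUP"}


def claim_from_alt_type(alt_type):
    """Map a top-level ALT type to the claim it would imply under this idea."""
    if alt_type in SIMPLE_TYPES:
        return "CN+BP"
    if alt_type == "CNV":
        return "CN"
    if alt_type == "BND":
        return "BP"
    raise ValueError(f"unexpected ALT type {alt_type!r}")


for alt_type in ("DUP", "CNV", "BND"):
    print(alt_type, claim_from_alt_type(alt_type))
```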

@mbaudis

mbaudis commented May 18, 2020

DUP (CNV) and tandem duplications have no clear relation. DUP:TANDEM is NOT a subset of DUP, but a precisely described genomic alteration in contrast to an expression about a change in absolute or relative count of the number of alleles over a continuous genomic region.

A sane option is to just have DUP, DEL as the single statements for describing CNVs, to have the option for chaining them to flanking events by id (obviously, CNVs are flanked by breaks, fusions and this follows a graph approach).

There is IMO no good use case, in a VCF, to have something tagged as CNV. This tag shouldn't exist?!

Tandems, precisely defined indels should be just that. Users can decide themselves from which size on they parse them as CNVs, and how to do that.

Please keep (rather, make) it simple.

@cwhelan

cwhelan commented May 20, 2020

@mbaudis Events need to have type CNV if they are measuring total copy number of a segment and you can't confidently say that there is only one type of alt allele (DEL or DUP) at the locus, unless you have another proposal for how to represent that use case.

@d-cameron The SVCLAIM idea seems good to me at first glance. Our integrated WGS pipeline (descended from the gnomAD SV pipeline) includes events called by depth signal only, by breakpoint signal only, or by both, in one final output vcf. We already track a value roughly equivalent to SVCLAIM; standardizing it in the spec would make sense to me.

@d-cameron
Contributor Author

Good catch, I didn't notice that bit of the specs. Left-aligning is the wrong choice for SVs because a) it can't be done for +/+ or -/- oriented breakpoints (left alignment of one side causes right alignment of the other), and b) for RP-based callers, left-alignment frequently results in a called position that is the least likely of the positions in the CIPOS interval.

SVs should be either centre-aligned or aligned to the most likely position.
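
Centre alignment of a single position can be sketched as below. The assumption is that the caller already has a position and a CIPOS interval and simply wants to report the middle of that interval; the function name is invented, and aligning to the most likely position would instead need caller-specific likelihoods that are not modelled here.

```python
def centre_align(pos, cipos):
    """Move `pos` to the centre of its CIPOS interval and re-express CIPOS.

    The absolute interval [pos + low, pos + high] is unchanged; only the
    reported position and the relative offsets move.
    """
    low, high = cipos
    shift = (low + high) // 2
    return pos + shift, (low - shift, high - shift)


# A left-aligned call at 1000 with ten bases of ambiguity to the right.
print(centre_align(1000, (0, 10)))  # (1005, (-5, 5))
```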

Member

@tskir tskir left a comment


@d-cameron Here's the fresh round of reviews, as promised. (Sorry it comes at the latest possible moment before the deadline!)

Once the remaining comments are addressed, I think we can go ahead and merge this, and then tweak as necessary to align with subsequent PRs (such as the SVCLAIM one etc.)

I've also resolved several remaining discussions and extracted them into separate issues to keep the scope of this PR reasonable.

@@ -173,30 +173,49 @@ \subsubsection{Individual format field format}
Possible Types for FORMAT fields are: Integer, Float, Character, and String (this field is otherwise defined precisely as the INFO field).

Member


Since the scope for this is going to be VCF 4.4, could you please rebase these changes against VCFv4.4.draft.tex?

\begin{tabular}{l l}
DEL & Deletion relative to the reference \\
INS & Insertion relative to the reference \\
DUP & Tandem duplication relative to the reference \\
Member

@tskir tskir Feb 22, 2021


Suggested change
DUP & Tandem duplication relative to the reference \\
DUP & Duplication relative to the reference. This refers to any quantitative increase of the number of alleles compared to the reference genome, without indication about the physical location of the additional copies \\

As discussed, DUP is going to be just a generic “duplication” and the details will be provided in a separate SVCLAIM field #517.


I miss that "duplication" refers to "any quantitative increase of the number of alleles compared to the reference genome, without indication about the physical location of the additional copies". I.e. tandems can be a subset, as can be e.g. extrachromosomal amplifications.

Member


Agreed. That's an excellent phrasing as well, I've added it to the suggested change now

Comment on lines +212 to +214
\noindent Variants should be written using the most precise type that can be determined by the variant caller. For example, if the insertion site of a new copy of a LINE1 element cannot be determined, it would be specified as a DUP of the originating LINE1 element. However if the new insertion site can be identified, the variant should be specified as INS:ME:LINE1 at the insertion site.\newline

\noindent Note that the DUP type is restricted to simple tandem duplications. More complex duplications should be specified using BND notation.\newline
Member


Suggested change
\noindent Variants should be written using the most precise type that can be determined by the variant caller. For example, if the insertion site of a new copy of a LINE1 element cannot be determined, it would be specified as a DUP of the originating LINE1 element. However if the new insertion site can be identified, the variant should be specified as INS:ME:LINE1 at the insertion site.\newline
\noindent Note that the DUP type is restricted to simple tandem duplications. More complex duplications should be specified using BND notation.\newline
\noindent Variants should be written using the most precise type that can be determined by the variant caller.\newline

Looks like these examples contradict the approaches discussed in #517, so I suggest we remove them in this PR and then add more up-to-date examples as applicable.

\item BND: Breakend
\item DEL: Deletion relative to the reference
\item INS: Insertion relative to the reference
\item DUP: Tandem duplication relative to the reference
Member

@tskir tskir Feb 22, 2021


Suggested change
\item DUP: Tandem duplication relative to the reference
\item DUP: Duplication relative to the reference

Again, looks like this no longer applies and is superseded by #517.

@d-cameron
Contributor Author

#553 can we just kill SVTYPE for v4.4? We have to redefine it as TYPE=A anyway so v4.4 is guaranteed to have a breaking change to SVTYPE.

As it currently stands, I think the burden of having SVTYPE outweighs the value it adds to the specifications and it should be removed. It can be entirely inferred from ALT (implementations could even expose a read-only SVTYPE in their APIs for backwards compatibility), and it's not that much greater a burden on the quick and dirty parsing grep SVTYPE=DEL | wc -l style analyses.
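
As a rough sketch of what inferring SVTYPE from ALT could look like, the Python below derives a type per allele and counts matching records, which is the equivalent of the grep one-liner above. The function names are hypothetical, the breakend detection is a simplification (brackets for paired breakends, a leading or trailing dot for single breakends), and lines are assumed to be tab-separated VCF records.

```python
def svtype_from_alt(alt):
    """Infer an SVTYPE-like value from a single ALT allele, or None."""
    if alt.startswith("<") and alt.endswith(">"):
        return alt[1:-1].split(":")[0]       # '<DUP:TANDEM>' -> 'DUP'
    if "[" in alt or "]" in alt:
        return "BND"                          # paired breakend notation
    if alt != "." and (alt.startswith(".") or alt.endswith(".")):
        return "BND"                          # single breakend notation
    return None                               # plain sequence or missing ALT


def count_svtype(vcf_lines, wanted):
    """Count records with at least one ALT allele of the wanted type."""
    total = 0
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        alts = columns[4].split(",")          # ALT is the fifth column
        if any(svtype_from_alt(alt) == wanted for alt in alts):
            total += 1
    return total


lines = [
    "##fileformat=VCFv4.4",
    "chr1\t100\t.\tN\t<DEL>\t.\tPASS\t.",
    "chr1\t200\t.\tN\t<DUP:TANDEM>\t.\tPASS\t.",
    "chr1\t300\t.\tN\tN[chr2:500[\t.\tPASS\t.",
]
print(count_svtype(lines, "DEL"))  # 1
```

An API could expose the result of `svtype_from_alt` as the read-only SVTYPE attribute suggested in the comment.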

@tskir
Member

tskir commented Apr 19, 2021

@d-cameron Personally, I lean mostly in favour of killing SVTYPE, the reason being precisely its complete redundancy. Implementations exposing a read-only attribute sounds like a great idea as well.

However, I really think we should separate the changes to avoid the PRs blocking each other. Do you think that, for the time being, we could rebaseline and merge this PR as it is, and then get back to SVTYPE discussions in #553?

@d-cameron
Contributor Author

Current issues:

There's not really much left in this PR except for clarifications of reserved SV symbolic ALT alleles, and some additional examples. Should I move these into their own PRs and address there?

@tskir
Member

tskir commented Jul 12, 2021

@d-cameron Regarding this:

There's not really much left in this PR except for clarifications of reserved SV symbolic ALT alleles, and some additional examples. Should I move these into their own PRs and address there?

Yes, I think moving each of those changes into a separate PR would work best.
