Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problems with allowed contig names in the SAM spec #124

Closed
yfarjoun opened this issue Jan 16, 2016 · 15 comments
Closed

Problems with allowed contig names in the SAM spec #124

yfarjoun opened this issue Jan 16, 2016 · 15 comments

Comments

@yfarjoun
Copy link
Contributor

We've recently come across an hg38 contig name of the form 'HLA-A*01:01:01:01' which caused problems when used on the commandline in the GATK, as the parser understood the suffix :01 to mean "at position 1" and then is complained that it cannot find HLA-A*01:01:01 (which thankfully isn't in hg38..., otherwise we would have probably not noticed this for a while. Thus the argument -L HLA-A*01:01:01:01 is not one that the GATK can handle.

Its too late for hg38 to ask for any name changes, but I think it might make sense to revisit the allowed characters and remove a few. Currently there are a two characters that are not allowed as the first character in the @sn (* and =) but the remaining printable characters (! through ~) are allowed, and they are all allowed as the following characters. I would propose that we add a few more characters to the "not allowed" list. In particular it would be good to avoid the various quotes (',,") anywhere in the @SN and I would like to suggest that the ending should not look like a genomic position, i.e. it should **not** match the regex ':[0-9](-[0-9])?`'. The reason I say "suggest" is that given that hg38 already includes this ending, it would be quite difficult to reverse that and even if we only put it in a future "version" of the SAM Spec it isn't clear what that means since the old references will still be around...

This isn't a formal proposal...I mostly want to know what people think about this issue and see if we can come up with good solutions.

@tfenne
Copy link
Member

tfenne commented Jan 16, 2016

@yfarjoun I can totally empathize with your problems. I've run into similar problems with other tools (a different variant caller would, internally, invoke shell commands with naked reference names, and the unquoted * caused problems there).

Unfortunately I think this ship has sailed. Fwiw I suspect your issue isn't with the main hs38 assembly, but with the hs38DH reference created by bwa.kit. Any change that would prohibit any new characters in reference names, or prohibit particular suffixes, would necessarily be backwards incompatible. And that's likely to lead to tools, moving forward, not supporting previously valid reference names, and leaving a lot of data stranded and in need of re-writing. I'd suggest that the best we can do is to add a best practice not to name sequences like this.

Presumably the problem you're having can be fixed by checking to see if HLA-A*01:01:01:01 or HLA-A*01:01:01 or HLA-A*01:01 is a valid reference name before choosing how to deconstruct the string? You're only really in trouble if two or more of them are valid reference names.

@yfarjoun
Copy link
Contributor Author

I was hoping that quotes are not found in the wild in contig names and that we would be able to ban them. I agree about the ending being a suggestion/best practice.

The workaround we are considering is to append ':1+' to the commandline argument. But that only solves the problem for GATK, not your problem with *

@yfarjoun
Copy link
Contributor Author

Oh, I would also ban the backslash if possible.

@d-cameron
Copy link
Contributor

':[0-9](-[0-9])?' is an issue not only with GATK, but with any variant caller using VCF. Due to parsing breakend parsing ambiguities, the VCF specifications (v4.1) explicitly prohibits colons anywhere in contig names, although implementations generally either ignore this (since they leave such parsing to the end user), or only die when choosing the 'wrong' interpretation of the ambiguity. Greater than and less than have similar issues where could be reference contig of that name or an assembly contig from an external file whose name does not include the tags.

@d-cameron
Copy link
Contributor

I suspect that even if you get GATK working, you'll have continual trouble with downstream analysis due to your output not being a valid VCF file.

@yfarjoun
Copy link
Contributor Author

I agree. Do you have a recommendation or are you raising this as further
evidence that the spec should change?

On Sun, Jan 17, 2016 at 6:16 PM, Daniel Cameron notifications@github.com
wrote:

I suspect that even if you get GATK working, you'll have continual trouble
with downstream analysis due to your output not being a valid VCF file.


Reply to this email directly or view it on GitHub
#124 (comment).

@d-cameron
Copy link
Contributor

I think @tfenne is correct that we can't make the requiring changes without breaking backward compatibility. One option that I think has merit is to introduce the concept of compliance levels and elevate the meaning of section 2 (Recommended Practice for the SAM Format) so that a level 1 compliant SAM/BAM complies the basic section 1 SAM specifications, and level 2
level 2: complies with section 2 recommended practice as well as additional new restrictions such as VCF-compliant QNAMES, CIGARs not having dangling I/D/N operators, fields (eg mate coordinate) and predefined tags (eg NM, MC, R2, Q2, TC, ...) actually have the correct value, read pairs are actually paired and have matching information, RNAME is unique, QNAMEs all match an SG header, FLAG fields make sense, predefined tags are of the correct type, ...

Some of the section 2 recommended practices should only be recommended in some cases so it's important that such a distinction be made. For example, fields that can be regenerated (eg, NM, MC, R2, Q2, SA?, SEQ & QUAL for secondary alignments) are not wanted for archival purposes but can be quite important for acceptable runtime performance of variant calling and other downstream analysis.

A compliance level approach allows us to retains backward compatibility but has the significant advantage that a whole host of edge cases and sanity checks can be externalised into compliance checking/fixing tools. Essentially, level 1 = parses as SAM, level 2 = actually make sense.

@tfenne
Copy link
Member

tfenne commented Jan 19, 2016

@d-cameron I really like the idea of adding some kind of official compliance/strictness levels to the specification. Picard's ValidateSamFile has for a long time performed significantly more validations that one would infer from a strict reading of the spec, but is incredibly useful. It would be nice to codify some of those validations and similar best practices in the spec in a way that encourages more people to test them.

@jkbonfield
Copy link
Contributor

See also samtools/samtools#149
I'm not convinced all those ideas are sane though! What was I thinking? :)

@yfarjoun
Copy link
Contributor Author

Another character that seems to be banned from "all VCF strings" is the comma...so probably the comma should not be allowed in in level 2 strictness.

What will happen if we make ":" not level 2 compliant, and then try to process hg38?

@droazen
Copy link

droazen commented Oct 25, 2017

Discussion on this issue in the context of the VCF spec specifically (rather than the SAM spec, which was the focus here) is happening in this ticket: #258

@droazen droazen changed the title Problems with allowed contig names. Problems with allowed contig names in the SAM spec Oct 25, 2017
@jmarshall
Copy link
Member

I have just reread all these contig name issues and learned a little about the HLA allele naming scheme after this hilarity and the ensuing conversation.

Looking at file format specifications in isolation, and using every day as we do the interval notation chrom:start-end that's been in the UCSC genome browser and other tools since forever… from this point of view putting colons in contig names seems crazy.

Elsewhere, the HLA folks have been cataloging their alleles for some decades — previously using notation like A*01010101, in which all the 01s are two-digit subfields in some biologically-meaningful hierarchy. And obviously there wouldn't be more than a hundred allele categories at each level 😄.

In 2010, they needed more digits, and added colons to their standard allele names. So that name became A*01:01:01:01. And other names like A*02:101 appeared (so heuristics about leading zeros won't help us). Presumably they had a big conference where they decided that using colons was the least-worst option.

A few years later, these names start getting used in reference FASTA files and make their way into SAM files. We could say that we meant for colons to be disallowed in RNAME and invite people to use _ or / or something instead in their HLA contigs, but I don't see any reason why they would listen to us: then the names in their SAM files would differ from in all their other literature, and that would suck.

So I now agree with @tfenne that the ship has surely sailed for colons. Even if we had disallowed them from the start, HLA people were accustomed to colon-containing names and would have used them in their SAM files anyway (because why should they want/need to read the spec, and samtools doesn't check) and filed bugs against Picard's validation if it rejected them 😄

However judging by the counts in #291, we can still forbid commas, most kinds of brackets, and maybe semicolons without alienating anyone — and IMHO we should do so.


This leaves a problem for tools, which would like to allow users to say chrom:start[-end] conveniently and unambiguously, and for format specifications, which need to parse contig names in fields or as parts of fields unambiguously. There's basically four ways to solve this:

  1. Heuristics
  2. Escaping
  3. Quoting
  4. Extra or distinctive trailing delimiters

For tools, intervals are specified on the command line as well as interactively on web pages or in applications. So escaping with \ or quoting with "' is a non-starter, as they will be eaten by the shell (or need to be double-escaped, which is not convenient). "Quoting" with brackets would be a possible help: e.g. the notation could be chrom:start-end or {chrom}:start-end with the optional braces required when the contig name contains a colon. Or a trailing colon in chrom: could indicate all of chrom explicitly (none of the HLA contigs have trailings colons, and we could forbid that). But this still leaves A*02:101 ambiguous — to solve that requires heuristics. And ideally we wouldn't have to introduce any new syntax like those {} or trailing :.

If the tool already knows the list of valid contig names, it could parse an interval as follows:

if the string foo:bar has a plausible interval suffix (i.e. bar is NUM or NUM-NUM)

if both foo and foo:bar are existing contigs then …error it's ambiguous
else if foo is an existing contig then return (foo, NUMNUM)
else if foo:bar is an existing contig then return (foo:bar, whole contig)
else …error unknown contig

else …either there's no colon or bar is not plausibly numeric

if the whole string str is an existing contig then return (str, whole contig)
else …error unknown contig

IMHO it's important to make all these checks: I don't want my chr1:1000-2000 command line argument meaning something else because a malicious BAM file contains a contig named chr1:1000-2000 as well as the chr1 I'm looking for!

(At present, htslib and samtools do part of this: the then part of the outer if is just return (foo, NUM, NUM). I think this means they do the right thing already for HLA-A*30:14L but not for HLA-A*30:14, which needs the existing-contig-name heuristics.)

This is convenient for the current set of HLA names (see https://github.com/ANHIG/IMGTHLA/blob/Latest/fasta/hla_gen.fasta), which have the property that none is a colon-delimited prefix of any of the others (i.e., there's not both e.g. A*01:01:01 and A*01:01:01:01). If the set of names didn't have that property, then the ambiguous error would trigger from time to time and you would need extra syntax (like the {} or trailing :) to disambiguate in these cases.

So I think this heuristic is necessary for convenience in tools' user interfaces and suffices to solve the problem without introducing ambiguities or vulnerabilities (if implemented properly!).


For file formats, heuristics are not an option. Formats need to be simply and unambiguously parsable.

For SAM, I don't think there is a problem with colons: we aren't using a colon to delimit any contig names in subfields as part of any SAM fields. We are delimiting them with commas and perhaps semicolons in the SA tag (and the proposed OA tag), so we ought to forbid those characters in contig names before they become a problem. The counts in #291 suggest that comma at least is certainly not currently a problem.

For VCF, the major concern is around breakends, which look like ]chrom:pos] or [chrom:pos[. As @d-cameron noted on #258 there is no actual ambiguity here — as the :pos part is always present. While a chrom containing colons means you can't use the simple left-to-right parsing of this, it's not much harder to read to the final bracket and then back up to the rightmost colon.

What real parsing problems are there in parsing the formats with colons in contig names? Certainly there are (solvable) problems in tools' user interfaces, but actually there doesn't seem to be too much of a problem for the underlying formats. (Phew!) (Commas OTOH we need to do something about.)

@jkbonfield
Copy link
Contributor

jkbonfield commented Jun 1, 2018

Completely agree. Back in 2014 I viewed this as a tool chain issue (see samtools/samtools#149), but admittedly I didn't realise about break-ends in VCF then. As it turns out it's still not an issue there.

I fully agree also that we need to nail down the unused (or rarely used) cases to limit RNAME, and it's why we gathered that data in the first place. It's already been pointed out that comma causes bugs elsewhere (eg SAM SA tag) as we use comma separated data. The only commas that @daviesrob found were related to a malformed fasta file due to some parsing bug that has put crap all over the place. If it's equally rare elsewhere (and it appears to be) then banning it is a good step. Otherwise we'll look back on this in a few years with Foo,BAR ref name and be yelling at our past selves!

Quoting: agreed also. Originally I proposed =name= as starting with an equals character is forbidden (it's used in SAM mate ref-name to mean the same reference), but having seen braces aren't used yet, thank goodness, it makes for a more usual quoting syntax.

Finally, I'll update the htslib PR I put in (samtools/htslib#708) to do more checks. It does most of your parsing suggestions above already bar the final case of spotting both ways of parsing matching known references. I may as well put in quoting support while I'm at it as a separate commit so we can go with it if we get nominal acceptance here.

Edit: for completeness, I should also point out that the most recent samtools issue we had raised on this was resolved by pointing out that RNAME: (instead of RNAME) to refer to the entirety of the reference was accepted and works regardless of whether RNAME contains colons. It's ugly though and I prefer quoting syntax for command line tools.

jkbonfield added a commit to jkbonfield/htslib that referenced this issue Jun 1, 2018
Also improved return values so NULL is primary way of detecting
failure rather than tid.  This is more in line with the old
hts_parse_reg code.

See samtools/hts-specs#124 (comment)
for heuristic suggestions.
@Falcury
Copy link

Falcury commented Jun 2, 2018

Edit: for completeness, I should also point out that the most recent samtools issue we had raised on this was resolved by pointing out that RNAME: (instead of RNAME) to refer to the entirety of the reference was accepted and works regardless of whether RNAME contains colons.

If by chance you are referring to the issue I raised in samtools/htslib#705, then I should clarify that I do not consider the issue 'resolved' by the fact that a workaround exists. In my project, samtools was (almost) silently failing and producing incomplete results. What saved me was the fact that I dug deeper and found the cause of the problem. Others may not be so lucky.

@jkbonfield
Copy link
Contributor

jkbonfield commented Jun 4, 2018

Yes I did mean that. Sorry I didn't mean resolved in that manner (the problem still exists, hence my PR to address this in better ways), rather I meant in github terms the issue was resolved (by you) as a suitable temporary workaround while the better solution is in review.

Thanks for your help in this.

jmarshall added a commit to jmarshall/hts-specs that referenced this issue Sep 19, 2018
Pseudocode description of algorithm to detect ambiguous input,
as proposed in a comment on samtools#124.
jmarshall added a commit to jmarshall/hts-specs that referenced this issue Sep 19, 2018
Pseudocode description of algorithm to detect ambiguous input,
as proposed in a comment on samtools#124.
jmarshall added a commit to jmarshall/hts-specs that referenced this issue Sep 20, 2018
Pseudocode description of algorithm to detect ambiguous input,
as proposed in a comment on samtools#124.
jmarshall added a commit to jmarshall/hts-specs that referenced this issue Jan 11, 2019
Disallow \ , "`' ()[]{}<> punctuation characters in reference sequence
names. Commas and angle brackets are used to delimit refnames in other
SAM fields (e.g. SA) and in VCF files, and restricting these other
characters facilitates future delimiter and quoting syntax.

Statistics gathered from various reference sequence archives suggest
that these characters appear vanishingly infrequently in refnames in
existing files in the wild.

Fixes the SAM aspects of samtools#124, samtools#167, samtools#258, and samtools#291.

Add appendix describing parsing `name:beg-end` when name allows colons:
pseudocode description of algorithm to detect ambiguous input, as proposed
in a comment on samtools#124; suggest also accepting an alternative `{name}:beg-end`
delimited notation.

Add previously omitted SQ-AN history note.
jmarshall added a commit to jmarshall/hts-specs that referenced this issue Jan 15, 2019
Disallow \ , "`' ()[]{}<> punctuation characters in reference sequence
names. Commas and angle brackets are used to delimit refnames in other
SAM fields (e.g. SA) and in VCF files, and restricting these other
characters facilitates future delimiter and quoting syntax.

Statistics gathered from various reference sequence archives suggest
that these characters appear vanishingly infrequently in refnames in
existing files in the wild.

Fixes the SAM aspects of samtools#124, samtools#167, samtools#258, and samtools#291.

Add appendix describing parsing `name:beg-end` when name allows colons:
pseudocode description of algorithm to detect ambiguous input, as proposed
in a comment on samtools#124; suggest also accepting an alternative `{name}:beg-end`
delimited notation.

Add previously omitted SQ-AN history note.
jmarshall added a commit to jmarshall/hts-specs that referenced this issue Feb 5, 2019
Breakend notation always includes a ":pos" part, so breakends are
unambiguous even if the "chr" in "chr:pos" also itself contains colons.

As this is a relaxation of the previous rules, there is no concern
about altering all three 4.1/4.2/4.3 specs.

Fixes the VCF/colon aspects of samtools#124. Fixes samtools#258. Closes samtools#291.
jmarshall added a commit to jmarshall/hts-specs that referenced this issue Feb 5, 2019
Disallow \ , "`' (){} punctuation characters in VCF contig IDs.
The characters []<> were already disallowed in VCF; this also relaxes
the prohibition of * to merely disallowing initial *.

Statistics gathered from various reference sequence archives suggest
that the characters restricted appear vanishingly infrequently in SAM
reference sequence names in existing files in the wild. To the extent
that all contig IDs in VCF files come from corresponding SAM/BAM files,
this means there is little concern about making the same restrictions
in VCF contig IDs.

Fixes samtools#124 and fixes samtools#167 for VCF; their SAM aspects were previously
fixed by PR samtools#333.
jmarshall added a commit to jmarshall/hts-specs that referenced this issue Apr 9, 2019
Breakend notation always includes a ":pos" part, so breakends are
unambiguous even if the "chr" in "chr:pos" also itself contains colons.

As this is a relaxation of the previous rules, there is no concern
about altering all three 4.1/4.2/4.3 specs.

Fixes the VCF/colon aspects of samtools#124. Fixes samtools#258. Closes samtools#291.
jmarshall added a commit to jmarshall/hts-specs that referenced this issue Apr 9, 2019
Disallow \ , "`' (){} punctuation characters in VCF contig IDs.
The characters []<> were already disallowed in VCF; this also relaxes
the prohibition of * to merely disallowing initial *.

Statistics gathered from various reference sequence archives suggest
that the characters restricted appear vanishingly infrequently in SAM
reference sequence names in existing files in the wild. To the extent
that all contig IDs in VCF files come from corresponding SAM/BAM files,
this means there is little concern about making the same restrictions
in VCF contig IDs.

Fixes samtools#124 and fixes samtools#167 for VCF; their SAM aspects were previously
fixed by PR samtools#333.
jmarshall added a commit to jmarshall/hts-specs that referenced this issue May 29, 2019
Breakend notation always includes a ":pos" part, so breakends are
unambiguous even if the "chr" in "chr:pos" also itself contains colons.

As this is a relaxation of the previous rules, there is no concern
about altering all three 4.1/4.2/4.3 specs.

Fixes the VCF/colon aspects of samtools#124. Fixes samtools#258. Closes samtools#291.
jmarshall added a commit to jmarshall/hts-specs that referenced this issue May 29, 2019
Disallow \ , "`' (){} punctuation characters in VCF contig IDs.
The characters []<> were already disallowed in VCF; this also relaxes
the prohibition of * to merely disallowing initial *.

Statistics gathered from various reference sequence archives suggest
that the characters restricted appear vanishingly infrequently in SAM
reference sequence names in existing files in the wild. To the extent
that all contig IDs in VCF files come from corresponding SAM/BAM files,
this means there is little concern about making the same restrictions
in VCF contig IDs.

Fixes samtools#124 and fixes samtools#167 for VCF; their SAM aspects were previously
fixed by PR samtools#333.
jkbonfield added a commit to samtools/htslib that referenced this issue May 29, 2019
Also improved return values so NULL is primary way of detecting
failure rather than tid.  This is more in line with the old
hts_parse_reg code.

See samtools/hts-specs#124 (comment)
for heuristic suggestions.
daviesrob pushed a commit to jkbonfield/htslib that referenced this issue May 31, 2019
Also improved return values so NULL is primary way of detecting
failure rather than tid.  This is more in line with the old
hts_parse_reg code.

See samtools/hts-specs#124 (comment)
for heuristic suggestions.
jmarshall added a commit to jmarshall/hts-specs that referenced this issue Feb 6, 2020
Breakend notation always includes a ":pos" part, so breakends are
unambiguous even if the "chr" in "chr:pos" also itself contains colons.

As this is a relaxation of the previous rules, there is no concern
about altering all three 4.1/4.2/4.3 specs.

Fixes the VCF/colon aspects of samtools#124. Fixes samtools#258. Closes samtools#291.
jmarshall added a commit to jmarshall/hts-specs that referenced this issue Feb 6, 2020
Disallow \ , "`' (){} punctuation characters in VCF contig IDs.
The characters []<> were already disallowed in VCF; this also relaxes
the prohibition of * to merely disallowing initial *.

Statistics gathered from various reference sequence archives suggest
that the characters restricted appear vanishingly infrequently in SAM
reference sequence names in existing files in the wild. To the extent
that all contig IDs in VCF files come from corresponding SAM/BAM files,
this means there is little concern about making the same restrictions
in VCF contig IDs.

Fixes samtools#124 and fixes samtools#167 for VCF; their SAM aspects were previously
fixed by PR samtools#333.
@jmmut jmmut closed this as completed in 0ef79c0 Mar 3, 2020
cmdcolin referenced this issue in GMOD/jbrowse-components Mar 20, 2020
Search box now updates in real time to show the currently displayed
location. If the view has more than one displayed region, the search box
is disabled and the refName dropdown is blank. If an invalid location is
entered in the search box, it triggers an error message snackbar.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants