-
Notifications
You must be signed in to change notification settings - Fork 173
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Problems with allowed contig names in the SAM spec #124
Comments
@yfarjoun I can totally empathize with your problems. I've run into similar problems with other tools (a different variant caller would, internally, invoke shell commands with naked reference names, and the unquoted Unfortunately I think this ship has sailed. Fwiw I suspect your issue isn't with the main hs38 assembly, but with the hs38DH reference created by Presumably the problem you're having can be fixed by checking to see if |
I was hoping that quotes are not found in the wild in contig names and that we would be able to ban them. I agree about the ending being a suggestion/best practice. The workaround we are considering is to append ':1+' to the commandline argument. But that only solves the problem for GATK, not your problem with * |
Oh, I would also ban the backslash if possible. |
':[0-9](-[0-9])?' is an issue not only with GATK, but with any variant caller using VCF. Due to parsing breakend parsing ambiguities, the VCF specifications (v4.1) explicitly prohibits colons anywhere in contig names, although implementations generally either ignore this (since they leave such parsing to the end user), or only die when choosing the 'wrong' interpretation of the ambiguity. Greater than and less than have similar issues where could be reference contig of that name or an assembly contig from an external file whose name does not include the tags. |
I suspect that even if you get GATK working, you'll have continual trouble with downstream analysis due to your output not being a valid VCF file. |
I agree. Do you have a recommendation or are you raising this as further On Sun, Jan 17, 2016 at 6:16 PM, Daniel Cameron notifications@github.com
|
I think @tfenne is correct that we can't make the requiring changes without breaking backward compatibility. One option that I think has merit is to introduce the concept of compliance levels and elevate the meaning of section 2 (Recommended Practice for the SAM Format) so that a level 1 compliant SAM/BAM complies the basic section 1 SAM specifications, and level 2 Some of the section 2 recommended practices should only be recommended in some cases so it's important that such a distinction be made. For example, fields that can be regenerated (eg, NM, MC, R2, Q2, SA?, SEQ & QUAL for secondary alignments) are not wanted for archival purposes but can be quite important for acceptable runtime performance of variant calling and other downstream analysis. A compliance level approach allows us to retains backward compatibility but has the significant advantage that a whole host of edge cases and sanity checks can be externalised into compliance checking/fixing tools. Essentially, level 1 = parses as SAM, level 2 = actually make sense. |
@d-cameron I really like the idea of adding some kind of official compliance/strictness levels to the specification. Picard's ValidateSamFile has for a long time performed significantly more validations that one would infer from a strict reading of the spec, but is incredibly useful. It would be nice to codify some of those validations and similar best practices in the spec in a way that encourages more people to test them. |
See also samtools/samtools#149 |
Another character that seems to be banned from "all VCF strings" is the comma...so probably the comma should not be allowed in in level 2 strictness. What will happen if we make ":" not level 2 compliant, and then try to process hg38? |
Discussion on this issue in the context of the VCF spec specifically (rather than the SAM spec, which was the focus here) is happening in this ticket: #258 |
I have just reread all these contig name issues and learned a little about the HLA allele naming scheme after this hilarity and the ensuing conversation. Looking at file format specifications in isolation, and using every day as we do the interval notation chrom:start-end that's been in the UCSC genome browser and other tools since forever… from this point of view putting colons in contig names seems crazy. Elsewhere, the HLA folks have been cataloging their alleles for some decades — previously using notation like In 2010, they needed more digits, and added colons to their standard allele names. So that name became A few years later, these names start getting used in reference FASTA files and make their way into SAM files. We could say that we meant for colons to be disallowed in RNAME and invite people to use So I now agree with @tfenne that the ship has surely sailed for colons. Even if we had disallowed them from the start, HLA people were accustomed to colon-containing names and would have used them in their SAM files anyway (because why should they want/need to read the spec, and samtools doesn't check) and filed bugs against Picard's validation if it rejected them 😄 However judging by the counts in #291, we can still forbid commas, most kinds of brackets, and maybe semicolons without alienating anyone — and IMHO we should do so. This leaves a problem for tools, which would like to allow users to say chrom:start[-end] conveniently and unambiguously, and for format specifications, which need to parse contig names in fields or as parts of fields unambiguously. There's basically four ways to solve this:
For tools, intervals are specified on the command line as well as interactively on web pages or in applications. So escaping with If the tool already knows the list of valid contig names, it could parse an interval as follows:
IMHO it's important to make all these checks: I don't want my (At present, htslib and samtools do part of this: the then part of the outer This is convenient for the current set of HLA names (see https://github.com/ANHIG/IMGTHLA/blob/Latest/fasta/hla_gen.fasta), which have the property that none is a colon-delimited prefix of any of the others (i.e., there's not both e.g. So I think this heuristic is necessary for convenience in tools' user interfaces and suffices to solve the problem without introducing ambiguities or vulnerabilities (if implemented properly!). For file formats, heuristics are not an option. Formats need to be simply and unambiguously parsable. For SAM, I don't think there is a problem with colons: we aren't using a colon to delimit any contig names in subfields as part of any SAM fields. We are delimiting them with commas and perhaps semicolons in the SA tag (and the proposed OA tag), so we ought to forbid those characters in contig names before they become a problem. The counts in #291 suggest that comma at least is certainly not currently a problem. For VCF, the major concern is around breakends, which look like What real parsing problems are there in parsing the formats with colons in contig names? Certainly there are (solvable) problems in tools' user interfaces, but actually there doesn't seem to be too much of a problem for the underlying formats. (Phew!) (Commas OTOH we need to do something about.) |
Completely agree. Back in 2014 I viewed this as a tool chain issue (see samtools/samtools#149), but admittedly I didn't realise about break-ends in VCF then. As it turns out it's still not an issue there. I fully agree also that we need to nail down the unused (or rarely used) cases to limit RNAME, and it's why we gathered that data in the first place. It's already been pointed out that comma causes bugs elsewhere (eg SAM SA tag) as we use comma separated data. The only commas that @daviesrob found were related to a malformed fasta file due to some parsing bug that has put crap all over the place. If it's equally rare elsewhere (and it appears to be) then banning it is a good step. Otherwise we'll look back on this in a few years with Foo,BAR ref name and be yelling at our past selves! Quoting: agreed also. Originally I proposed =name= as starting with an equals character is forbidden (it's used in SAM mate ref-name to mean the same reference), but having seen braces aren't used yet, thank goodness, it makes for a more usual quoting syntax. Finally, I'll update the htslib PR I put in (samtools/htslib#708) to do more checks. It does most of your parsing suggestions above already bar the final case of spotting both ways of parsing matching known references. I may as well put in quoting support while I'm at it as a separate commit so we can go with it if we get nominal acceptance here. Edit: for completeness, I should also point out that the most recent samtools issue we had raised on this was resolved by pointing out that RNAME: (instead of RNAME) to refer to the entirety of the reference was accepted and works regardless of whether RNAME contains colons. It's ugly though and I prefer quoting syntax for command line tools. |
Also improved return values so NULL is primary way of detecting failure rather than tid. This is more in line with the old hts_parse_reg code. See samtools/hts-specs#124 (comment) for heuristic suggestions.
If by chance you are referring to the issue I raised in samtools/htslib#705, then I should clarify that I do not consider the issue 'resolved' by the fact that a workaround exists. In my project, samtools was (almost) silently failing and producing incomplete results. What saved me was the fact that I dug deeper and found the cause of the problem. Others may not be so lucky. |
Yes I did mean that. Sorry I didn't mean resolved in that manner (the problem still exists, hence my PR to address this in better ways), rather I meant in github terms the issue was resolved (by you) as a suitable temporary workaround while the better solution is in review. Thanks for your help in this. |
Pseudocode description of algorithm to detect ambiguous input, as proposed in a comment on samtools#124.
Pseudocode description of algorithm to detect ambiguous input, as proposed in a comment on samtools#124.
Pseudocode description of algorithm to detect ambiguous input, as proposed in a comment on samtools#124.
Disallow \ , "`' ()[]{}<> punctuation characters in reference sequence names. Commas and angle brackets are used to delimit refnames in other SAM fields (e.g. SA) and in VCF files, and restricting these other characters facilitates future delimiter and quoting syntax. Statistics gathered from various reference sequence archives suggest that these characters appear vanishingly infrequently in refnames in existing files in the wild. Fixes the SAM aspects of samtools#124, samtools#167, samtools#258, and samtools#291. Add appendix describing parsing `name:beg-end` when name allows colons: pseudocode description of algorithm to detect ambiguous input, as proposed in a comment on samtools#124; suggest also accepting an alternative `{name}:beg-end` delimited notation. Add previously omitted SQ-AN history note.
Disallow \ , "`' ()[]{}<> punctuation characters in reference sequence names. Commas and angle brackets are used to delimit refnames in other SAM fields (e.g. SA) and in VCF files, and restricting these other characters facilitates future delimiter and quoting syntax. Statistics gathered from various reference sequence archives suggest that these characters appear vanishingly infrequently in refnames in existing files in the wild. Fixes the SAM aspects of samtools#124, samtools#167, samtools#258, and samtools#291. Add appendix describing parsing `name:beg-end` when name allows colons: pseudocode description of algorithm to detect ambiguous input, as proposed in a comment on samtools#124; suggest also accepting an alternative `{name}:beg-end` delimited notation. Add previously omitted SQ-AN history note.
Breakend notation always includes a ":pos" part, so breakends are unambiguous even if the "chr" in "chr:pos" also itself contains colons. As this is a relaxation of the previous rules, there is no concern about altering all three 4.1/4.2/4.3 specs. Fixes the VCF/colon aspects of samtools#124. Fixes samtools#258. Closes samtools#291.
Disallow \ , "`' (){} punctuation characters in VCF contig IDs. The characters []<> were already disallowed in VCF; this also relaxes the prohibition of * to merely disallowing initial *. Statistics gathered from various reference sequence archives suggest that the characters restricted appear vanishingly infrequently in SAM reference sequence names in existing files in the wild. To the extent that all contig IDs in VCF files come from corresponding SAM/BAM files, this means there is little concern about making the same restrictions in VCF contig IDs. Fixes samtools#124 and fixes samtools#167 for VCF; their SAM aspects were previously fixed by PR samtools#333.
Breakend notation always includes a ":pos" part, so breakends are unambiguous even if the "chr" in "chr:pos" also itself contains colons. As this is a relaxation of the previous rules, there is no concern about altering all three 4.1/4.2/4.3 specs. Fixes the VCF/colon aspects of samtools#124. Fixes samtools#258. Closes samtools#291.
Disallow \ , "`' (){} punctuation characters in VCF contig IDs. The characters []<> were already disallowed in VCF; this also relaxes the prohibition of * to merely disallowing initial *. Statistics gathered from various reference sequence archives suggest that the characters restricted appear vanishingly infrequently in SAM reference sequence names in existing files in the wild. To the extent that all contig IDs in VCF files come from corresponding SAM/BAM files, this means there is little concern about making the same restrictions in VCF contig IDs. Fixes samtools#124 and fixes samtools#167 for VCF; their SAM aspects were previously fixed by PR samtools#333.
Breakend notation always includes a ":pos" part, so breakends are unambiguous even if the "chr" in "chr:pos" also itself contains colons. As this is a relaxation of the previous rules, there is no concern about altering all three 4.1/4.2/4.3 specs. Fixes the VCF/colon aspects of samtools#124. Fixes samtools#258. Closes samtools#291.
Disallow \ , "`' (){} punctuation characters in VCF contig IDs. The characters []<> were already disallowed in VCF; this also relaxes the prohibition of * to merely disallowing initial *. Statistics gathered from various reference sequence archives suggest that the characters restricted appear vanishingly infrequently in SAM reference sequence names in existing files in the wild. To the extent that all contig IDs in VCF files come from corresponding SAM/BAM files, this means there is little concern about making the same restrictions in VCF contig IDs. Fixes samtools#124 and fixes samtools#167 for VCF; their SAM aspects were previously fixed by PR samtools#333.
Also improved return values so NULL is primary way of detecting failure rather than tid. This is more in line with the old hts_parse_reg code. See samtools/hts-specs#124 (comment) for heuristic suggestions.
Also improved return values so NULL is primary way of detecting failure rather than tid. This is more in line with the old hts_parse_reg code. See samtools/hts-specs#124 (comment) for heuristic suggestions.
Breakend notation always includes a ":pos" part, so breakends are unambiguous even if the "chr" in "chr:pos" also itself contains colons. As this is a relaxation of the previous rules, there is no concern about altering all three 4.1/4.2/4.3 specs. Fixes the VCF/colon aspects of samtools#124. Fixes samtools#258. Closes samtools#291.
Disallow \ , "`' (){} punctuation characters in VCF contig IDs. The characters []<> were already disallowed in VCF; this also relaxes the prohibition of * to merely disallowing initial *. Statistics gathered from various reference sequence archives suggest that the characters restricted appear vanishingly infrequently in SAM reference sequence names in existing files in the wild. To the extent that all contig IDs in VCF files come from corresponding SAM/BAM files, this means there is little concern about making the same restrictions in VCF contig IDs. Fixes samtools#124 and fixes samtools#167 for VCF; their SAM aspects were previously fixed by PR samtools#333.
Search box now updates in real time to show the currently displayed location. If the view has more than one displayed region, the search box is disabled and the refName dropdown is blank. If an invalid location is entered in the search box, it triggers an error message snackbar.
We've recently come across an hg38 contig name of the form 'HLA-A*01:01:01:01' which caused problems when used on the commandline in the GATK, as the parser understood the suffix :01 to mean "at position 1" and then is complained that it cannot find HLA-A*01:01:01 (which thankfully isn't in hg38..., otherwise we would have probably not noticed this for a while. Thus the argument -L HLA-A*01:01:01:01 is not one that the GATK can handle.
Its too late for hg38 to ask for any name changes, but I think it might make sense to revisit the allowed characters and remove a few. Currently there are a two characters that are not allowed as the first character in the @sn (
*
and=
) but the remaining printable characters (!
through~
) are allowed, and they are all allowed as the following characters. I would propose that we add a few more characters to the "not allowed" list. In particular it would be good to avoid the various quotes (',
,") anywhere in the @SN and I would like to suggest that the ending should not look like a genomic position, i.e. it should **not** match the regex '
:[0-9](-[0-9])?`'. The reason I say "suggest" is that given that hg38 already includes this ending, it would be quite difficult to reverse that and even if we only put it in a future "version" of the SAM Spec it isn't clear what that means since the old references will still be around...This isn't a formal proposal...I mostly want to know what people think about this issue and see if we can come up with good solutions.
The text was updated successfully, but these errors were encountered: