New Restrictions on Contig Names not Strict Enough #167
Comments
For what it's worth, I support the first option, and I would even strengthen that to support disallowing other "annoying" characters (annoying for command-line parsing purposes) such as:
In fact, I would support disallowing all non-alphanumerics except for a select few:
I know that this is incompatible with SAM, and we should find a way to deal with that.
I think that before making anything more restrictive it's worth doing a survey of what's actually being used. E.g. disallowing anything that's in the broadly used hs38DH (including
I am, of course, well aware of the : and * in hs38DH, and would love for there to be a solution that would not cause a lot of pain. I guess I was venting some frustration about that.
Disallow \ , "`' ()[]{}<> punctuation characters in reference sequence names. Commas and angle brackets are used to delimit refnames in other SAM fields (e.g. SA) and in VCF files, and restricting these other characters facilitates future delimiter and quoting syntax. Statistics gathered from various reference sequence archives suggest that these characters appear vanishingly infrequently in refnames in existing files in the wild. Fixes the SAM aspects of samtools#124, samtools#167, samtools#258, and samtools#291. Add appendix describing parsing `name:beg-end` when name allows colons: pseudocode description of algorithm to detect ambiguous input, as proposed in a comment on samtools#124; suggest also accepting an alternative `{name}:beg-end` delimited notation. Add previously omitted SQ-AN history note.
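The ambiguity that the appendix addresses can be illustrated with a minimal Python sketch. This is not the pseudocode from the spec appendix; it is only an approximation of the idea, assuming the caller already has the set of reference names from the file header. The function names `interpretations` and `parse_region` are hypothetical.

```python
import re

# A range is "beg" or "beg-end", decimal digits only.
_RANGE = re.compile(r"^([0-9]+)(?:-([0-9]+))?$")


def _span(text):
    """Parse 'beg' or 'beg-end'; return (beg, end) or None if malformed."""
    m = _RANGE.match(text)
    if not m:
        return None
    return int(m.group(1)), int(m.group(2)) if m.group(2) else None


def interpretations(region, known_names):
    """List every way `region` can be read given the known reference names."""
    found = []
    # Reading 1: the whole string is itself a reference name.
    if region in known_names:
        found.append((region, None, None))
    # Reading 2: split at the LAST colon into name and range.
    name, colon, tail = region.rpartition(":")
    if colon and name in known_names:
        span = _span(tail)
        if span is not None:
            found.append((name, span[0], span[1]))
    return found


def parse_region(region, known_names):
    """Return (name, beg, end), or raise ValueError if unknown or ambiguous."""
    if region.startswith("{"):
        # Delimited form {name}:beg-end; braces cannot occur inside a name.
        name, brace, rest = region[1:].partition("}")
        if not brace or name not in known_names:
            raise ValueError(f"bad delimited region: {region!r}")
        if rest == "":
            return (name, None, None)
        span = _span(rest[1:]) if rest.startswith(":") else None
        if span is None:
            raise ValueError(f"bad range in region: {region!r}")
        return (name, span[0], span[1])
    found = interpretations(region, known_names)
    if len(found) != 1:
        raise ValueError(f"unknown or ambiguous region: {region!r}")
    return found[0]
```

With reference names `HLA` and `HLA:1` both present, the plain string `HLA:1` has two readings and is rejected, while `{HLA}:1` and `{HLA:1}` each select exactly one.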
Disallow \ , "`' (){} punctuation characters in VCF contig IDs. The characters []<> were already disallowed in VCF; this also relaxes the prohibition of * to merely disallowing initial *. Statistics gathered from various reference sequence archives suggest that the characters restricted appear vanishingly infrequently in SAM reference sequence names in existing files in the wild. To the extent that all contig IDs in VCF files come from corresponding SAM/BAM files, this means there is little concern about making the same restrictions in VCF contig IDs. Fixes samtools#124 and fixes samtools#167 for VCF; their SAM aspects were previously fixed by PR samtools#333.
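As a rough illustration of the restriction described in this commit message, the following Python sketch checks a contig ID against the listed characters. It is not the normative rule from the spec. Two points are assumptions taken from the discussion in this issue rather than from the commit text: that IDs are limited to printable ASCII, and that an initial `=` is rejected as in SAM reference names. The name `is_valid_contig_id` is hypothetical.

```python
# Characters the commit message lists as disallowed anywhere in a contig ID:
# backslash, comma, the three quote characters, and all bracket pairs.
FORBIDDEN = set("\\,\"`'(){}[]<>")


def is_valid_contig_id(name):
    """Return True if `name` passes the character restrictions sketched here."""
    if not name:
        return False
    # An initial '*' is disallowed per the commit message; rejecting an
    # initial '=' as well is an ASSUMPTION carried over from SAM names.
    if name[0] in "*=":
        return False
    # ASSUMPTION: only printable ASCII ('!' through '~') is permitted.
    return all("!" <= ch <= "~" and ch not in FORBIDDEN for ch in name)
```

Under these rules a name such as `HLA-A*01:01:01:01` is still accepted, while `ctg1,length=81195210` is rejected because of the comma.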
…llow colons) (#379)

* Allow colons in VCF Contig IDs: breakend notation is unambiguous

  Breakend notation always includes a ":pos" part, so breakends are unambiguous even if the "chr" in "chr:pos" also itself contains colons. As this is a relaxation of the previous rules, there is no concern about altering all three 4.1/4.2/4.3 specs. Fixes the VCF/colon aspects of #124. Fixes #258. Closes #291.

* Restrict allowed VCF Contig ID chars to those allowed in SAM RNAMEs

  Disallow \ , "`' (){} punctuation characters in VCF contig IDs. The characters []<> were already disallowed in VCF; this also relaxes the prohibition of * to merely disallowing initial *. Statistics gathered from various reference sequence archives suggest that the characters restricted appear vanishingly infrequently in SAM reference sequence names in existing files in the wild. To the extent that all contig IDs in VCF files come from corresponding SAM/BAM files, this means there is little concern about making the same restrictions in VCF contig IDs. Fixes #124 and fixes #167 for VCF; their SAM aspects were previously fixed by PR #333.
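The argument that breakends stay unambiguous can be sketched in Python: because the ":pos" part is always present, splitting at the last colon recovers the contig even when its name contains colons. This is an illustrative sketch only, not a full breakend parser; the helper names `split_locus` and `mate_of` are hypothetical, and it relies on square brackets being disallowed in contig IDs.

```python
import re

# The mate locus in a breakend ALT sits between two square brackets,
# which cannot appear inside a contig ID.
_BND = re.compile(r"[\[\]]([^\[\]]+)[\[\]]")


def split_locus(locus):
    """Split 'chr:pos' at the LAST colon, so colons inside chr are kept."""
    chrom, colon, pos = locus.rpartition(":")
    if not colon or not chrom or not (pos.isascii() and pos.isdigit()):
        raise ValueError(f"not a chr:pos locus: {locus!r}")
    return chrom, int(pos)


def mate_of(alt):
    """Extract (chrom, pos) of the mate from a breakend ALT string."""
    m = _BND.search(alt)
    if m is None:
        raise ValueError(f"not a breakend ALT: {alt!r}")
    return split_locus(m.group(1))
```

For example, `mate_of("G]HLA-A*01:01:500]")` yields the contig `HLA-A*01:01` and position 500, since only the final colon separates the position.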
I see that section 1.4.7 of VCFv4.3 states that contig names mostly follow the restrictions placed on them by the SAM spec, apart from a few explicitly disallowed characters. These restrictions do not seem to be enough to keep VCF files parseable. Specifically, the SAM format specifies `[!-)+-<>-~][!-~]*`, which translates into any printable character, with restrictions only on which characters may come first. We place some restrictions on this, namely specifying that the characters `<>[]:` are additionally invalid. Since the contig name has to appear in a header line, this runs the risk of making it impossible to parse a header string. An example contig header line might be:

`##contig=<ID=ctg1,length=81195210,Description="Foo">`

Unfortunately both the `=` and the `,` characters are legal according to the new restrictions, so technically the string "ctg1,length=81195210" COULD be a valid contig name, which means that parsing is ambiguous. Something needs to be done about these two characters at least. The way I see it, there are a couple of options:

1. Disallowing `=` and `,` would fix this problem at the expense of further restricting contigs allowed by the SAM spec.
2. Wrapping the contig name in `<>` or `""` characters, making the header line look something like this: `##contig=<ID=<ctg1>,length=81195210,Description="Foo">`, which would break forward compatibility and add some complexity into parsing.
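To make the ambiguity concrete, here is a small Python sketch that lists every contig ID a parser could read from such a header line if commas and equals signs were allowed inside names. It is a demonstration only, not how any real VCF parser works; `candidate_ids` is a hypothetical name, and the sketch assumes `ID` is the first key in the line.

```python
def candidate_ids(line):
    """Return every possible ID value when ',' and '=' are legal in names."""
    prefix = "##contig=<ID="
    if not (line.startswith(prefix) and line.endswith(">")):
        raise ValueError("not a ##contig header line with a leading ID key")
    rest = line[len(prefix):-1]
    parts = rest.split(",")
    # The ID could end at any comma, or run all the way to the closing '>'.
    return [",".join(parts[:i]) for i in range(1, len(parts) + 1)]


line = '##contig=<ID=ctg1,length=81195210,Description="Foo">'
for cid in candidate_ids(line):
    print(cid)
```

For the example line this prints three candidates, including `ctg1,length=81195210`, which is exactly the reading described above. Once `,` is disallowed in contig IDs, only the first candidate remains valid.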