Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolving ambiguities introduced when supporting colons in hg38 names #291

Closed
cyenyxe opened this issue Mar 1, 2018 · 16 comments
Closed
Labels

Comments

@cyenyxe
Copy link
Member

cyenyxe commented Mar 1, 2018

I have re-read the whole hg38 contig name pull request thread #258 and wanted to suggest some related, longer term ideas without polluting it.

In that thread, most people are in favor of supporting colons in contig names, and making the parsers resolve breakend ambiguities in favor of contig:position or contig:position-position.

This is a legitimate use case, but I am concerned because we have been trying to reduce the ambiguities in the spec, and this change will introduce a new one. Would people be open to slightly modify the breakend notation in future versions of the spec, for instance by replacing the colon with a dot or comma? I think these characters are not allowed in contig names according to the SAM spec, but could someone please confirm it?

The representation would change from something like:

G]HLA-A*01:01:01:01:123456]

To something like the following:

G]HLA-A*01:01:01:01.123456]
G]HLA-A*01:01:01:01,123456]
@cyenyxe cyenyxe added the vcf label Mar 1, 2018
@daviesrob
Copy link
Member

Unfortunately the SAM specification allows pretty much anything in a reference name. The first character is not allowed to be * or =. The regex for the rest is /[!-~]*/ which includes all printable non-whitespace characters in US-ASCII.

@yfarjoun
Copy link
Contributor

yfarjoun commented Mar 1, 2018 via email

@jkbonfield
Copy link
Contributor

jkbonfield commented Mar 1, 2018

As added to #193 also, comma is already a problem for us in SAM world due to SA tags and the new OA tag. Given it's currently unused in all the legal fai files we found, I'd be in favour of simply banning comma from contig names so we can get some sanity back in the world.

Edit: please could others also do a scan through all their local reference files looking for meta-characters. It's possible it's never been used here, but is in active use somewhere else.

I wonder if it's possible to trawl EBI and NCBI achives for SAM and VCF headers?

@daviesrob
Copy link
Member

I should mention my previous table of punctuation characters found in reference files:

 # 203
 % 203
 * 525
 + 1
 , 496
 - 154226
 . 1826561
 : 1577
 = 26
 _ 4961932
 | 1098333

So comma isn't too bad (all the commas came from a single badly-formatted file), but dot is not a good choice. Semicolon looks fairly safe at the moment.

It would be a very good idea if everyone with access to large collections of reference sequences did a similar analysis. We can compare results and see if these are typical or not.

@yfarjoun
Copy link
Contributor

yfarjoun commented Mar 2, 2018

Here are our results:

     12 #
    527 *
    357 ,
1451132 -
1492749 .
  86114 /
 233731 :
   2034 =
     17 @
1735713 |

I used the following:

while read file; do  
grep '@SQ' $file | cut -f2 | sed 's/SN://; s/[A-Za-z0-9]*//g; s/\(.\)/\1\n/g'  ;
done < <(find  /references | grep dict  )   | sort | uniq -c > operators_in_references

@jkbonfield
Copy link
Contributor

Thanks Yossi. Good command line hint too.

Any clue where the commas come from? We saw some, but they were a totally borked fasta file produced by someones file parsing going wrong. As it stands now we know comma will already produce problems with SAM due to SA tags.

Still no semicolon, although I'd rather not introduce a different list separator.

@nh13
Copy link
Member

nh13 commented Mar 2, 2018

@jmarshall @jkbonfield can anyone perform the same exercise on the CRAM reference registry? Or the NCBI reference sequence database? Could be fun.

@jkbonfield
Copy link
Contributor

Not us - that's an EBI internal thing.

Also the cram reference registry is accessed by md5 only and has no name associated with the sequences. However EBI have many other databases of sequence data so maybe they can screen those.

@yfarjoun
Copy link
Contributor

yfarjoun commented Mar 5, 2018

we could disallow (:|-)[0-9]* at the end of contig names. this might not break current references and so we might be able to get it into the sam-spec. It will allow non-ambiguous parsing for the purpose of intervals. a similar trick might remove the ambiguity regarding breakpoint...

@jkbonfield
Copy link
Contributor

That works, although I think it bumps the level of parser required a bit higher - it now needs backtracking by potentially many characters during tokenisation of a region. Pragmatically that may not matter if we're just coding it ad-lib rather than using a formal grammar.

@cyenyxe
Copy link
Member Author

cyenyxe commented Mar 5, 2018

@yfarjoun My regex skills may be a bit rusty, but wouldn't that make hg38 contig names illegal again? The example initially provided in #258 is:

@SQ	SN:HLA-A*01:01:01:01	LN:3503	M5:01cd0df602495b044b2c214d69a60aa2	AS:38	UR:/seq/references/Homo_sapiens_assembly38/v0/Homo_sapiens_assembly38.fasta	SP:Homo sapiens

@yfarjoun
Copy link
Contributor

yfarjoun commented Mar 5, 2018

hmmm. yes.

Though we could capitalize on the '0' in 01 and disallow the ending to be of the form
(:|-)([1-9][0-9]*)? which would allow parsing (assuming no-one looks for chr1-03:10, with a zero preceding the 3...)

@jkbonfield
Copy link
Contributor

The offending GRCh38 decoy file also has "HLA-DRB1*12:17" which breaks this idea.

Basically it's bust whatever we do unless we simply have code in the command line tools that look first a contig matching the full string and if not found then attempt to process as a region and repeat again.

@cyenyxe
Copy link
Member Author

cyenyxe commented Mar 20, 2018

These are the counts from all genome assembly sequences (contigs, scaffold, chromosomes) submitted to ENA since the current assembly submission pipeline was introduced in 2013:

#   16927
*   1
,   231
-   122563947
.   521540419
/   236951
\   0
:   30181
;   72892
=   186611
@   3713
|   949

@jkbonfield
Copy link
Contributor

jkbonfield commented Apr 5, 2018

For completeness so they're in the same spot, from Rob's data in #220

 # 203
 % 203
 * 525
 + 1
 , 496
 - 154226
 . 1826561
 : 1577
 = 26
 _ 4961932
 | 1098333

@yfarjoun above:

     12 #
    527 *
    357 ,
1451132 -
1492749 .
  86114 /
 233731 :
   2034 =
     17 @
1735713 |

@d-cameron
Copy link
Contributor

d-cameron commented Apr 8, 2018

I am concerned because we have been trying to reduce the ambiguities in the spec, and this change will introduce a new one

Just for clarity: what is the ambiguity that the proposal introduces? If we force the spec to resolve in favor of contig:position in breakends, doesn't that solve this issue for VCF? AFAIK, we don't actually have any interval (contig:position-position ) notations in the VCF specs: they're all done via INFO fields such as CIPOS.

I was under the impression that the discussions regarding contig:position-position were about the tooling more generally (e.g. specifying intervals in various tools) thus technically out of scope for the VCF specifications themselves.

jmarshall added a commit to jmarshall/hts-specs that referenced this issue Jan 11, 2019
Disallow \ , "`' ()[]{}<> punctuation characters in reference sequence
names. Commas and angle brackets are used to delimit refnames in other
SAM fields (e.g. SA) and in VCF files, and restricting these other
characters facilitates future delimiter and quoting syntax.

Statistics gathered from various reference sequence archives suggest
that these characters appear vanishingly infrequently in refnames in
existing files in the wild.

Fixes the SAM aspects of samtools#124, samtools#167, samtools#258, and samtools#291.

Add appendix describing parsing `name:beg-end` when name allows colons:
pseudocode description of algorithm to detect ambiguous input, as proposed
in a comment on samtools#124; suggest also accepting an alternative `{name}:beg-end`
delimited notation.

Add previously omitted SQ-AN history note.
jmarshall added a commit to jmarshall/hts-specs that referenced this issue Jan 15, 2019
Disallow \ , "`' ()[]{}<> punctuation characters in reference sequence
names. Commas and angle brackets are used to delimit refnames in other
SAM fields (e.g. SA) and in VCF files, and restricting these other
characters facilitates future delimiter and quoting syntax.

Statistics gathered from various reference sequence archives suggest
that these characters appear vanishingly infrequently in refnames in
existing files in the wild.

Fixes the SAM aspects of samtools#124, samtools#167, samtools#258, and samtools#291.

Add appendix describing parsing `name:beg-end` when name allows colons:
pseudocode description of algorithm to detect ambiguous input, as proposed
in a comment on samtools#124; suggest also accepting an alternative `{name}:beg-end`
delimited notation.

Add previously omitted SQ-AN history note.
jmarshall added a commit to jmarshall/hts-specs that referenced this issue Feb 5, 2019
Breakend notation always includes a ":pos" part, so breakends are
unambiguous even if the "chr" in "chr:pos" also itself contains colons.

As this is a relaxation of the previous rules, there is no concern
about altering all three 4.1/4.2/4.3 specs.

Fixes the VCF/colon aspects of samtools#124. Fixes samtools#258. Closes samtools#291.
jmarshall added a commit to jmarshall/hts-specs that referenced this issue Apr 9, 2019
Breakend notation always includes a ":pos" part, so breakends are
unambiguous even if the "chr" in "chr:pos" also itself contains colons.

As this is a relaxation of the previous rules, there is no concern
about altering all three 4.1/4.2/4.3 specs.

Fixes the VCF/colon aspects of samtools#124. Fixes samtools#258. Closes samtools#291.
jmarshall added a commit to jmarshall/hts-specs that referenced this issue May 29, 2019
Breakend notation always includes a ":pos" part, so breakends are
unambiguous even if the "chr" in "chr:pos" also itself contains colons.

As this is a relaxation of the previous rules, there is no concern
about altering all three 4.1/4.2/4.3 specs.

Fixes the VCF/colon aspects of samtools#124. Fixes samtools#258. Closes samtools#291.
jmarshall added a commit to jmarshall/hts-specs that referenced this issue Feb 6, 2020
Breakend notation always includes a ":pos" part, so breakends are
unambiguous even if the "chr" in "chr:pos" also itself contains colons.

As this is a relaxation of the previous rules, there is no concern
about altering all three 4.1/4.2/4.3 specs.

Fixes the VCF/colon aspects of samtools#124. Fixes samtools#258. Closes samtools#291.
@jmmut jmmut closed this as completed in 0ef79c0 Mar 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants