Add @SQ-AN alternative reference sequence names #220
For information, I did a count of the number of punctuation characters in the entry names of our set of 292 reference fasta files. I got:
The 496 commas all come from
Colons are in
... and a few other files:
I guess GRCh38 is the biggest problem here.
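For anyone wanting to repeat this kind of count on their own references, here is a minimal Python sketch. It is not the script used for the numbers above; the function name `punctuation_counts` and the sample headers are illustrative assumptions. It treats the entry name as the first whitespace-delimited word after `>` and ignores the description.

```python
import string
from collections import Counter


def punctuation_counts(fasta_lines):
    """Count punctuation characters in FASTA entry names (hypothetical helper).

    Only the first whitespace-delimited word after '>' is treated as the
    entry name; the rest of the header line is description and is ignored.
    """
    counts = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split(None, 1)
        if fields:
            counts.update(ch for ch in fields[0] if ch in string.punctuation)
    return counts


# Illustrative input only, not taken from the 292 files mentioned above.
example = [
    ">chr1 description, with a comma",
    "ACGT",
    ">HLA-A*01:01:01:01 HLA allele",
    ">chrUn_KI270302v1",
]
counts = punctuation_counts(example)
print(sorted(counts.items()))
```

Passing an open file handle instead of a list works the same way, since the function only iterates over lines.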
"Inspired by TAG:TYPE:VALUE" is OK, but note that for lists of values we already have TAG:TYPE:VALUE,VALUE,VALUE for some tags and TAG:TYPE:VALUE;VALUE;VALUE for others. Frankly, anything you do here will fit in, given the lack of consistency elsewhere! Disallowing colon seems very logical and solves a lot of problems we've had, but I'd be inclined to still use comma as the separator. I don't have any strong feelings though, so I would accept it as it is or with comma instead.
Recreated using commas as a more natural list delimiter. I hear there is either VCF or at least bcftools option notation using commas as refseq name delimiters, and, closer to home, SAM's SA tag already uses a comma to terminate a reference sequence name.
More previous discussion of forbidding characters in #124 and samtools/samtools#149.
fb1bf8c to 9250f65
Enables tools to allow users to make queries with e.g. "1" or "chr1" interchangeably. Also allows for the possibility of tools using an alias when displaying sequence names to the user. Hat tip @lindenb, fixes samtools#100. However aliases must not appear elsewhere within the SAM file, in particular not in RNAME/RNEXT fields. This ensures that files will still be parsed correctly by non-@SQ-AN-aware tools.
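As a rough illustration of the intent (not part of the proposed spec text), a tool could build a lookup from every canonical name and alias to the canonical `SN`. This sketch assumes the comma-separated `AN` list from the recreated version of this PR. The function name `build_alias_map` and the header lines are hypothetical.

```python
def build_alias_map(header_lines):
    """Map each canonical SN and each AN alias to its canonical SN.

    Assumes AN is a comma-separated list (an assumption based on the
    recreated version of this PR). The first mapping seen for a name wins.
    """
    alias_to_sn = {}
    for line in header_lines:
        if not line.startswith("@SQ\t"):
            continue  # only @SQ lines carry SN/AN
        fields = dict(f.split(":", 1) for f in line.rstrip("\n").split("\t")[1:])
        sn = fields["SN"]
        alias_to_sn.setdefault(sn, sn)
        for alias in filter(None, fields.get("AN", "").split(",")):
            alias_to_sn.setdefault(alias, sn)
    return alias_to_sn


# Illustrative header, not taken from any particular file.
header = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:248956422\tAN:1",
    "@SQ\tSN:chrM\tLN:16569\tAN:MT,M",
]
aliases = build_alias_map(header)
print(aliases["1"], aliases["MT"], aliases["chr1"])
```

A query for "1" or "chr1" then resolves to the same canonical name, while RNAME/RNEXT continue to hold only `SN` values, so tools that ignore `AN` are unaffected.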
Proposed text to implement #100, "suggestion: aliases in `@SQ` section".
Also take the opportunity to pre-emptively restrict the characters allowed in these alternative names. As noted in last September's meeting, restricting the characters allowed in `AN` alternative names could be a first step towards applying the same restriction in `SN` canonical sequence names and finally being able to parse the colon in `chr:beg-end` unambiguously.
There is debate to be had about which punctuation characters to forbid; there are existing points for consideration in #100, #124, and #167.
The primary motivation here is to disallow `:` so `chr:beg-end` region syntax is unambiguous, so we have taken advantage of colon being available as the delimiter, inspired by `TAG:TYPE:VALUE` delimiters.
Other motivation might be to disallow `,` so `chr1,chr2,chr3` list syntax is unambiguous. If disallowing comma has particular support, perhaps using comma as the `AN` delimiter would be preferable.
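To make the colon ambiguity concrete, here is a small Python sketch that lists every way a region string can be read against a set of known sequence names. The helper `region_interpretations` and the names used are hypothetical, chosen only to show the problem; this is not how any existing tool parses regions.

```python
import re

# Matches "beg" or "beg-end" after the final colon of a region string.
_RANGE = re.compile(r"^(\d+)(?:-(\d+))?$")


def region_interpretations(region, known_names):
    """Return all (name, beg, end) readings of a region string (sketch)."""
    results = []
    # Reading 1: the whole string is a sequence name.
    if region in known_names:
        results.append((region, None, None))
    # Reading 2: the text after the last colon is a coordinate range.
    name, sep, tail = region.rpartition(":")
    match = _RANGE.match(tail) if sep else None
    if match and name in known_names:
        beg = int(match.group(1))
        end = int(match.group(2)) if match.group(2) else None
        results.append((name, beg, end))
    return results


# Hypothetical names: one is a colon-extended prefix of the other.
names = {"chr1", "HLA-A*01:01", "HLA-A*01:01:01"}
plain = region_interpretations("chr1:100-200", names)
ambiguous = region_interpretations("HLA-A*01:01:01", names)
print(plain)
print(ambiguous)
```

With colon-free names the first query has exactly one reading, whereas the second can mean either the whole sequence `HLA-A*01:01:01` or position 1 onwards of `HLA-A*01:01`. If colons were forbidden in names, only the second branch would ever be needed.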