Add @SQ-AN alternative reference sequence names #220

Merged
merged 1 commit into from
Jul 27, 2017


jmarshall
Member

@jmarshall jmarshall commented Jun 1, 2017

Proposed text to implement #100, "suggestion: aliases in @SQ section".

Also take the opportunity to pre-emptively restrict the characters allowed in these alternative names.

As noted in last September's meeting, restricting the characters allowed in AN alternative names could be a first step towards applying the same restriction in SN canonical sequence names and finally being able to parse the colon in chr:beg-end unambiguously.

There is debate to be had about which punctuation characters to forbid; there are existing points for consideration in #100, #124, and #167.

The primary motivation here is to disallow : so that chr:beg-end region syntax is unambiguous. With colon excluded from the names, it is available as the delimiter, a choice inspired by the TAG:TYPE:VALUE delimiters.

Another motivation might be to disallow , so that chr1,chr2,chr3 list syntax is unambiguous. If disallowing comma has particular support, perhaps using comma as the AN delimiter would be preferable.
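
To make the chr:beg-end ambiguity concrete, here is a minimal Python sketch. The `parse_region` helper is hypothetical and uses one common heuristic (split at the last colon if the tail looks like a numeric range); it is not taken from the SAM specification or from any particular tool.

```python
import re

# Hypothetical heuristic: treat the text after the LAST colon as a range
# if it looks like "beg" or "beg-end"; otherwise treat the whole string as a name.
_RANGE = re.compile(r"^(\d+)(?:-(\d+))?$")

def parse_region(region):
    """Return (name, beg, end); beg/end are None when no range was recognised."""
    name, sep, tail = region.rpartition(":")
    if sep:
        m = _RANGE.match(tail)
        if m:
            beg = int(m.group(1))
            end = int(m.group(2)) if m.group(2) else None
            return name, beg, end
    return region, None, None

print(parse_region("chr1:100-200"))
print(parse_region("HLA-A*01:01:01:01"))
```

Under this heuristic a bare sequence name such as HLA-A*01:01:01:01 is read as the name HLA-A*01:01:01 with a start position of 1, which is the kind of misparse that forbidding colons in names would rule out.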

@daviesrob
Member

For information, I did a count of the number of punctuation characters in the entry names of our set of 292 reference fasta files. I got:

 # 203
 % 203
 * 525
 + 1
 , 496
 - 154226
 . 1826561
 : 1577
 = 26
 _ 4961932
 | 1098333

The 496 commas all come from GRCh38.rRNA.fa which contains entries like this:

>ENSG00000276700|RNA, 5.8S ribosomal 5 [Source:HGNC Symbol;Acc:HGNC:37660]|RNA5-8S5|rRNA
>gi|189571632|ref|NR_023379.1| Homo sapiens RNA, 5S ribosomal 17 (RNA5S17), ribosomal RNA
>ENSG00000199334|RNA, 5S ribosomal 11 [Source:HGNC Symbol;Acc:HGNC:34372]|RNA5S11|rRNA
>ENSG00000201966|RNA, 5.8S ribosomal pseudogene 4 [Source:HGNC Symbol;Acc:HGNC:41958]|RNA5-8SP4|rRNA

Colons are in Homo_sapiens.GRCh38_full_analysis_set_plus_decoy_hla.fa (HLA regions) ...:

>HLA-A*01:01:01:01	HLA00001 3503 bp
>HLA-A*01:01:01:02N	HLA02169 3291 bp
>HLA-A*01:01:38L	HLA03587 3374 bp
>HLA-A*01:02	HLA00002 3374 bp

... and a few other files:

Plasmodium_berghei.rRNA.fa:
>berg10:rRNA:rfamscan:938985-939103

Plasmodium_falciparum.rRNA.fa:
>MAL14_5S_2|MAL14_5S_2:rRNA

Trichuris_muris_v0_1_281111.fa:
>NODE_48597_length_3376_cov_5.922986.1.1633:00009.7180000457210-00010.7180000828236

Trichuris_muris_11_03_2015.fa:
>U_TMUE_mito:1

haemonchus_V1.fa:
>scaffold2.1_size831208:1-30808

I guess GRCh38 is the biggest problem here.
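
A tally like the one at the top of this comment could be sketched in Python as follows. This is not the script actually used for the counts above; the function name is hypothetical, and it assumes the entry name is the text after `>` up to the first whitespace, which may differ from how the original count was made.

```python
import string
from collections import Counter

def count_name_punctuation(lines):
    """Tally punctuation characters in FASTA entry names.

    Assumption: the entry name is the text after '>' up to the first whitespace.
    """
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split(None, 1)
            name = fields[0] if fields else ""
            counts.update(ch for ch in name if ch in string.punctuation)
    return counts

# Two header lines taken from the examples in this comment, plus a sequence line.
example = [
    ">HLA-A*01:01:01:01\tHLA00001 3503 bp",
    "ACGT",
    ">berg10:rRNA:rfamscan:938985-939103",
]
for ch, n in sorted(count_name_punctuation(example).items()):
    print(ch, n)
```

For real files the same function can be fed an open file handle, since it only iterates over lines.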

@jkbonfield
Contributor

Inspired by TAG:TYPE:VALUE is OK, but note we already have TAG:TYPE:VALUE,VALUE,VALUE for some tags and TAG:TYPE:VALUE;VALUE;VALUE for others when we're dealing with lists of values. Frankly, anything you do here will fit in, given the lack of consistency elsewhere!

Disallowing colon seems very logical and solves a lot of problems we've had, but I'd be inclined to still use comma as the separator. I don't have any strong feelings though so would accept it as it is or with comma instead.

@jmarshall
Member Author

Recreated using commas as a more natural list delimiter. I hear that VCF, or at least bcftools option notation, uses commas as refseq name delimiters, and, closer to home, SAM's SA tag already uses a comma to terminate a reference sequence name.
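
With commas as the delimiter, reading the alternative names from an @SQ line could look like the sketch below. The `parse_sq_line` helper and the example header line are illustrative assumptions rather than text from the proposed specification change.

```python
def parse_sq_line(line):
    """Split an @SQ header line into a tag -> value dict, with AN as a list."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "@SQ":
        raise ValueError("not an @SQ line")
    tags = {}
    for field in fields[1:]:
        # Split at the first colon only, so values may themselves contain colons.
        tag, _, value = field.partition(":")
        tags[tag] = value
    # AN is a comma-separated list; absent AN means no alternative names.
    tags["AN"] = tags["AN"].split(",") if "AN" in tags else []
    return tags

# Illustrative header line, not taken from a real file.
sq = parse_sq_line("@SQ\tSN:chr1\tLN:248956422\tAN:1,NC_000001.11")
print(sq["SN"], sq["AN"])
```

Because the comma both separates the aliases and is forbidden inside them, a plain `split(",")` is enough and no escaping is needed.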

@jmarshall
Member Author

More previous discussion of forbidding characters in #124 and samtools/samtools#149.

@jmarshall jmarshall force-pushed the sq-aliases branch 2 times, most recently from fb1bf8c to 9250f65 on July 27, 2017 16:34
jmarshall added a commit to jmarshall/hts-specs that referenced this pull request Jul 27, 2017
Enables tools to allow users to make queries with e.g. "1" or "chr1"
interchangeably.  Also allows for the possibility of tools using an alias
when displaying sequence names to the user.  Hat tip @lindenb, fixes samtools#100.

However aliases must not appear elsewhere within the SAM file, in
particular not in RNAME/RNEXT fields.  This ensures that files will
still be parsed correctly by non-@SQ-AN-aware tools.
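
As a rough illustration of the behaviour this commit message describes, the Python sketch below maps both canonical names and aliases to the canonical SN, and checks that an RNAME/RNEXT value is canonical rather than an alias. Both helpers and their input format are hypothetical, not part of any existing tool.

```python
def build_alias_map(sq_entries):
    """Map every canonical name and alias to its canonical SN.

    sq_entries: iterable of (sn, [aliases]) pairs, a hypothetical pre-parsed form.
    """
    lookup = {}
    for sn, aliases in sq_entries:
        for name in [sn, *aliases]:
            if name in lookup and lookup[name] != sn:
                raise ValueError(
                    f"name {name!r} refers to both {lookup[name]!r} and {sn!r}"
                )
            lookup[name] = sn
    return lookup

def rname_is_canonical(rname, canonical_names):
    """True if an RNAME/RNEXT value is '*', '=' or a canonical SN (never an alias)."""
    return rname in ("*", "=") or rname in canonical_names

lookup = build_alias_map([("chr1", ["1"]), ("chr2", ["2"])])
print(lookup["1"], lookup["chr1"])  # both resolve to the canonical name
print(rname_is_canonical("1", {"chr1", "chr2"}))
```

A query for "1" or "chr1" then resolves to the same reference, while alignment records still carry only canonical names, so tools that ignore AN keep working.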

3 participants