Harmonise identifier (e.g. RNAME/Contig, Sample) rules across formats and protocols #5

Open
jmarshall opened this issue Feb 3, 2020 · 9 comments

Comments

@jmarshall
Member

jmarshall commented Feb 3, 2020

There are a number of short identifier-sized pieces of metadata that are used across many GA4GH products. For example:

  • Reference sequence names

    In SAM/BAM/CRAM, this is the @SQ-SN header field and RNAME/RNEXT/etc fields.
    In VCF/BCF, it's the ##contig ID.
    In htsget, it's referenceName.
    In refget, it may be returned as an alias.

  • Sample identifiers

    In SAM/BAM/CRAM, this is the @RG-SM header field.
    In VCF/BCF, it's the ##SAMPLE ID and it also appears on the #CHROM header line.
    In htsget, it forms the bulk of the path part of request URLs, and there is a proposal to encode samples in the query part as well (htsget: add samples query parameter, principally to select subset of VCF columns samtools/hts-specs#430).
    In Phenopackets, it's a Biosample's id field.

These items of metadata are embedded within the surrounding text using various delimiters in these various formats and protocols. So there are various restrictions on what characters may appear in them so as to avoid conflicting with the delimiter characters or otherwise requiring complicated escaping or encoding mechanisms. It would be good to harmonise these restrictions across GA4GH products, so that a value that was e.g. a valid Sample identifier in one product could be assumed to also be valid in other products.

@jmarshall jmarshall changed the title Harmonise identifier (e.g. RNAME/Contig; Sample) rules across formats and protocols Harmonise identifier (e.g. RNAME/Contig, Sample) rules across formats and protocols Feb 3, 2020
@jmarshall
Member Author

jmarshall commented Feb 3, 2020

For reference sequence names, SAM and VCF (and hence BAM, CRAM, and BCF) have a very specific regular expression that disallows whitespace, backslashes, commas, various quotation marks, and brackets, and also = or * as the first character (see samtools/hts-specs#333 and samtools/hts-specs#379):

[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*

In htsget and refget, these appear as double-quoted JSON strings. So the SAM regex values would fit in these formats without additional escaping or syntax. OTOH neither htsget nor refget suggests that these values are anything other than arbitrary strings.
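As a rough illustration of the regular expression above, here is a minimal Python sketch that checks candidate reference sequence names against it and confirms that a conforming name serialises to JSON without any escaping. The function name `is_valid_rname` and the example names are illustrative assumptions, not part of any GA4GH specification or library.

```python
import json
import re

# The SAM/VCF reference sequence name pattern quoted above.
# '=' and '*' are permitted anywhere except as the first character.
RNAME_RE = re.compile(
    r"[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*"
)


def is_valid_rname(name: str) -> bool:
    """Return True if the whole string matches the pattern (hypothetical helper)."""
    return RNAME_RE.fullmatch(name) is not None


def json_needs_escaping(name: str) -> bool:
    """Return True if JSON serialisation changes the string beyond adding quotes."""
    return json.dumps(name) != '"' + name + '"'


if __name__ == "__main__":
    for candidate in ["chr1", "HLA-A*01:01:01:01", "*", "=chr1", "chr 1", "chr1,chr2"]:
        print(repr(candidate), is_valid_rname(candidate))
```

Names that pass this check contain no double quotes, backslashes, or control characters, which is why they can be dropped into a JSON string as-is.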

@jmarshall
Member Author

For sample identifiers, both SAM and VCF disallow tabs, and VCF currently de facto disallows commas. It is proposed (samtools/hts-specs#414) to be explicit about VCF's commas.

Htsget does not currently tie this down in any way, though it would probably be impractical to use a ? character and whitespace would be semi-impractical. One proposal for the query part would also disallow the comma character.
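To make the constraints mentioned so far concrete, the following Python sketch reports which of them a given sample identifier would run into. It is only an illustration of the points above: the function name `sample_id_problems` is hypothetical, and the htsget entries reflect practical concerns rather than anything the htsget specification currently states.

```python
def sample_id_problems(sample_id: str) -> list[str]:
    """List the constraints discussed above that this identifier would violate."""
    problems = []
    if "\t" in sample_id:
        problems.append("tab: disallowed by SAM and VCF")
    if "," in sample_id:
        problems.append("comma: de facto disallowed by VCF")
    if "?" in sample_id:
        problems.append("'?': impractical in an htsget URL path")
    # Tabs are already reported above, so only flag other whitespace here.
    if any(ch.isspace() for ch in sample_id if ch != "\t"):
        problems.append("whitespace: awkward in an htsget URL path")
    return problems


if __name__ == "__main__":
    for candidate in ["NA12878", "tumour,normal", "sample 1?"]:
        print(repr(candidate), sample_id_problems(candidate))
```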

In phenopackets, this appears to be described only as an “arbitrary identifier”.

It might be good to try to restrict the character set further (beyond <tab> and <comma>) to free up more punctuation characters for convenient use as format delimiters and on tools' command lines, but this would require analysis of what constitutes a sample name in the wild at present. For reference sequence names we collected fairly extensive statistics (see samtools/hts-specs#333 (comment)), but I am unaware of that having been done for sample name identifiers.
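The kind of analysis suggested here could start from something as simple as the Python sketch below, which counts how many sample names in a collection use each non-alphanumeric character. The function name `punctuation_usage` and the sample names are made up for illustration; this is not the method used for the reference sequence name statistics in samtools/hts-specs#333.

```python
from collections import Counter


def punctuation_usage(sample_names):
    """Count, per non-alphanumeric character, the number of names that use it."""
    counts = Counter()
    for name in sample_names:
        # set() ensures each character is counted at most once per name.
        for ch in set(name):
            if not ch.isalnum():
                counts[ch] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample names, not drawn from any real dataset.
    names = ["NA12878", "HG002_run-1", "tumour-01", "a.b.c"]
    for ch, n in punctuation_usage(names).most_common():
        print(repr(ch), n)
```

Characters that never appear in such a tally over real-world data would be the safest candidates to reserve as delimiters.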

@rrfreimuth
Collaborator

I think GA4GH needs a common set of core data types, with identifier being one of those. The types should be based on existing standards (e.g., ISO) so we don't reinvent the wheel, and they should be as technology/language-agnostic as possible to support implementations in a variety of systems.

Just my 2 cents. Looking forward to the discussion.

@mamanambiya mamanambiya pinned this issue Apr 21, 2020
@jmarshall
Member Author

@cdvoisin: Thanks, that's interesting. It's a bit different from the very particular individual classes of metadata item that this issue is trying to focus on, as they inherit the existing defined syntax of URLs.

@rrfreimuth: Those are good general principles. But as was hopefully clear in the discussion, this issue is intended to be about specific items of metadata individually and e.g. is just referring to this group of particular items collectively as “identifiers”.

@jmarshall
Member Author

The next step on this IMHO is to start with (say) reference sequence names, and try to answer the questions posed in the presentation at the April meeting:

  • Do other working groups (besides LSG with SAM & VCF) have any similar restrictions on their equivalents of reference sequence names?

    (We would want to harmonise to the lowest common denominator.)

  • Should this definition of RNAME identifier be promoted as the GA4GH portable reference sequence name?

    If so, how? Ways of doing this include:

    • Incorporate this regexp or equivalent into other GA4GH standards explicitly
    • Add references in other GA4GH standards to some pan-GA4GH location describing this pan-GA4GH standard RNAME building block
    • Or simply note in other GA4GH standards that “SAM has rules restricting the characters used in RNAMEs — see the SAM specification for details”

@jmarshall
Member Author

Re sample identifiers, I believe Phenopackets has a representation for such a field:

A Biosample refers to a unit of biological material from which the substrate molecules (e.g. genomic DNA, RNA, proteins) for molecular analyses (e.g. sequencing, array hybridisation, mass-spectrometry) are extracted.
[…]

| Field | Type | Status | Description |
|-------|------|--------|-------------|
| id | string | required | arbitrary identifier |
[…]

Example

{
  "id": "sample1",
  "individualId": "patient1",
  "description": "",
  […]

@frafrx or other Phenopackets experts: Biosample describes this id field simply as an “arbitrary identifier”. Does Phenopackets have any other rules about how these ids may be formed? Would Phenopackets wish to align with VCF's rules disallowing tabs and commas (and possibly other punctuation characters to be determined)?

@mbaudis
Member

mbaudis commented Jul 10, 2023

@jmarshall AFAIK in Phenopackets & Beacon we follow the principles of

  • id is resource-local (but could use namespaced identifier)
  • the id of a schema defined object (biosample, subject/individual, ...) can be referenced in other schemas with the schema's name (i.e. the id of a biosample can be referenced in a derived analysis as biosampleId or biosample_id)

@jkbonfield

jkbonfield commented Jul 11, 2023

Late to the party, and too late to change for VCF too, but I dislike the word "contig" being used as just another form of sequence. So did Rodger Staden, the person who coined the word "contig": https://staden.sourceforge.net/contig.html Note for the curious this also uses "gel readings", which later just got shortened to the "reads" we know today.

The original definitions make one thing very clear: it's contiguous, without gaps (if I recall, Rodger later admitted he probably meant continuous, as the reads don't have to abut, just overlap). Genome browsers started muddling things when they didn't understand the difference between a set of overlapping reads forming a contig, and their consensus sequence. They started just using contig instead of consensus sequence, and in doing so lost the original meaning and caused more confusion. It then got corrupted even further when they stopped caring whether the sequence was even contiguous or not. Sadly that's where VCF ended up. (Although I note it sometimes uses "Chromosome" instead.)

So my preference is definitely for "reference sequence" or similar, and this also fits far better with most of the other use cases here (SAM, BAM, CRAM, Refget, etc). If we're talking about assemblies, then sometimes "consensus sequence" is more appropriate, but the two have largely interchangeable use cases.
