Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ADR: sequence digest algorithm to be GA4GH digest #31

Merged
merged 7 commits into from
Aug 25, 2023
Merged
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 18 additions & 0 deletions docs/decision_record.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,24 @@

[TOC]

## 2022-05-11 - Sequence identifier specification

### Decision

The GA4GH identifier will be used as our default sequence identifier instead of MD5. Other identifiers can be provided in a separate array and should not be part of the collection checksum calculation.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should specify where we intend to use the identifiers. Something along the line of

Suggested change
The GA4GH identifier will be used as our default sequence identifier instead of MD5. Other identifiers can be provided in a separate array and should not be part of the collection checksum calculation.
The GA4GH identifier will be used as our default sequence identifier instead of MD5. Other identifiers can be provided in a separate array and should not be part of the collection checksum calculation.
It will be used to digest:
- the sequences that are stored in the `sequences` array
- the canonical representation of arrays of level 2
- the canonical representation of the sequence collection of level 1


### Rationale

GA4GH identifiers were created as part of the [Variation Representation Specification standard](https://vrs.ga4gh.org/en/stable/impl-guide/computed_identifiers.html), which included a way of creating identifiers to be used with sequences e.g. ACGT results in the identifier `ga4gh:SQ.aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2`. The scheme uses the [`sha512t24u` function](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0239883) to create a base64 URL-safe representation of a sha512 digest. Adopting GA4GH identifiers ensures sequence collections remains inline with newer standards within the GA4GH ecosystem.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should talk about the prefix ga4gh:SQ in this ADR. First, I don't think we want to use SQ since it is reserved for sequences and second it's not clear that the prefix is part of all the digest that Sequence Collection uses. It should only be used for the level identifier 0.


### Limitations

GA4GH identifiers are not the default identifier used by standards such as CRAM, which uses MD5. We expect sequence collection providers to offer additional identifier arrays to provide compatability with these other formats and to declare their sequence identifier support via service-info.

### Linked issues

- [https://github.com/ga4gh/seqcol-spec/issues/30](https://github.com/ga4gh/seqcol-spec/issues/30)

## 2021-12-01 - Endpoint names and structure

### Decision
Expand Down