Merge pull request #44 from ga4gh/spec_rewrite

shot at bringing draft spec up to date with adrs
ga4gh · Aug 22, 2023 · 2e96056 · 2e96056
2 parents f04dd1c + 8ac2c4d
commit 2e96056
Show file tree

Hide file tree

Showing 9 changed files with 283 additions and 68 deletions.
diff --git a/docs/compare_collections.md b/docs/compare_collections.md
@@ -0,0 +1,45 @@
+
+# How to: Compare two collections
+
+## Use case
+
+- You have a local sequence collection, and an identifier for a collection in a server. You want to compare the two to see if they have the same coordinate system.
+- You have two identifiers for collections you know are stored by a server. You want to compare them.
+
+## How to do it
+
+You can use the `/comparison/:digest1/:digest2` endpoint to compare two collections. The comparison function gives information-rich feedback about the two collections, but it can take some thought to interpret. Here are some examples
+
+### Strict identity
+
+Some analyses may require that the collections be *strictly identical* -- that is, they have the same sequence content, with the same names, in the same order. For example, a bowtie2 index produced from one sequence collection that differs in any aspect (sequence name, order difference, etc), will not necessarily produce the same output. Therefore, we must be able to identify that two sequence collections are identical in terms of sequence content, sequence name, and sequence order. 
+
+ This comparison can easily be done by simply comparing the seqcol digest, you don't need the `/comparison` endpoint. **Two collections will have the same digest if they are identicial in content and order for all `inherent` attributes.** Therefore, if the digests differ, then you know the collections differ in at least one inherent attribute. If you have a local sequence collection, and an identifier, then you can compare them for strict identity by computing the identifier for the local collection and seeing if they match.
+
+### Order-relaxed identity
+
+A process that treats each sequence independently and re-orders its results will return identical results as long as the sequence content and names are identical, even if the order doesn’t match. Therefore, you may be interested in saying "these two sequence collections have identical content and sequence names, but differ in order". The `/comparison` return value can answer this question:
+
+Two collections meet the criteria for order-relaxed identity if:
+
+1. the value of the `elements.total.a` and `elements.total.b` match, (the collections have the same number of elements).
+2. this value is the same as `elements.a-and-b.<attribute>` for all attributes (the content is the same)
+3. all entries in `elements.a-and-b-same-order.<attribute>` are false (the order differs for all attributes)
+
+Then, we know the sequence collection content is identical, but in a different order. 
+
+###### Name-relaxed identity
+
+Some analysis (for example, a `salmon` alignment) will be identical regardless of the chromosome names, as it considers the digest of the sequence only. Thus, we'd like to be able to say "These sequence collections have identical content, even if their names and/or orders differ."
+
+How to assess: As long as the `a-and-b` number for `sequences` equals the values listed in `elements.total`, then the sequence content in the two collections is identical
+
+###### Length-only compatible (shared coordinate system)
+
+A much weaker type of compatibility is two sequence collections that have the same set of lengths, though the sequences themselves may differ. In this case we may or may not require name identity. For example, a set of ATAC-seq peaks that are annotated on a particular genome could be used in a separate process that had been aligned to a different genome, with different sequences -- as long as the lengths and names were shared between the two analyses.
+
+How to assess: We will ignore the `sequences` attribute, but ensure that the `names` and `lengths` numbers for `a-and-b` match what we expect from `elements.total`. If the `a-and-b-same-order` is also true for both `names` and `lengths`, then we can be assured that the two collections share an ordered coordinate system. If however, their coordinate system matches but is not in the same order, then we require looking at the `sorted_name_length_pairs` attribute. If the `a-and-b` entry for `sorted_name_length_pairs` is the same as the number for `names` and `lengths`, then these collections share an (unordered) coordinate system.
+
+### Others...
+
+There are also probably other types of compatibility you can assess using the result of the `/comparison` function. Now that you know the basics, and once you have an understanding of what the comparison function results mean, it should be possible to figure out if you can assess a particular type of compatibility for your use case.
diff --git a/docs/digest_from_collection.md b/docs/digest_from_collection.md
@@ -0,0 +1,114 @@
+
+# How to: Compute a seqcol digest given a sequence collection
+
+## Use case
+
+
+One of the most common uses of the seqcol specification is to compute a standard, universal identifier for a particular sequence collection. You have a collection of sequences, like a reference genome or transcriptome, and you want to determine its seqcol identifier. There are two ways to approach this: 1. Using an existing implementation; 2. Implement the seqcol digest algorithm yourself (it's not that hard).
+
+
+## 1. Using existing implementations
+
+### Reference implementation in Python
+
+If working from within Python, you can use the reference implementation like this:
+
+1. Install the seqcol package with some variant of `pip install seqcol`.
+2. Build up your canonical seqcol object
+3. Compute its digest:
+
+```
+import seqcol
+seqcol.digest(seqcol_obj)
+```
+
+If you have a FASTA file, you could get a canonical seqcol object like this:
+
+```
+seqcol_obj = seqcol.csc_from_fasta(fa_file)
+```
+
+## 2. Implement the seqcol digest algorithm yourself
+
+Follow the procedure under the section for [Encoding](/specification/#1-encoding-computing-sequence-digests-from-sequence-collections). Briefly, the steps are:
+
+- **Step 1**. Organize the sequence collection data into *canonical seqcol object representation*.
+- **Step 2**. Apply [RFC-8785 JSON Canonicalization Scheme](https://www.rfc-editor.org/rfc/rfc8785) (JCS) to canonicalize the value associated with each attribute individually.
+- **Step 3**. Digest each canonicalized attribute value using the GA4GH digest algorithm.
+- **Step 4**. Apply [RFC-8785 JSON Canonicalization Scheme](https://www.rfc-editor.org/rfc/rfc8785) again to canonicalize the JSON of new seqcol object representation.
+- **Step 5**. Digest the final canonical representation again.
+
+Details on each step can be found in the specification.
+
+
+### Example Python code for computing a seqcol encoding
+
+```python
+# Demo for encoding a sequence collection
+
+import binascii
+import hashlib
+import json
+
+def canonical_str(item: dict) -> str:
+    """Convert a dict into a canonical string representation"""
+    return json.dumps(
+        item, separators=(",", ":"), ensure_ascii=False, allow_nan=False, sort_keys=True
+    )
+
+def trunc512_digest(seq, offset=24):
+    """ GA4GH digest function """
+    digest = hashlib.sha512(seq.encode()).digest()
+    hex_digest = binascii.hexlify(digest[:offset])
+    return hex_digest.decode()
+
+# 1. Get data as canonical seqcol object representation
+
+seqcol_obj = {
+  "lengths": [
+    248956422,
+    133797422,
+    135086622
+  ],
+  "names": [
+    "chr1",
+    "chr2",
+    "chr3"
+  ],
+  "sequences": [
+    "2648ae1bacce4ec4b6cf337dcae37816",
+    "907112d17fcb73bcab1ed1c72b97ce68",
+    "1511375dc2dd1b633af8cf439ae90cec"
+  ]
+}
+
+# Step 1a: We would here need to remove any non-inherent attributes,
+# so that only the inherent attributes contribute to the digest.
+# In this example, all attributes are inherent.
+
+# Step 2: Apply RFC-8785 to canonicalize the value 
+# associated with each attribute individually.
+
+seqcol_obj2 = {}
+for attribute in seqcol_obj:
+    seqcol_obj2[attribute] = canonical_str(seqcol_obj[attribute])
+seqcol_obj2  # visualize the result
+
+# Step 3: Digest each canonicalized attribute value
+# using the GA4GH digest algorithm.
+
+seqcol_obj3 = {}
+for attribute in seqcol_obj2:
+    seqcol_obj3[attribute] = trunc512_digest(seqcol_obj2[attribute])
+print(json.dumps(seqcol_obj3, indent=2))  # visualize the result
+
+# Step 4: Apply RFC-8785 again to canonicalize the JSON 
+# of new seqcol object representation.
+
+seqcol_obj4 = canonical_str(seqcol_obj3)
+seqcol_obj4  # visualize the result
+
+# Step 5: Digest the final canonical representation again.
+
+seqcol_digest = trunc512_digest(seqcol_obj4)
+```
diff --git a/docs/digest_from_fasta.md b/docs/digest_from_fasta.md
diff --git a/docs/fasta_from_digest.md b/docs/fasta_from_digest.md
diff --git a/docs/img/favicon.ico b/docs/img/favicon.ico