Add spec for list, filtered list, and attribute endpoints. #81

nsheff · 2024-08-07T12:28:10Z

No description provided.

andrewyatz

Just the one correction sorry

docs/seqcol.md

nsheff · 2024-08-09T01:32:50Z

I added some rationale to the decision record. Please take a look and add/change where you like!

docs/seqcol.md

sveinugu · 2024-08-21T19:51:14Z

docs/seqcol.md

+["1216","970","1788"]
+```
+
+Example `/attribute/names/:digest` return value:


Example missing object type also here

I don't understand what you mean by "missing the object type" here, sorry.

docs/seqcol.md

sveinugu · 2024-08-21T20:18:32Z

Also wanted to add in my review a general point, but messed it up. So here it is;

Should we require all the endpoints? Has this been discussed? For instance, if you are only interested in using the attribute endpoint, should collection then be required?

docs/seqcol.md

sveinugu · 2024-08-21T22:02:22Z

I have now listened to the recording of the informal meeting I thought was cancelled. A bit annoying to not be able to comment on some very interesting discussions, so I wrote down a series of comments while listening:

On whether the endpoints should be required: I am happy with both attribute and list being required. But collection should then move down to recommended, to allow for the new use cases where they are only interested in the sequences attribute within the spec. I also made a comment about this in the review above.
Regarding Nathan's comment on whether required or recommended really matters, I think he is thinking very much with an American researcher brain. I think in Europe and probably Asia, people are less likely to decide to not fully follow a spec, especially if implementation is done in an infrastructure setting. So I think it does matter to some degree.
Nathan was right in that I would like to have the full list/attribute(s?) endpoint in the spec, but I I didn't want to spend much energy in getting it in there now, as it can easily be added later.
Regarding skipping the digest of the sorted_names_lengths_pairs attribute:

I am all for it and wanted to suggest it before (I believe) Tim did, but then realised it would be like talking to your TV (which I happen to do occasionally!)
However, I think the endpoint should return a list of canonicalised strings (encoded in utf8), not a list of objects. The reason is that with the former, you can just canonicalize the list (which in this case is just about removing whitespace) and digest that string to get to the digest. This is exactly what you would to for the other attributes, e.g. "names" or "sequences". If you in this case return a list of objects, you would need to treat 'sorted_names_lengths_pairs' differently than the other attributes.
In the RFC, objects are sorted on their keys, e.g. {"names": "chr1", "lengths": 1234} will be canonicalized to b'{"lengths":1234,"names":"chr1"}', encoded in utf8. Lists are obviously not sorted. I think the difference between the two confused you.
The main reason I originally thought a double-digest would be fine is that it's guaranteed to be ASCII. The canonicalized JSON is encoded in utf8 and we would need to think about whether that messes with sorting.
That said, I am 99% sure sorting an utf8 is clearly defined and not an issue. The only thing we need to make sure is that people actually provide the results in utf8, so that should be stated in the spec somewhere, e.g. all endpoints should provide results with the Content-type: application/json; charset=utf-8 header set

Adding a level option to the attribute endpoint would really be about un-digesting the digests. As a consequence, the sequences attribute for level=2 should really expand into a list of full sequences. Which might be something we want, but expands the issue. To me, just skipping the digest is a simpler and more straightforward solution. I think this is also the conclusion you ended up with, except that I would want the contents to be the canonicalized strings, and not the pre-canonicalized objects.

nsheff · 2024-08-21T22:09:36Z

I'm with you on most of this, but I feel pretty strongly that the /attribute endpoint would return objects, and not strings.

This is an API, and it should return JSON -- not a string of a JSON object! I actually think you've got the rationale backwards here:

The reason is that with the former, you can just canonicalize the list (which in this case is just about removing whitespace) and digest that string to get to the digest. This is exactly what you would to for the other attributes, e.g. "names" or "sequences". If you in this case return a list of objects, you would need to treat 'sorted_names_lengths_pairs' differently than the other attributes.

I think for the other attributes, you'd get the object itself -- which is a string. I think you're getting confused because those objects are strings.

So to me, to return strings would be modify the object, whereas in other cases you'd just return the object. So it's actually the opposite of what you say -- if you return a list of objects, everything would be treated the same. If you want to return a list of strings, not only is this much less convenient for the consumer of the API, but you'd also have to treat 'sorted_name_length_pairs' differently from the other attributes.

sveinugu · 2024-08-21T22:52:24Z

Well, I feel pretty strongly the other way around! 😄

The issue is how the digest algorithm is specified in relation to the schema. The digest algorithm starts with the attributes in the canonical seqcol object representation, which means (I believe) "according to the schema". So the question is then what is the items.type of the sorted-names-lengths-pair attribute? I am not sure if we have specified the schema for this anywhere, but the way I see it, according to the current spec, it should be string, just like sequences. And just like sequences refers to refget identifiers, the strings in the sorted-names-lengths-pair array refer to something else which is external to the main digest algorithm, which is the outcome of the algorithm specified in section 4.1. My suggestion would be to just remove step 3 in that algorithm and change step 4 to sort the canonicalized strings instead of the digests. Your suggestion would entail that we in addition add another step after 4 to de-canonicalize the strings back into objects.

I thought one in addition needed to make a special case out of the sorted-names-lengths-pair attribute in the main algorithm, adding an initial canonicalization step, but you are right in that the digest algorithm would still work as it is with objects as items, the objects would just be canonicalized together with the lists.

The main issue is the sorting step. Without canonicalization, sorting can be performed in several incompatible ways, especially if we support non-ascii keys. So we would want to canonicalize before sorting. Your suggestion adds an (to me, unnecessary) additional step to de-canonicalize the sorted strings.

One advantage of my approach is that it is trivial to get from the contents of a names-lengths-pair attribute (which I think we should now also add to the spec, btw, as this would be the one-to-one match for a "chromlen" file) to the contents of the sorted-names-lengths-pair attribute. As both are lists of strings, you would only need to lexicographically sort one to get to the other. If both are lists of objects, however, you would need to go through a canonicalization-sort-decanonicalization procedure, which is less trivial.

sveinugu · 2024-08-21T23:10:08Z

How your solution would look in Python:

>>> a = [{"names": "seq1", "lengths": 1000}, {"names": "seq2", "lengths": 500}]
>>> sorted(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'dict' and 'dict'
>>> sorted(a, key=lambda x:x['names'])
[{'names': 'seq1', 'lengths': 1000}, {'names': 'seq2', 'lengths': 500}]
>>> sorted(a, key=lambda x:x['lengths'])
[{'names': 'seq2', 'lengths': 500}, {'names': 'seq1', 'lengths': 1000}]

Compare this to:

>>> a = [b'{"lengths":1000,"names":"seq1"}', b'{"lengths":500,"names":"seq2"}']
>>> sorted(a)
[b'{"lengths":1000,"names":"seq1"}', b'{"lengths":500,"names":"seq2"}']

and I believe you should see my point.

Edit: What I called "de-canonicalization" above is just parsing an utf8-encoded JSON string, e.g. in Python:

>>> import json
>>> json.loads(b'{"lengths":1000,"names":"seq1"}')
{'lengths': 1000, 'names': 'seq1'}

So my suggested output is just a list of JSON strings, the user just needs to parse each one independently to get to the objects. The main problem with your suggestion is that one would need to implement canonicalization in more use cases than with my suggestion. With my suggestion I believe canonicalization would only be needed for calculating the digests. If you need to sort, you can use the strings, and if you want to compare you can either do it with strings if that is what you have, or you can parse the strings into objects to compare with other objects.

sveinugu · 2024-08-21T23:20:02Z

Which actually points to the main reason I opted for an extra digest step initially: Sorting a utf8-encoded byte-string can easily be confused with sorting a decoded string, but these are different things and I believe they may not sort the same for non-ascii characters. So digesting into an ascii string makes it hard to do the sorting wrong. But the advantage of getting an easily understandable output from the attributes endpoint makes up for the possibility for people to implement the sorting incorrectly. The latter can in any case be mitigated by a good test suite.

docs/seqcol.md

Co-authored-by: Sveinung Gundersen <sveinugu@ifi.uio.no>

Oct updates

nsheff added 2 commits August 7, 2024 08:25

Add spec for list, filtered list, and attribute endpoints.

b0c2756

add object type to attribute endpoint

2f58521

nsheff requested review from andrewyatz, tcezard, ahwagner and sveinugu August 7, 2024 12:28

andrewyatz requested changes Aug 7, 2024

View reviewed changes

docs/seqcol.md Outdated Show resolved Hide resolved

nsheff added 2 commits August 7, 2024 14:07

fix pagination braces

8afd68d

add recent decisions to record

740735e

nsheff commented Aug 21, 2024

View reviewed changes

docs/seqcol.md Outdated Show resolved Hide resolved

Update docs/seqcol.md

316c12c

tcezard approved these changes Aug 21, 2024

View reviewed changes

andrewyatz approved these changes Aug 21, 2024

View reviewed changes

sveinugu requested changes Aug 21, 2024

View reviewed changes

sveinugu reviewed Aug 21, 2024

View reviewed changes

docs/seqcol.md Show resolved Hide resolved

nsheff added 3 commits August 23, 2024 17:32

typos

7fc9631

typos

7b8bafa

Update readme to reflect new branding

f8254a0

nsheff commented Sep 4, 2024

View reviewed changes

docs/seqcol.md Outdated Show resolved Hide resolved

nsheff added 4 commits October 2, 2024 08:54

switch list endpoint to use query params

88e7bd3

update decision record from last decisions.

b15692a

typo

db78af7

clarify

0f20a99

nsheff and others added 4 commits October 2, 2024 17:38

Update docs/seqcol.md

6575bd1

Co-authored-by: Sveinung Gundersen <sveinugu@ifi.uio.no>

Update docs/decision_record.md

951dd87

Update docs/seqcol.md

fa487c2

Merge pull request #85 from ga4gh/oct-updates

19a0668

Oct updates

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add spec for list, filtered list, and attribute endpoints. #81

Add spec for list, filtered list, and attribute endpoints. #81

nsheff commented Aug 7, 2024

andrewyatz left a comment

nsheff commented Aug 9, 2024

sveinugu Aug 21, 2024

nsheff Sep 4, 2024

sveinugu commented Aug 21, 2024

sveinugu commented Aug 21, 2024

nsheff commented Aug 21, 2024

sveinugu commented Aug 21, 2024 •

edited

Loading

sveinugu commented Aug 21, 2024 •

edited

Loading

sveinugu commented Aug 21, 2024 •

edited

Loading

Add spec for list, filtered list, and attribute endpoints. #81

Are you sure you want to change the base?

Add spec for list, filtered list, and attribute endpoints. #81

Conversation

nsheff commented Aug 7, 2024

andrewyatz left a comment

Choose a reason for hiding this comment

nsheff commented Aug 9, 2024

sveinugu Aug 21, 2024

Choose a reason for hiding this comment

nsheff Sep 4, 2024

Choose a reason for hiding this comment

sveinugu commented Aug 21, 2024

sveinugu commented Aug 21, 2024

nsheff commented Aug 21, 2024

sveinugu commented Aug 21, 2024 • edited Loading

sveinugu commented Aug 21, 2024 • edited Loading

sveinugu commented Aug 21, 2024 • edited Loading

sveinugu commented Aug 21, 2024 •

edited

Loading

sveinugu commented Aug 21, 2024 •

edited

Loading

sveinugu commented Aug 21, 2024 •

edited

Loading