Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add spec for list, filtered list, and attribute endpoints. #81

Open
wants to merge 16 commits into
base: master
Choose a base branch
from

Conversation

nsheff
Copy link
Member

@nsheff nsheff commented Aug 7, 2024

No description provided.

Copy link
Collaborator

@andrewyatz andrewyatz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just the one correction sorry

docs/seqcol.md Outdated Show resolved Hide resolved
@nsheff
Copy link
Member Author

nsheff commented Aug 9, 2024

I added some rationale to the decision record. Please take a look and add/change where you like!

docs/seqcol.md Outdated Show resolved Hide resolved
docs/seqcol.md Outdated Show resolved Hide resolved
docs/seqcol.md Outdated
["1216","970","1788"]
```

Example `/attribute/names/:digest` return value:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example missing object type also here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what you mean by "missing the object type" here, sorry.

docs/seqcol.md Outdated Show resolved Hide resolved
docs/seqcol.md Outdated Show resolved Hide resolved
docs/seqcol.md Outdated Show resolved Hide resolved
@sveinugu
Copy link
Collaborator

Also wanted to add in my review a general point, but messed it up. So here it is;

Should we require all the endpoints? Has this been discussed? For instance, if you are only interested in using the attribute endpoint, should collection then be required?

@sveinugu
Copy link
Collaborator

I have now listened to the recording of the informal meeting I thought was cancelled. A bit annoying to not be able to comment on some very interesting discussions, so I wrote down a series of comments while listening:

  1. On whether the endpoints should be required: I am happy with both attribute and list being required. But collection should then move down to recommended, to allow for the new use cases where they are only interested in the sequences attribute within the spec. I also made a comment about this in the review above.
  2. Regarding Nathan's comment on whether required or recommended really matters, I think he is thinking very much with an American researcher brain. I think in Europe and probably Asia, people are less likely to decide to not fully follow a spec, especially if implementation is done in an infrastructure setting. So I think it does matter to some degree.
  3. Nathan was right in that I would like to have the full list/attribute(s?) endpoint in the spec, but I I didn't want to spend much energy in getting it in there now, as it can easily be added later.
  4. Regarding skipping the digest of the sorted_names_lengths_pairs attribute:
  • I am all for it and wanted to suggest it before (I believe) Tim did, but then realised it would be like talking to your TV (which I happen to do occasionally!)
  • However, I think the endpoint should return a list of canonicalised strings (encoded in utf8), not a list of objects. The reason is that with the former, you can just canonicalize the list (which in this case is just about removing whitespace) and digest that string to get to the digest. This is exactly what you would to for the other attributes, e.g. "names" or "sequences". If you in this case return a list of objects, you would need to treat 'sorted_names_lengths_pairs' differently than the other attributes.
  • In the RFC, objects are sorted on their keys, e.g. {"names": "chr1", "lengths": 1234} will be canonicalized to b'{"lengths":1234,"names":"chr1"}', encoded in utf8. Lists are obviously not sorted. I think the difference between the two confused you.
  • The main reason I originally thought a double-digest would be fine is that it's guaranteed to be ASCII. The canonicalized JSON is encoded in utf8 and we would need to think about whether that messes with sorting.
  • That said, I am 99% sure sorting an utf8 is clearly defined and not an issue. The only thing we need to make sure is that people actually provide the results in utf8, so that should be stated in the spec somewhere, e.g. all endpoints should provide results with the Content-type: application/json; charset=utf-8 header set
  1. Adding a level option to the attribute endpoint would really be about un-digesting the digests. As a consequence, the sequences attribute for level=2 should really expand into a list of full sequences. Which might be something we want, but expands the issue. To me, just skipping the digest is a simpler and more straightforward solution. I think this is also the conclusion you ended up with, except that I would want the contents to be the canonicalized strings, and not the pre-canonicalized objects.

@nsheff
Copy link
Member Author

nsheff commented Aug 21, 2024

I'm with you on most of this, but I feel pretty strongly that the /attribute endpoint would return objects, and not strings.

This is an API, and it should return JSON -- not a string of a JSON object! I actually think you've got the rationale backwards here:

The reason is that with the former, you can just canonicalize the list (which in this case is just about removing whitespace) and digest that string to get to the digest. This is exactly what you would to for the other attributes, e.g. "names" or "sequences". If you in this case return a list of objects, you would need to treat 'sorted_names_lengths_pairs' differently than the other attributes.

I think for the other attributes, you'd get the object itself -- which is a string. I think you're getting confused because those objects are strings.

So to me, to return strings would be modify the object, whereas in other cases you'd just return the object. So it's actually the opposite of what you say -- if you return a list of objects, everything would be treated the same. If you want to return a list of strings, not only is this much less convenient for the consumer of the API, but you'd also have to treat 'sorted_name_length_pairs' differently from the other attributes.

@sveinugu
Copy link
Collaborator

sveinugu commented Aug 21, 2024

Well, I feel pretty strongly the other way around! 😄

The issue is how the digest algorithm is specified in relation to the schema. The digest algorithm starts with the attributes in the canonical seqcol object representation, which means (I believe) "according to the schema". So the question is then what is the items.type of the sorted-names-lengths-pair attribute? I am not sure if we have specified the schema for this anywhere, but the way I see it, according to the current spec, it should be string, just like sequences. And just like sequences refers to refget identifiers, the strings in the sorted-names-lengths-pair array refer to something else which is external to the main digest algorithm, which is the outcome of the algorithm specified in section 4.1. My suggestion would be to just remove step 3 in that algorithm and change step 4 to sort the canonicalized strings instead of the digests. Your suggestion would entail that we in addition add another step after 4 to de-canonicalize the strings back into objects.

I thought one in addition needed to make a special case out of the sorted-names-lengths-pair attribute in the main algorithm, adding an initial canonicalization step, but you are right in that the digest algorithm would still work as it is with objects as items, the objects would just be canonicalized together with the lists.

The main issue is the sorting step. Without canonicalization, sorting can be performed in several incompatible ways, especially if we support non-ascii keys. So we would want to canonicalize before sorting. Your suggestion adds an (to me, unnecessary) additional step to de-canonicalize the sorted strings.

One advantage of my approach is that it is trivial to get from the contents of a names-lengths-pair attribute (which I think we should now also add to the spec, btw, as this would be the one-to-one match for a "chromlen" file) to the contents of the sorted-names-lengths-pair attribute. As both are lists of strings, you would only need to lexicographically sort one to get to the other. If both are lists of objects, however, you would need to go through a canonicalization-sort-decanonicalization procedure, which is less trivial.

@sveinugu
Copy link
Collaborator

sveinugu commented Aug 21, 2024

How your solution would look in Python:

>>> a = [{"names": "seq1", "lengths": 1000}, {"names": "seq2", "lengths": 500}]
>>> sorted(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'dict' and 'dict'
>>> sorted(a, key=lambda x:x['names'])
[{'names': 'seq1', 'lengths': 1000}, {'names': 'seq2', 'lengths': 500}]
>>> sorted(a, key=lambda x:x['lengths'])
[{'names': 'seq2', 'lengths': 500}, {'names': 'seq1', 'lengths': 1000}]

Compare this to:

>>> a = [b'{"lengths":1000,"names":"seq1"}', b'{"lengths":500,"names":"seq2"}']
>>> sorted(a)
[b'{"lengths":1000,"names":"seq1"}', b'{"lengths":500,"names":"seq2"}']

and I believe you should see my point.

Edit: What I called "de-canonicalization" above is just parsing an utf8-encoded JSON string, e.g. in Python:

>>> import json
>>> json.loads(b'{"lengths":1000,"names":"seq1"}')
{'lengths': 1000, 'names': 'seq1'}

So my suggested output is just a list of JSON strings, the user just needs to parse each one independently to get to the objects. The main problem with your suggestion is that one would need to implement canonicalization in more use cases than with my suggestion. With my suggestion I believe canonicalization would only be needed for calculating the digests. If you need to sort, you can use the strings, and if you want to compare you can either do it with strings if that is what you have, or you can parse the strings into objects to compare with other objects.

@sveinugu
Copy link
Collaborator

sveinugu commented Aug 21, 2024

Which actually points to the main reason I opted for an extra digest step initially: Sorting a utf8-encoded byte-string can easily be confused with sorting a decoded string, but these are different things and I believe they may not sort the same for non-ascii characters. So digesting into an ascii string makes it hard to do the sorting wrong. But the advantage of getting an easily understandable output from the attributes endpoint makes up for the possibility for people to implement the sorting incorrectly. The latter can in any case be mitigated by a good test suite.

docs/seqcol.md Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants