Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GA4GH-wide pagination syntax #29

Open
jb-adams opened this issue May 10, 2021 · 27 comments
Open

GA4GH-wide pagination syntax #29

jb-adams opened this issue May 10, 2021 · 27 comments
Assignees

Comments

@jb-adams
Copy link
Member

Cloud WS is looking to incorporate a harmonized, GA4GH-wide pagination syntax into their specs. This is most pertinent for DRS, as we want to update the spec to allow bulk object request and hence paginated results.

Pagination is currently handled by a number of GA4GH specs, but there is not necessarily interop between them. A goal would be to converge upon a minimal common pagination schema that existing specs can make use of, that the DRS spec will then uptake if approved.

@dglazer @ianfore

@jb-adams jb-adams self-assigned this May 10, 2021
@jb-adams
Copy link
Member Author

Currently, I'm only aware of the following GA4GH specs that incorporate pagination:

looping in @mcupak and @denis-yuen from Data Connect (DC) and TRS, respectively. Miro and Denis, would you be interested in contributing to a pagination schema that could be used across multiple GA4GH specs?

@denis-yuen
Copy link
Member

denis-yuen commented May 10, 2021

Yup, thanks for the tag.
TRS has pagination for one endpoint https://github.com/ga4gh/tool-registry-service-schemas/blob/2.0.0/openapi/openapi.yaml#L209 based on github pagination https://docs.github.com/en/rest/guides/traversing-with-pagination

We have other endpoints which have grown in usage to the point that we need to add pagination to them as well so definitely would like to keep an eye on this conversation and/or contribute

@dglazer
Copy link
Member

dglazer commented May 10, 2021

As another input to the discussion, here are Google's best practices for API pagination: https://cloud.google.com/apis/design/design_patterns#list_pagination

@mcupak
Copy link
Collaborator

mcupak commented May 10, 2021

Thanks @jb-adams! We also deferred a decision on pagination to a future release of the Service Registry so as to start small and not have to commit to a particular opinionated way of doing it, so that would be a good candidate for taking this up when it's ready as well.

Some more details on pagination in Data Connect. To sum up, we go very minimal - we don't specify a way of requesting pages on the client side, we simply say the server can provide a next page URL, which can be any URL (relative or absolute). That way implementations choose their own pagination strategy, only relying on clients resolving the links. The next page URL is null/empty on the last page, and it's the only field we use in pagination.

@denis-yuen
Copy link
Member

FWIW, that seems to match TRS. We provide next_page, last_page, and self_link and a client could easily only use the first for pagination

@jb-adams
Copy link
Member Author

I was thinking this too, that currently DC's next_page_url and TRS's next_page are more or less analogous. TRS is therefore a bit more opinionated in that it provides:

  • last_page to jump to the end of results
  • self_link for regenerating the same result set later
  • current_offset, current_limit to capture the number of records per paged result.

One idea for a path to a unified schema could involve:

  • using TRS as the starting pagination schema
  • DC for now only supports the next_page attribute, potentially supporting the other attributes later
  • DC has redundant fields: next_page and next_page_url that contain the same URL. A deprecation notice is given that next_page_url will eventually be dropped

Thoughts on the above?

I also see that DC's pagination is given in the response body, whereas TRS's is in the header. I don't think this should be a blocker for the development of the schema itself, as it could be usable in either location (although represented differently either as JSON or HTTP headers, and this will affect the way it's written in the OpenAPI)

@dglazer
Copy link
Member

dglazer commented May 11, 2021

Generally feels good. A few notes:

  • I like next_page as the crux of the answer -- simple and sufficient. I suggest we get the security folks to weigh in on the anti-tampering thoughts in the Google best practices, and whether that might lead us to prefer a next_page_token instead of a full URL; I don't have strong personal opinions
  • does last_page in TRS mean the previous page or the final page? Either way, I suggest making it optional for new APIs. (It feels only occasionally useful for clients, and potentially inefficient for servers to implement.)
  • I think self-link should also be optional, unless I'm missing some compelling use case

We should also standardize request parameters:

  • I think we do want to require an equivalent of the TRS limit parameter
  • I'm not a fan of requiring the TRS offset parameter (although it's a fine option if some APIs want it)

On syntax:

  • it would be nice if all the pagination fields were named consistently with each other (e.g. page_size, next_page_url). But I don't feel strongly about the benefits of that vs, the benefits of compatibility with existing TRS and DC
  • I prefer next_page_url (or next_page_token) to next_page for clarity. But again I don't feel strongly about the benefits of that vs. backwards compatibility.
  • We should take a stand on "in the header" vs. "in the body". My strong-ish preference is "in the body" with all the other response info, but that's mostly for aesthetic vs. hardcore technical reasons. Are there technical arguments either way?

@denis-yuen
Copy link
Member

the previous page or the final page? Either way, I suggest making it optional for new APIs

The final page. I agree that this could be optional since it is a just a convenience to quickly reach the last page (which may be more or less useful depending on how the pages is sorted). Ditto with self link.

We should take a stand on "in the header" vs. "in the body". My strong-ish preference is "in the body" with all the other response info, but that's mostly for aesthetic vs. hardcore technical reasons. Are there technical arguments either way?

I was under the impression that having these kind of links in the header is a standard. Searching around, this argues that GitHub is following rfc 5988

In this matter several choices to implement that appear :

  • Wrap the items within an envelope in the HTTP body, this envelope will also contain field having information on the pagination. You can find variation on the exact implementation some choose json-api, some their home brewed solution. This is probably the most common approach as this does require less HTTP knowledge.
    And it is likely preferred by pure Javascript clients as there’s no need to peek outside the body.
  • Use the HTTP response with pagination meta-data being placed in the headers. That’s the path chosen by Github in their v3 API. That’s also this option that this blog post will explore.
  • More specifically the Github API is using the Link header form the HTTP standard. It allows express relations and how to reach them for a resource.

https://blog.arkey.fr/2017/10/02/pagination_hateoas_link.en/

@dglazer
Copy link
Member

dglazer commented May 11, 2021

Thanks @denis-yuen for the pointer to GitHub API pagination -- I hadn't seen that before, but like it as a pattern, and saying "we're the same as GitHub" is a good story. It has pretty much all the same features as TRS (so would be easy for any TRS server/client to adopt), but uses different syntax (which would be a temporary nuisance).

Mapping my previous opinions to the GitHub syntax, every API that supports pagination:

  • Response
    • MUST be able to return a Link: header
    • MUST include links of type rel="next" in that header
    • MAY include links of type rel="[first|prev|last]" in that header
  • Request
    • SHOULD support a per_page request parameter (which the servers should take as a request not a constraint)
    • MAY support a page request parameter for random access

@ianfore
Copy link

ianfore commented May 12, 2021

A couple of quick comments

  • The DataConnect use of pagination is using it as a way of polling if results are ready yet. i.e. you follow "next page" URLs until data appears. Is that a common paging need? Or a different use case?
  • Beacon is another GA4GH API with a current approach to pagination that should be brought into the discussion

@jb-adams
Copy link
Member Author

I'm supportive of the Link header pagination as described in Github docs. It's a bigger departure from how existing GA4GH specs handle it, which presents a challenge in retrofitting current specs. Though any solution we come up with will involve a retrofit, so better to put in the hard work now to align ourselves with internet standards/best practices.

| The DataConnect use of pagination is using it as a way of polling if results are ready yet. i.e. you follow "next page" URLs until data appears. Is that a common paging need? Or a different use case?

The Link header could potentially serve as a way to poll. If search results are not ready, then a Link header with a single link (i.e. rel=next) could be returned, allowing the client to poll the same endpoint. This might not be considered a best practice as it could be seen as overloading the header.

| Beacon is another GA4GH API with a current approach to pagination that should be brought into the discussion

looping @jrambla into this discussion. Jordi, we are discussing if we can converge around an approach to pagination that will be picked up by multiple GA4GH specs.

@denis-yuen
Copy link
Member

I haven't looked at it in a while, but I vaguely recall some of the difference for TRS might be due to the fact that we were working in swagger 2.0 as opposed to openapi 3.0 now. But I'm commenting totally without double-checking

@dglazer
Copy link
Member

dglazer commented May 12, 2021

Re polling support in DataConnect -- if I'm reading it right, I don't see a problem. The DC spec uses an empty data payload ("a TableData object whose data property is a zero element array") to mean "this page of results isn't ready yet", and clients are expected to use the next_page_url to try again later. It seems fine to keep that convention, but have clients use the rel="next" link instead.

I agree with @ianfore that it's a different use case, and I don't expect most APIs will need to support long-running queries, but I don't mind the slight overloading for those that do, like DataConnect. (And it feels equally clean in either the current syntax or the newly proposed syntax.)

@jrambla
Copy link

jrambla commented Jul 4, 2021

Beacon approach is using a naive old version of pagination, that we haven't revised in several years. For Beacon v2 we are reviewing it. Therefore I would not consider the current approach (skip and limit) as a something to preserve.

The discussion above triggers a couple of fundamental question for me:

  1. How much the APIs must rely on the "transport" protocol? If the answer is that we can use the API request and response structures over other mechanisms like message queues, then relying on HTTP specific stuff (headers) should not be the right approach. As mention above citing a blog, one design principal could be to allow a client to simply tear off the wrapper and rely in the request or response payload. Combining information in both places breaks that design.
  2. Complete protocol agnosticism would not be possible for aspects like security (e.g. OAuth token in the header), but could this dependencies be constrained just to "transversal" aspects (security, logging, etc.)?
  3. Is pagination something "transversal" or "core" for the specification? The fact that we are discussing it here means that is transversal or just a common issue that all of us need to address?
  4. Which design principles are "too advanced" for non-expert implementers? (for Beacon that is important as we aim for humble projects/institutions too, with novice to intermediate developers).

My personal preference in Beacon has been for making it as much as protocol agnostic as possible.
But both the HTTP header and the next_page token options have been raised by different Beacon team members.
I'm still in need of reading more about pagination for performance aspects to have a more informed opinion.

@jb-adams
Copy link
Member Author

Notes from the FASP hackathon on this issue can be found here under item 2B

@dglazer
Copy link
Member

dglazer commented Aug 10, 2021 via email

@dglazer
Copy link
Member

dglazer commented Jun 22, 2022

Checking in, since this was mentioned at the Steering Committee just now as ongoing, but I haven't seen any updates in almost a year -- what's the current status, and how close are we to having a votable PR?

@denis-yuen
Copy link
Member

FWIW, probably not very useful, but I haven't seen/heard anything on my end.

@jrambla
Copy link

jrambla commented Jun 26, 2022

Me neither, althogh, as you can deduct, we have something already in the approved Beacon v2

@mamanambiya
Copy link
Collaborator

@dglazer, you are right; nothing much has happened since the last discussions before last year's FASP hackathon. We will table this issue for discussion at TASC's upcoming meeting. It looks like the GitHub-like approach is what most would go for.

@patmagee
Copy link

In most of the REST API's that I have used, Pagination is not strictly in the header but gets incorporated directly in the response body. As an implementor, putting the pagination in the headers adds a bit of a conundrum since we often fetch data from a remote serve and pass the response body back to a UI or client. At this stage all of the context around headers has been stripped out and is no longer accessible to the end users. Additionally, placing pagination outside of the Response Body means that any internal handling of that response needs to pass more information around then just translating the body from json into some sort of Object.

Another issue is one of practical integration with other tools ie curl. IMO it's a lot easier to pipe the JSON content of a request to something like jq and check to see if there is any pagination information, then it is to try to extract the pagination info from the curl output. Not to mention headers are not even displayed by default, so the user would not necessarily know the response has any pagination at all

I am definitely in favour of having all information related to pagination directly in the response body. It is easy to interpret, it is disjoint from the actual original request (ie you can look at a JSON alone and determine what the next page should be), and IMO its going to be the easiest thing for people to implement on both sides of the implementation spectrum (IE servers and clients)

I am a bit concerned about the Github Header based approach of pagination

@patmagee
Copy link

It is also worth asking the question of who the end user is, and how we expect they will interact with our apis.

GitHub's approach is a tried and true method as evidenced by its usage in the GitHub api. That being said, I don't really know anyone other then application developers who've used the GitHub api directly.
If we anticipate end users to interact with the apis directly ie through curl, then not having info in the request body makes their life harder

@jrambla
Copy link

jrambla commented Nov 29, 2022

I'm still more keen on the body option, as stated in my comment on Jul 4, 2021.
Making the protocols agnostic to the transport/serialization make them more "portable" and robust.

@patmagee
Copy link

@jrambla That is a good point. I think Data Connect demonstrated the power of using pagination within the body by allowing you to create a fully functioning API WITH pagination strictly by using static files plopped in a bucket.

@denis-yuen
Copy link
Member

I was under the impression that having these kind of links in the header is a standard. Searching around, this argues that GitHub is following rfc 5988

FWIW, did a bit more looking around. It looks like rfc 5988 is followed by rfc 8288 which has a few more examples.
https://www.rfc-editor.org/rfc/rfc5988#section-5.5
https://www.rfc-editor.org/rfc/rfc8288#section-3.5

Can also search github for examples of libraries and code that use them, for example https://github.com/search?q=rfc8288&type=commits

If we do end up rolling our own standard in the response body, I would suggest keeping the link headers so that we remain compatible even if it is redundant.

@patmagee
Copy link

@denis-yuen thanks for the links to the rfc. I think it's 100% acceptable to allow both approaches, but just decide that there is a minimum approach that all apis should have in regards to pagination.

@mamanambiya mamanambiya pinned this issue Apr 19, 2023
@mamanambiya mamanambiya assigned andrewyatz and mamanambiya and unassigned jb-adams Jun 5, 2023
@andrewyatz
Copy link
Contributor

Draft proposal will be distributed asap for Monday's TASC meeting

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants