Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use range-based pagination for versions #1066

Merged
merged 7 commits into from
Jan 25, 2023

Conversation

Mr0grog
Copy link
Member

@Mr0grog Mr0grog commented Jan 20, 2023

This is a super rough first cut at the logic for range-based pagination for versions (see #579). It’s ugly, problematic, and absolutely not mergeable as-is, but lets us test things out. Still needs:

  • Cleanup and encapsulation. (Maybe pull this stuff out into a concern so it can be used on other models.)
  • Migration to add the appropriate indexes ((capture_time, uuid), (created_at, uuid) maybe also (page_uuid, capture_time, uuid), (page_uuid, created_at, uuid)).
  • Tests.
  • Maybe extend to pages? (Not sure it’s worthwhile, there aren’t critical problems there like there are for versions)

Update 2023-01-25: I’m not going to extend this to other endpoints. While it would be nice, the real priority right now is doing the minimum required for edgi-govdata-archiving/web-monitoring#168

Overall Summary

The slow performance of offset-based pagination (when you get a batch of results with a query like SELECT xyz FROM table OFFSET = n LIMIT = m) is a serious problem on the the versions table (really any large table, but in practice it’s just versions for this app). This solves the problem by using range-based pagination (skipping to the place you want in the result set using a WHERE clause that filters to a range of values from at the same field(s) you are sorting by).

For the versions table, that means doing queries based on a different minimum/maximum (capture_time, uuid) or (created_at, uuid) for each page/batch/chunk of results rather than using SQL's OFFSET clause to get to the chunk we want. This is more-or-less compatible with with any existing client, since the next link still works and the chunk_size parameter also works the same.

On the other hand, this introduces some new restrictions:

  • You can only sort by one field, and it must be either capture_time or created_at instead of any field(s) you want. We need an index for the fields we want to sort by, and these two are the ones we chose.

  • Clients shouldn’t usually be setting the chunk parameter on their own, and will run into problems if they do, since that field is no longer an integer (now it is either a version UUID or a combined "<timestamp>,<uuid>" string).

  • The ?include_total=true parameter is no longer supported for versions. There’s just no performant way to do it without additional caching of that field somewhere (count() in Postgres always requires a full table scan; the entire point of this change is to prevent us from doing large scans in order to paginate results).

There’s also plenty in here that’s not perfectly clean, and not totally generalized. If we wanted to utilize this kind of pagination for other models/controllers, it would need some work to make more abstract.

Mr0grog added a commit that referenced this pull request Jan 24, 2023
Before we can actually do range-based pagination in any usefully workable way, we need to have indexes in place that support it (partially because the primary key for versions is a UUID, which is unordered -- this wouldn't be such a big deal with bigint). These indexes will take a while to build, so it's good to get these in place ahead of time.

A future change will remove the old, non-compound indexes that will be redundant after these new ones finish building.

Addresses part of #571, and required before deploying #1066.
Mr0grog added a commit that referenced this pull request Jan 24, 2023
Before we can actually do range-based pagination in any usefully workable way, we need to have indexes in place that support it (partially because the primary key for versions is a UUID, which is unordered -- this wouldn't be such a big deal with bigint). These indexes will take a while to build, so it's good to get these in place ahead of time.

A future change will remove the old, non-compound indexes that will be redundant after these new ones finish building.

Addresses part of #571, and required before deploying #1066.
This implements the logic for #579 in an ugly, inline way so I could test it out. It definitely needs a lot of cleanup before it can be merged. Also needs a migration to add indexes.
@Mr0grog Mr0grog force-pushed the 579-performant-pagination-for-versions branch from 8e40414 to 0644db7 Compare January 24, 2023 21:46
@Mr0grog Mr0grog marked this pull request as ready for review January 25, 2023 19:33
@Mr0grog
Copy link
Member Author

Mr0grog commented Jan 25, 2023

⚠️ Some important, big deal changes here!

  • This changes the default sorting for /api/v0/versions. It is now capture_time:asc. (It was the only thing that sorted in descending order by default; now everything should default ascending.) I couldn’t find anywhere in other codebases where we use this endpoint without explicitly specifying sorting, so it should be ok.
  • This changes the default sorting for /api/v0/pages/<uuid>/versions! It is now capture_time:asc (or more importantly, the same as /api/v0/versions) instead of capture_time:desc.
  • ?include_total=true is no longer supported for versions. I can’t find anywhere we actually use this for versions (we definitely use it for pages) — probably because of the performance issue — so I think this is OK.

@Mr0grog Mr0grog merged commit 4679d10 into main Jan 25, 2023
@Mr0grog Mr0grog deleted the 579-performant-pagination-for-versions branch January 25, 2023 23:19
Mr0grog added a commit that referenced this pull request Jan 25, 2023
Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-ops that referenced this pull request Jan 25, 2023
Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-ops that referenced this pull request Jan 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant