Use range-based pagination for versions #1066

Mr0grog · 2023-01-20T21:11:28Z

This is a super rough first cut at the logic for range-based pagination for versions (see #579). It’s ugly, problematic, and absolutely not mergeable as-is, but lets us test things out. Still needs:

Cleanup and encapsulation. (Maybe pull this stuff out into a concern so it can be used on other models.)
Migration to add the appropriate indexes ((capture_time, uuid), (created_at, uuid) maybe also (page_uuid, capture_time, uuid), (page_uuid, created_at, uuid)).
Tests.
Maybe extend to pages? (Not sure it’s worthwhile, there aren’t critical problems there like there are for versions)

Update 2023-01-25: I’m not going to extend this to other endpoints. While it would be nice, the real priority right now is doing the minimum required for edgi-govdata-archiving/web-monitoring#168

Overall Summary

The slow performance of offset-based pagination (when you get a batch of results with a query like SELECT xyz FROM table OFFSET = n LIMIT = m) is a serious problem on the the versions table (really any large table, but in practice it’s just versions for this app). This solves the problem by using range-based pagination (skipping to the place you want in the result set using a WHERE clause that filters to a range of values from at the same field(s) you are sorting by).

For the versions table, that means doing queries based on a different minimum/maximum (capture_time, uuid) or (created_at, uuid) for each page/batch/chunk of results rather than using SQL's OFFSET clause to get to the chunk we want. This is more-or-less compatible with with any existing client, since the next link still works and the chunk_size parameter also works the same.

On the other hand, this introduces some new restrictions:

You can only sort by one field, and it must be either capture_time or created_at instead of any field(s) you want. We need an index for the fields we want to sort by, and these two are the ones we chose.
Clients shouldn’t usually be setting the chunk parameter on their own, and will run into problems if they do, since that field is no longer an integer (now it is either a version UUID or a combined "<timestamp>,<uuid>" string).
The ?include_total=true parameter is no longer supported for versions. There’s just no performant way to do it without additional caching of that field somewhere (count() in Postgres always requires a full table scan; the entire point of this change is to prevent us from doing large scans in order to paginate results).

There’s also plenty in here that’s not perfectly clean, and not totally generalized. If we wanted to utilize this kind of pagination for other models/controllers, it would need some work to make more abstract.

Before we can actually do range-based pagination in any usefully workable way, we need to have indexes in place that support it (partially because the primary key for versions is a UUID, which is unordered -- this wouldn't be such a big deal with bigint). These indexes will take a while to build, so it's good to get these in place ahead of time. A future change will remove the old, non-compound indexes that will be redundant after these new ones finish building. Addresses part of #571, and required before deploying #1066.

This implements the logic for #579 in an ugly, inline way so I could test it out. It definitely needs a lot of cleanup before it can be merged. Also needs a migration to add indexes.

Mr0grog · 2023-01-25T19:42:31Z

⚠️ Some important, big deal changes here!

This changes the default sorting for /api/v0/versions. It is now capture_time:asc. (It was the only thing that sorted in descending order by default; now everything should default ascending.) I couldn’t find anywhere in other codebases where we use this endpoint without explicitly specifying sorting, so it should be ok.
This changes the default sorting for /api/v0/pages/<uuid>/versions! It is now capture_time:asc (or more importantly, the same as /api/v0/versions) instead of capture_time:desc.
?include_total=true is no longer supported for versions. I can’t find anywhere we actually use this for versions (we definitely use it for pages) — probably because of the performance issue — so I think this is OK.

Mr0grog mentioned this pull request Jan 24, 2023

Add new indexes for range-based pagination #1068

Merged

Mr0grog added 2 commits January 24, 2023 13:46

Supremely rough first cut at range-based pagination for versions

4429f78

This implements the logic for #579 in an ugly, inline way so I could test it out. It definitely needs a lot of cleanup before it can be merged. Also needs a migration to add indexes.

Clean up the logic a bit

0644db7

Mr0grog force-pushed the 579-performant-pagination-for-versions branch from 8e40414 to 0644db7 Compare January 24, 2023 21:46

Mr0grog added 5 commits January 25, 2023 10:57

Update and add tests

f5662b1

Wrap it up in a familiar pagination() interface

03b57e4

A bit more ordering cleanup

7c124b6

Add migration to drop now-redundant indexes

8e9351e

Oops, committed botched version of migration

dd684ec

Mr0grog marked this pull request as ready for review January 25, 2023 19:33

Mr0grog merged commit 4679d10 into main Jan 25, 2023

Mr0grog deleted the 579-performant-pagination-for-versions branch January 25, 2023 23:19

Mr0grog added a commit that referenced this pull request Jan 25, 2023

Release #1066

ba8f600

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-ops that referenced this pull request Jan 25, 2023

Deploy edgi-govdata-archiving/web-monitoring-db#1066 to staging

93735c5

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-ops that referenced this pull request Jan 25, 2023

Deploy edgi-govdata-archiving/web-monitoring-db#1066 to production

26208ee

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use range-based pagination for versions #1066

Use range-based pagination for versions #1066

Mr0grog commented Jan 20, 2023 •

edited

Loading

Mr0grog commented Jan 25, 2023 •

edited

Loading

Use range-based pagination for versions #1066

Use range-based pagination for versions #1066

Conversation

Mr0grog commented Jan 20, 2023 • edited Loading

Overall Summary

Mr0grog commented Jan 25, 2023 • edited Loading

Mr0grog commented Jan 20, 2023 •

edited

Loading

Mr0grog commented Jan 25, 2023 •

edited

Loading