
Rethink Pagination #579

Closed · Mr0grog opened this issue Aug 20, 2019 · 10 comments

Mr0grog commented Aug 20, 2019

Last week, I made a bunch of changes to our database’s configuration (edgi-govdata-archiving/web-monitoring-ops#26) and to the indexes for the versions table (#548) to address massive performance issues we were hitting. Those have made a big difference, but we have one remaining major problem: the count(*) queries we use for pagination.

Most of our endpoints that return lists (e.g. /api/v0/versions) return a JSON:API-inspired structure like:

{
  "links": {
    "first": "https://api.monitoring.envirodatagov.org/api/v0/versions?chunk=1&chunk_size=100",
    "last": "https://api.monitoring.envirodatagov.org/api/v0/versions?chunk=23042&chunk_size=100",
    "prev": null,
    "next": "https://api.monitoring.envirodatagov.org/api/v0/versions?chunk=2&chunk_size=100"
  },
  "meta": {
    "total_results": 2304117
  },
  "data":[...]
}

In order to figure out the last page, the first thing we always do is run a count(*) version of the query that would produce our results (e.g. for SELECT * FROM versions;, we’d first do SELECT count(*) FROM versions;). That lets us calculate how many chunks/pages of results we have. As a bonus, we added the meta.total_results field, which several active use cases (e.g. the healthcheck script) now depend on.
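For illustration (this is not a quote from the actual implementation), the chunk math is just:

total_results = 2_304_117                            # from SELECT count(*) FROM versions;
chunk_size = 100
last_chunk = (total_results / chunk_size.to_f).ceil  # => 23042, matching links.last above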

Unfortunately, count(*) is pretty slow in Postgres, especially if the whole table does not fit in memory. For example, GET /api/v0/versions currently takes a little over 18 seconds: 18 seconds for the count, and 30 milliseconds for the actual data. While use cases like the healthcheck require some kind of support for count, we need to take count out of the critical path for pagination.

Some ideas:

  1. Do some fancy footwork and cache the count results, either in a table or in Redis. Every time we import, we would clear the cache.

  2. Postgres 9.5+ has a fancy-pants sampling system for doing quick estimations: SELECT 100 * count(*) AS estimated_total FROM versions TABLESAMPLE SYSTEM (1); We could use estimated counts. (This still gets slower as the table grows, but relatively speaking, it’s quite speedy.)

  3. Just remove links.last and meta.total_results from the data and determine whether we are on the last page based on whether there were < chunk_size results (see the sketch just after this list). For applications that need total_results, we could:

    1. Have a query parameter like ?include_total that runs the count query.
    2. Have a new endpoint on all collections like /api/v0/versions/count that runs just the count query.
  4. Switch to a fundamentally different kind of pagination based on the actual data. (Twitter’s API is one example of this; you ask for a number of results and a key to return results after.) This would be a big shift, and comes with some pros and cons:

    • 👍 Super fast! Offset-based pagination like ours gets slower on each subsequent page of results. This can leverage an index to get to the beginning of the results pretty quickly.
    • 👍 You never miss or duplicate results because new data got added while you were iterating through pages of results.
    • 👎 Not very flexible. We currently allow sorting on pretty much any column, and this would require us to limit those options (you have to have a sortable column with unique values☨ for this technique). On the other hand, I’m not sure of any common use cases we have that sort on anything other than created_at, updated_at, or capture_time, so this might not be too bad.
    • 👎 It’s harder to randomly skip around because the key you resume from is probably unpredictable. (See the health-check use case.)

    @bensheldon suggested the order_query gem for this on Twitter. This site also has lots of info on the general technique: https://use-the-index-luke.com/no-offset
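For completeness, the last-page check in (3) needs no counting at all. A minimal sketch (scope and chunk_size are illustrative names; fetching one extra row is a common trick that also avoids serving one final empty chunk when the data ends exactly on a chunk boundary):

records = scope.limit(chunk_size + 1).to_a   # fetch one row past the chunk boundary
has_next = records.length > chunk_size       # an extra row means another chunk exists
records = records.first(chunk_size)          # then only return the requested chunk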

I don’t think (1) and (2) are good solutions. (1) only works well for tables that don’t update often (e.g. good for versions, so long as we keep doing big batch imports, but bad for annotations and changes) and is a lot of extra hacky stuff to add. (2) will still get slower as the table grows, and means we could suggest there are more or fewer pages than there actually are.

(3) Feels like a good short-term solution. I prefer (3.2) to (3.1) even though it’s more work, but I don’t have a strong preference.

(4) Feels like the right solution to me, but I’m also OK putting it off a little bit. It’s not immediately obvious to me how to efficiently do random sampling for the health-check use case with it. (Maybe just pick a bunch of random dates to query by?) On the other hand, putting it off is a recipe for never doing it.
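To make that parenthetical concrete, random sampling under (4) might look something like this (pure sketch, nothing here exists yet):

# Pick random timestamps within the table's range, then grab the first version
# at or after each one. Each lookup is a cheap index seek, no count(*) involved.
earliest = Version.minimum(:created_at).to_f
latest = Version.maximum(:created_at).to_f
samples = Array.new(10) do
  Version.order(created_at: :asc, uuid: :asc)
         .where('created_at >= ?', Time.at(rand(earliest..latest)))
         .first
end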

cc @danielballan


☨ In Postgres, you can use a multicolumn key, like uuid + created_at to get sortable uniqueness, so we don’t need to invent new columns here. :)

Mr0grog added the question, API (Changes to the public API), and priority-★★☆ labels Aug 20, 2019

Mr0grog commented Aug 20, 2019

One other note here: this is mainly a problem for the /api/v0/versions endpoint and, to a much lesser extent, the /api/v0/pages endpoint. I think it’s important to take a universal approach across all endpoints, though.


Mr0grog commented Aug 20, 2019

This will affect the healthcheck script, the python API wrapper, our analyst sheet generation tools, the UI project, and any other tools people like @ericnost have developed.


SYU15 commented Aug 21, 2019

So for solution 3.2, is the implication that we'd need to make a new API call to get the total number of pages to generate the pagination? Why is 3.2 preferable? Mainly because the API design will be cleaner? I have a slight preference for minimizing the number of calls made from the UI, but if one approach is significantly faster/better, it's not that big of a deal.


Mr0grog commented Aug 21, 2019

I have a slight preference for minimizing the number of calls made from the UI but if one approach is significantly faster/better, it's not that big of a deal.

Ah! So I think the bigger thing here is that in any version of (3) or (4), you should avoid ever asking for the number of pages/chunks. You wouldn’t get it by default, and the intended case is that you just keep iterating to the next chunk, and the next chunk, until you hit the end (because it takes milliseconds to look up a single chunk, but many seconds, during which the app and database are both somewhat tied up, to count how many chunks there are). It’s mainly an option for narrow cases where you really need to know, or for backwards compatibility.
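Concretely, a client just follows links.next until it runs out; something like this hypothetical snippet (not code from any of our actual tools):

require 'json'
require 'net/http'

url = 'https://api.monitoring.envirodatagov.org/api/v0/versions?chunk_size=100'
while url
  body = JSON.parse(Net::HTTP.get(URI(url)))
  body['data'].each do |version|
    # ...do something with each version...
  end
  url = body.dig('links', 'next')  # nil on the last chunk, which ends the loop
end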

So for solution 3.2, is the implication that we'd need to make a new API call to get the total number of pages to generate the pagination?

Yep.

Why is 3.2 preferable? Mainly because the API design will be cleaner? I have a slight preference for minimizing the number of calls…

Yeah, I was partially thinking (3.2) would be preferable because it feels cleaner, but also because forcing you to make a separate call makes you think a lot harder before doing the really expensive thing.

That said, it’s a good point that adding ?include_total to a normal call would let you do two things at once and be at least slightly more efficient. I guess I wasn’t really thinking about that.


Mr0grog commented Aug 23, 2019

OK, I think the first step here is going to be implementing (3.1). Then we can come back and evaluate whether and when to do (4).

  • Update healthcheck script to use ?include_total.
  • Update weekly analyst task sheet script to not check totals or last page link.
  • Update DB to only show meta.total_results and links.last when ?include_total is set. (Rough sketch below.)
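The DB-side change might look roughly like this (hypothetical sketch; helper names like chunk_url and chunk_size are made up, not our actual code):

def paging_info(scope)
  links = { first: chunk_url(1), prev: prev_chunk_url, next: next_chunk_url }
  meta = {}
  # Only run the expensive count(*) when explicitly asked for:
  if params[:include_total] == 'true'
    total = scope.count
    links[:last] = chunk_url((total / chunk_size.to_f).ceil)
    meta[:total_results] = total
  end
  { links: links, meta: meta }
end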

Mr0grog self-assigned this Aug 23, 2019
danielballan commented

Yeah, I was partially thinking (3.2) would be preferable because it feels cleaner, but also because forcing you to make a separate call makes you think a lot harder before doing the really expensive thing.

Weighing in a bit late... For what it's worth, I like 3.2 a little better than 3.1 for this reason, but the argument for 3.1 is fair and, as you say, easier to implement.

Mr0grog added a commit that referenced this issue Sep 4, 2019
For large tables, counting the number of possible results (in order to determine the URL for the *last* chunk in a result set) turns out to be extremely expensive. (For example, in our production `versions` table, the main query takes < 20 milliseconds, but counting the totals takes 18,000 milliseconds.) Instead, only count the total results and only include the URL for the last chunk if a user explicitly requests it.

In the future, we might move to an entirely different paging structure (that's not offset-based). See #579 for more.

Mr0grog commented Sep 12, 2019

Did some experiments with (4), which is quite speedy as you get deep into the result set. We’ll need to create a combined index on (<sort_column>, uuid), e.g.:

CREATE INDEX idx_test_created_at_uuid ON versions (created_at, uuid);

(Since we use UUIDs instead of an int sequence, performance is almost unimaginably bad without a specialized index for this multi-key sort.)

Then we can query subsequent results with some slightly custom SQL. I assume this is roughly what the order_query gem would do for us. Example with overlapping result sets so we can verify the results are contiguous and maintain correct sorting:

format_version = lambda {|v| puts "#{v.uuid} / #{v.created_at.to_f} / #{v.capture_url}"}

q = Version.order(created_at: :asc, uuid: :asc).limit(5).each(&format_version); 'ok'
>>> 62bb71ec-1b9b-47cf-bf65-dec432bfce35 / 1494306541.307227 / http://www.nrel.gov/esif
>>> 227d89f1-5815-49d2-85b7-28e9cce13ec7 / 1494306541.3260028 / http://www.nrel.gov/esif
>>> 95fa1bc2-a50b-4b6b-81f3-8eed37b55bcf / 1494306541.34397 / http://www.nrel.gov/esif
>>> 58443e20-656a-45be-94b4-eb239769846a / 1494306541.365584 / http://www.nrel.gov/esif
>>> 78c1155e-402e-4d15-afcb-ada0c5bf4275 / 1494306541.384585 / http://www.nrel.gov/esif
>>> "ok"

q2 = Version.order(created_at: :asc, uuid: :asc).limit(5).where('(versions.created_at, versions.uuid) > (?, ?)', q[2].created_at, q[2].uuid).each(&format_version); 'ok'
>>> 58443e20-656a-45be-94b4-eb239769846a / 1494306541.365584 / http://www.nrel.gov/esif
>>> 78c1155e-402e-4d15-afcb-ada0c5bf4275 / 1494306541.384585 / http://www.nrel.gov/esif
>>> 31ad361c-74f0-459d-abd9-3cfbbb95549e / 1494306541.403688 / http://www.nrel.gov/esif
>>> ef913eb4-8496-4b73-959c-1b1e235e7e87 / 1494306541.428201 / http://www.nrel.gov/sustainable_nrel/rsf.html
>>> 81f4ab70-7a2c-4ba0-b3d3-0558cc6d61b9 / 1494306541.4446778 / http://www.nrel.gov/sustainable_nrel/rsf.html
>>> "ok"

q3 = Version.order(created_at: :asc, uuid: :asc).limit(5).where('(versions.created_at, versions.uuid) > (?, ?)', q2[2].created_at, q2[2].uuid).each(&format_version); 'ok'
>>> ef913eb4-8496-4b73-959c-1b1e235e7e87 / 1494306541.428201 / http://www.nrel.gov/sustainable_nrel/rsf.html
>>> 81f4ab70-7a2c-4ba0-b3d3-0558cc6d61b9 / 1494306541.4446778 / http://www.nrel.gov/sustainable_nrel/rsf.html
>>> c77e661b-96b8-4153-a910-3002118587d4 / 1494306541.46063 / http://www.nrel.gov/sustainable_nrel/rsf.html
>>> 3b22ba1b-bba7-4d62-8dfc-c277a8161892 / 1494306541.4770062 / http://www.nrel.gov/sustainable_nrel/rsf.html
>>> 0b6c24e6-d977-411c-9cee-f02b2ee14307 / 1494306541.497889 / http://www.nrel.gov/sustainable_nrel/rsf.html
>>> "ok"


Mr0grog commented Sep 12, 2019

Still need to experiment with (a) reverse ordering (is PG still smart enough to use the combined index?) and (b) adding more where conditions.

Mr0grog added a commit that referenced this issue Sep 17, 2019
For large tables, counting the number of possible results (in order to determine the URL for the *last* chunk in a result set) turns out to be extremely expensive. (For example, in our production `versions` table, the main query takes < 20 ms, but counting the totals takes 18,000 ms.) Instead, only count the total results (`meta.total_results` in the response) and only include the URL for the last chunk (`links.last` in the response) if a user explicitly requests it using the `?include_total=true` query param.

In the future, we might move to an entirely different paging structure (that's not offset-based). See #579 for more.
The stale bot added the stale label Mar 11, 2020
Mr0grog removed the stale label Mar 11, 2020
edgi-govdata-archiving deleted a comment from the stale bot Mar 11, 2020
The stale bot added the stale label Sep 7, 2020
Mr0grog added the never-stale label and removed the stale label Sep 8, 2020
edgi-govdata-archiving deleted a comment from the stale bot Sep 8, 2020

Mr0grog commented Jan 20, 2023

Coming back to this way later:

  • Yes, it works in reverse, though you have to be careful to specify the order on both parts of the compound key (e.g. created_at: :desc, uuid: :desc); see the sketch after this list.
  • For versions, we also commonly scope queries by page. The indexes described above don’t help with that (instead, PG combines the existing indexes on created_at and on page_uuid), but if there’s another index with page_uuid first (not last), it will get used and make things pretty speedy. Scoping by page already slims down the data set enough that things are reasonably fast, so this may not be worthwhile. Or maybe it’s useful to replace the index on page_uuid.
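For reference, a sketch of both points (the boundary variable stands in for the final record of the previous chunk, and the migration class is hypothetical and untested):

# Descending keyset pagination: both key parts flip to :desc, and the
# row-value comparison flips from > to <.
Version.order(created_at: :desc, uuid: :desc)
       .where('(versions.created_at, versions.uuid) < (?, ?)', boundary.created_at, boundary.uuid)
       .limit(5)

# And the page-scoped variant, with page_uuid first (Rails version illustrative):
class AddPageScopedKeysetIndex < ActiveRecord::Migration[7.0]
  def change
    add_index :versions, [:page_uuid, :created_at, :uuid]
  end
end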

Mr0grog added a commit that referenced this issue Jan 20, 2023
This implements the logic for #579 in an ugly, inline way so I could test it out. It definitely needs a lot of cleanup before it can be merged. Also needs a migration to add indexes.
Mr0grog added a commit that referenced this issue Jan 24, 2023
This implements the logic for #579 in an ugly, inline way so I could test it out. It definitely needs a lot of cleanup before it can be merged. Also needs a migration to add indexes.

Mr0grog commented Jan 25, 2023

This is done for versions, so I’m marking it as complete. That said, it is not in place for any other controllers/models, and I don’t plan to implement it anywhere else right now, since other parts of the app are not in major need of it.
