Rethink Pagination #579
One other note here: this is mainly a problem for the…
This will affect the healthcheck script, the Python API wrapper, our analyst sheet generation tools, the UI project, and any other tools people like @ericnost have developed.
So for solution…
Ah! So I think the bigger thing here is that in any version of (3) or (4), you should avoid ever asking for the number of pages/chunks. You wouldn't get it by default, and the intended case is that you just keep iterating to the next chunk, and the next chunk, until you hit the end (because it takes milliseconds to look up a single chunk, but many seconds, during which the app and database are both somewhat tied up, to count how many chunks there are). It's mainly an option for narrow cases where you really need to know, or for backwards compatibility.
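For example, a client could just keep following a next-chunk link until it runs out. A minimal sketch, assuming a `links.next` field and a `data` array in the response (the host name here is a placeholder, not the real API):

```ruby
# Minimal client-side iteration sketch: follow `links.next` until there is no
# next chunk, so no count of pages/chunks is ever needed. The host name is a
# placeholder and the `links.next` / `data` field names are assumptions.
require 'json'
require 'net/http'

url = URI('https://example.com/api/v0/versions?chunk_size=100')
while url
  body = JSON.parse(Net::HTTP.get(url))
  body['data'].each { |version| puts version['uuid'] }
  next_url = body.dig('links', 'next')
  url = next_url && URI(next_url)
end
```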
Yep.
Yeah, I was partly thinking (3.2) would be preferable because it feels cleaner, but also because forcing you to make a separate call makes you think a lot harder before doing the really expensive thing. That said, it's a good point that adding…
OK, I think the first step here is going to be implementing (3.1). Then we can come back and evaluate whether and when to do (4).
Weighing in a bit late... For what it's worth, I like 3.2 a little better than 3.1 for this reason, but the argument for 3.1 is fair and, as you say, easier to implement.
For large tables, counting the number of possible results (in order to determine the URL for the *last* chunk in a result set) turns out to be extremely expensive. (For example, in our production `versions` table, the main query takes < 20 milliseconds, but counting the totals takes 18,000 milliseconds.) Instead, only count the total results and only include the URL for the last chunk if a user explicitly requests it. In the future, we might move to an entirely different paging structure (that's not offset-based). See #579 for more.
Did some experiments with (4), which is quite speedy as you get deep into the result set. We'll need to create a combined index:

```sql
CREATE INDEX idx_test_created_at_uuid ON versions (created_at, uuid);
```

(Since we use UUIDs instead of an int sequence, performance is almost unimaginably bad without a specialized index for this multi-key sort.) Then we can query subsequent results with some slightly custom SQL. I assume this is roughly what the…

```ruby
format_version = lambda {|v| puts "#{v.uuid} / #{v.created_at.to_f} / #{v.capture_url}"}

q = Version.order(created_at: :asc, uuid: :asc).limit(5).each(&format_version); 'ok'
>>> 62bb71ec-1b9b-47cf-bf65-dec432bfce35 / 1494306541.307227 / http://www.nrel.gov/esif
>>> 227d89f1-5815-49d2-85b7-28e9cce13ec7 / 1494306541.3260028 / http://www.nrel.gov/esif
>>> 95fa1bc2-a50b-4b6b-81f3-8eed37b55bcf / 1494306541.34397 / http://www.nrel.gov/esif
>>> 58443e20-656a-45be-94b4-eb239769846a / 1494306541.365584 / http://www.nrel.gov/esif
>>> 78c1155e-402e-4d15-afcb-ada0c5bf4275 / 1494306541.384585 / http://www.nrel.gov/esif
>>> "ok"

q2 = Version.order(created_at: :asc, uuid: :asc).limit(5).where('(versions.created_at, versions.uuid) > (?, ?)', q[2].created_at, q[2].uuid).each(&format_version); 'ok'
>>> 58443e20-656a-45be-94b4-eb239769846a / 1494306541.365584 / http://www.nrel.gov/esif
>>> 78c1155e-402e-4d15-afcb-ada0c5bf4275 / 1494306541.384585 / http://www.nrel.gov/esif
>>> 31ad361c-74f0-459d-abd9-3cfbbb95549e / 1494306541.403688 / http://www.nrel.gov/esif
>>> ef913eb4-8496-4b73-959c-1b1e235e7e87 / 1494306541.428201 / http://www.nrel.gov/sustainable_nrel/rsf.html
>>> 81f4ab70-7a2c-4ba0-b3d3-0558cc6d61b9 / 1494306541.4446778 / http://www.nrel.gov/sustainable_nrel/rsf.html
>>> "ok"

q3 = Version.order(created_at: :asc, uuid: :asc).limit(5).where('(versions.created_at, versions.uuid) > (?, ?)', q2[2].created_at, q2[2].uuid).each(&format_version); 'ok'
>>> ef913eb4-8496-4b73-959c-1b1e235e7e87 / 1494306541.428201 / http://www.nrel.gov/sustainable_nrel/rsf.html
>>> 81f4ab70-7a2c-4ba0-b3d3-0558cc6d61b9 / 1494306541.4446778 / http://www.nrel.gov/sustainable_nrel/rsf.html
>>> c77e661b-96b8-4153-a910-3002118587d4 / 1494306541.46063 / http://www.nrel.gov/sustainable_nrel/rsf.html
>>> 3b22ba1b-bba7-4d62-8dfc-c277a8161892 / 1494306541.4770062 / http://www.nrel.gov/sustainable_nrel/rsf.html
>>> 0b6c24e6-d977-411c-9cee-f02b2ee14307 / 1494306541.497889 / http://www.nrel.gov/sustainable_nrel/rsf.html
>>> "ok"
```
Still need to experiment with (a) reverse ordering (is PG still smart enough to use the combined index?) and (b) adding more…
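For (a), a guess at what the reverse-ordered variant of the queries above might look like (not from the thread): the same keyset technique with `DESC` ordering and a `<` row comparison, which is exactly the case where we'd want to confirm Postgres still uses the combined index.

```ruby
# Hypothetical reverse-ordering experiment, continuing the console session above
# (reuses q and format_version): same keyset approach, but descending, with "<"
# instead of ">".
q_desc = Version.order(created_at: :desc, uuid: :desc)
                .limit(5)
                .where('(versions.created_at, versions.uuid) < (?, ?)',
                       q[2].created_at, q[2].uuid)
                .each(&format_version); 'ok'
```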
For large tables, counting the number of possible results (in order to determine the URL for the *last* chunk in a result set) turns out to be extremely expensive. (For example, in our production `versions` table, the main query takes < 20 ms, but counting the totals takes 18,000 ms.) Instead, only count the total results (`meta.total_results` in the response) and only include the URL for the last chunk (`links.last` in the response) if a user explicitly requests it using the `?include_total=true` query param. In the future, we might move to an entirely different paging structure (that's not offset-based). See #579 for more.
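In controller terms, the idea is roughly the sketch below (illustrative only; the URL helper names are made up, not the actual controller code):

```ruby
# Sketch of (3.1): only run the expensive count(*) when the caller opts in with
# ?include_total=true. first_chunk_url / next_chunk_url / last_chunk_url are
# hypothetical helpers, and chunk is the page of records already loaded.
def pagination_payload(collection, chunk, chunk_size)
  links = { first: first_chunk_url, next: next_chunk_url }
  meta = {}

  # A short chunk means this is the last page, no counting required.
  links[:next] = nil if chunk.length < chunk_size

  if params[:include_total] == 'true'
    total = collection.count             # the slow count(*), now opt-in
    meta[:total_results] = total
    links[:last] = last_chunk_url(total)
  end

  { links: links, meta: meta }
end
```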
Coming back to this way later:
This implements the logic for #579 in an ugly, inline way so I could test it out. It definitely needs a lot of cleanup before it can be merged. Also needs a migration to add indexes.
This is done for versions, and I'm marking it as complete. That said, it is not in place for any other controllers/models. I don't plan to implement it anywhere else right now, since other parts of the app are not in major need of it.
Last week, I made a bunch of changes to our database's configuration (edgi-govdata-archiving/web-monitoring-ops#26) and to the indexes for the versions table (#548) to address massive performance issues we were hitting. Those have made a big difference, but we have one remaining major problem: the `count(*)` queries we use for pagination.

Most of our endpoints that return lists (e.g. `/api/v0/versions`) return a json:api inspired structure like:
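Roughly, something like this sketch (not the exact production format: `links.last` and `meta.total_results` are the fields discussed below, while the other field names and the `chunk`/`chunk_size` params are assumptions):

```ruby
# Illustrative sketch of a list response (shown as a Ruby hash); field names
# other than links.last and meta.total_results are assumptions.
{
  links: {
    first: '/api/v0/versions?chunk=1&chunk_size=100',
    last:  '/api/v0/versions?chunk=4000&chunk_size=100',  # needs the expensive count(*)
    prev:  nil,
    next:  '/api/v0/versions?chunk=2&chunk_size=100'
  },
  meta: { total_results: 400_000 },  # also needs count(*)
  data: []  # the actual version records go here
}
```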
In order to figure out the `last` page, the first thing we always do is run a `count(*)` version of the query that would produce our results (e.g. for `SELECT * FROM versions;`, we'd first do `SELECT count(*) FROM versions;`). That lets us calculate how many chunks/pages of results we have. As a bonus, we added the `meta.total_results` field, which several active use cases (e.g. the healthcheck script) now depend on.

Unfortunately, `count(*)` is pretty slow in Postgres, especially if the whole table does not fit in memory. For example, `GET /api/v0/versions` currently takes a little over 18 seconds: 18 seconds for the count, and 30 milliseconds for the actual data. While use cases like the healthcheck require some kind of support for `count`, we need to take `count` out of the critical path for pagination.

Some ideas:
1. Do some fancy footwork and cache the `count` results, either in a table or in Redis. Every time we import, we would clear the cache.
2. Postgres 9.5+ has a fancy-pants sampling system for doing quick estimations: `SELECT 100 * count(*) AS estimated_total FROM versions TABLESAMPLE SYSTEM (1);` (the `100 *` scales the roughly 1% sample back up). We could use estimated counts. (This still gets slower as the table grows, but relatively speaking, it's quite speedy.)
3. Just remove `links.last` and `meta.total_results` from the data and determine if we are on the last page based on whether there were `< chunk_size` results. For applications that need `total_results`, we could:
   1. Add a `?include_total` query param that runs the `count` query.
   2. Add a `/api/v0/versions/count` endpoint that runs just the count query.
4. Switch to a fundamentally different kind of pagination based on the actual data. (Twitter's API is one example of this; you ask for a number of results and a key to return results after.) This would be a big shift, and comes with some pros and cons: results have to be paged by a sortable key☨, and most of our lists are already ordered by `created_at`, `updated_at`, or `capture_time`, so this might not be too bad.

@bensheldon suggested the `order_query` gem for this on Twitter. This site also has lots of info on the general technique: https://use-the-index-luke.com/no-offset

I don't think (1) and (2) are good solutions. (1) only works well for tables that don't update often (e.g. good for versions, so long as we keep doing big batch imports, but bad for annotations and changes) and is a lot of extra hacky stuff to add. (2) will still slow down as the table grows, and means we could possibly suggest there are more or fewer pages than there actually are.
(3) Feels like a good short-term solution. I prefer (3.2) to (3.1) even though it's more work, but don't have a strong preference.
(4) Feels like the right solution to me, but I’m also OK putting it off a little bit. It’s not immediately obvious to me how to efficiently do random sampling for the health-check use case with it. (Maybe just pick a bunch of random dates to query by?) On the other hand, putting it off is a recipe for never doing it.
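If it helps, the "random dates" musing might look something like this sketch (purely illustrative, not a worked-out design):

```ruby
# Rough sketch of sampling by random dates for the healthcheck: pick N random
# timestamps within the table's time range and take the first version at or
# after each one (cheap with an index on (created_at, uuid)).
def sample_versions(n)
  earliest = Version.minimum(:created_at).to_f
  latest = Version.maximum(:created_at).to_f
  Array.new(n) do
    t = Time.at(rand(earliest..latest))
    Version.order(created_at: :asc, uuid: :asc)
           .where('created_at >= ?', t)
           .first
  end
end
```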
cc @danielballan
☨ In Postgres, you can use a multicolumn key, like `uuid + created_at`, to get sortable uniqueness, so we don't need to invent new columns here. :)