Denormalize Page#latest and Page#earliest #858
Comments
The "last capture date" column in the list view required loading the latest version for each page, which has *major* performance issues (see edgi-govdata-archiving/web-monitoring-db#858). It can take up to a minute to load! Alleviate these issues for now by just dropping that column and not attempting to load data about the latest version of each page. While we're at it, I added in tags to this view, since we've got newfound empty space and they are probably relevant here.
Tested out adding an index on Whether I do the current
Well, I had a stroke of very twisted brilliance while looking at the EXPLAIN output. If you only select the fields that are in the index, it figures out how to do all the work in one step directly on the index. So you can combine that with a join to get this horrifying but quite speedy approach:

SELECT versions.*
FROM versions
  -- This subquery gets the versions we care about in one step purely by using the new index,
  -- but we need the join because the subquery doesn't have all the other fields we care
  -- about (adding them completely deoptimizes the subquery).
  INNER JOIN (
    SELECT DISTINCT ON (page_uuid) page_uuid, capture_time
    FROM "versions"
    WHERE "versions"."different" = true AND "versions"."page_uuid" IN (
      <PAGE_UUID_LIST>
    )
    ORDER BY versions.page_uuid, versions.capture_time DESC
  ) AS latest
  ON
    -- We happen to have an index on capture_time, which makes this join condition fast.
    versions.capture_time = latest.capture_time
    AND versions.page_uuid = latest.page_uuid;

Now the question is how to do that in ActiveRecord. Denormalizing might still be more straightforward, though.
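To make explicit what the query above selects, here is a minimal plain-Ruby sketch of the same "latest version per page" logic. It runs over an in-memory array rather than the database, and `VersionRow`, `latest_versions`, and the sample rows are hypothetical stand-ins, not the project's actual model or data.

```ruby
require 'time'

# Hypothetical stand-in for a row of the versions table.
VersionRow = Struct.new(:uuid, :page_uuid, :capture_time, :different)

# For each requested page, pick the most recently captured version with
# different == true. This mirrors DISTINCT ON (page_uuid) combined with
# ORDER BY page_uuid, capture_time DESC in the SQL above.
def latest_versions(versions, page_uuids)
  versions
    .select { |v| v.different && page_uuids.include?(v.page_uuid) }
    .group_by(&:page_uuid)
    .transform_values { |rows| rows.max_by(&:capture_time) }
end

rows = [
  VersionRow.new('v1', 'page-a', Time.utc(2021, 1, 1), true),
  VersionRow.new('v2', 'page-a', Time.utc(2021, 3, 1), true),
  VersionRow.new('v3', 'page-a', Time.utc(2021, 4, 1), false), # newest, but not "different"
  VersionRow.new('v4', 'page-b', Time.utc(2021, 2, 1), true)
]

latest = latest_versions(rows, ['page-a', 'page-b'])
latest.each do |page_uuid, version|
  puts "#{page_uuid}: #{version.uuid} at #{version.capture_time.iso8601}"
end
```

The database version is fast only because the index lets PostgreSQL answer the inner query without touching the table; this sketch shows the intended result, not the performance characteristics.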
That said, the above does not work for
Looks like you can pass a SQL string to Version.joins:

Version.joins(
  # A string passed to joins is inserted into the query verbatim,
  # so it has to include its own INNER JOIN keyword.
  <<-QUERY
    INNER JOIN (
      SELECT DISTINCT ON (page_uuid) page_uuid, capture_time
      FROM versions
      WHERE different = true AND page_uuid IN (
        #{page_ids.collect {|id| Version.connection.quote(id)}.join(",")}
      )
      ORDER BY page_uuid, capture_time DESC
    ) AS latest
    ON versions.capture_time = latest.capture_time AND versions.page_uuid = latest.page_uuid
  QUERY
)
The main list view of the UI calls /api/v0/pages?include_latest=true, and in the production database, this now takes many seconds to complete. Under the hood, the query that gets the latest version of each page in the result set is the culprit:
web-monitoring-db/app/models/page.rb, lines 35 to 46 in d971f19
This is about as optimized as I can think to make such a query — I’ve experimented with other approaches, CTEs, subqueries, more indexes, etc., but the fundamental problem is that the DB just has to scan too many rows to find the “latest,” no matter what approach we take here.
The UI does this query in order to show when the page last changed. We could resolve the user-facing issue here by removing that column, but it feels like a pretty important piece of data.
Instead, I think it might be time to denormalize some information about the latest version of the page. We could either:
- latest_version_time, or
- latest_version_uuid
The former would solve the need to show the capture time in the UI and be the most performant, but the latter would be a more flexible (and compatible) approach, since we’d still be returning a full version object in the latest field.

The best place to do this would probably be in Version#sync_page_title, where we are updating a denormalized title on the page. It’s pretty much the same condition, we’d just be updating another column:
web-monitoring-db/app/models/version.rb, lines 158 to 168 in d971f19
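As a rough illustration of the denormalization idea, here is a plain-Ruby sketch of a hook that updates the page whenever a newer version arrives. It is not the code at the referenced lines: `PageRecord`, `VersionRecord`, and `sync_page_latest` are hypothetical names, the models are in-memory structs rather than ActiveRecord classes, and the condition (only versions marked different, only if at least as new as the recorded one) is an assumption based on the query discussed in this issue. The sketch stores both candidate columns so the recency check needs no extra lookup.

```ruby
# Hypothetical in-memory stand-ins for the Page and Version models.
PageRecord = Struct.new(:uuid, :latest_version_uuid, :latest_version_time)
VersionRecord = Struct.new(:uuid, :page, :capture_time, :different)

# Sketch of a hook that would run after a version is saved.
# Returns true when the page's denormalized columns were updated.
def sync_page_latest(version)
  page = version.page
  # Assumption: only versions flagged as different count as "latest".
  return false unless version.different
  # Importing an older capture must not move the pointer backward.
  if page.latest_version_time && page.latest_version_time > version.capture_time
    return false
  end

  page.latest_version_uuid = version.uuid
  page.latest_version_time = version.capture_time
  true
end

page = PageRecord.new('page-a', nil, nil)
newer = VersionRecord.new('v2', page, Time.utc(2021, 3, 1), true)
older = VersionRecord.new('v1', page, Time.utc(2021, 1, 1), true)

updated_by_newer = sync_page_latest(newer)
updated_by_older = sync_page_latest(older) # imported out of order
puts "latest for #{page.uuid}: #{page.latest_version_uuid}"
```

With a column like this on the page, the list view could read the latest capture time from the pages table directly instead of scanning versions.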