Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Denormalize Page#latest and Page#earliest #858

Closed
Mr0grog opened this issue Jun 18, 2021 · 5 comments
Closed

Denormalize Page#latest and Page#earliest #858

Mr0grog opened this issue Jun 18, 2021 · 5 comments

Comments

@Mr0grog
Copy link
Member

Mr0grog commented Jun 18, 2021

The main list view of the UI calls /api/v0/pages?include_latest=true, and in the production database, this now takes many seconds to complete. Under the hood, the query that gets the latest version of each page in the result set is the culprit:

has_one :latest,
(lambda do
# DISTINCT ON requires the first ORDER to be the distinct column(s)
relation = order('versions.page_uuid')
# HACK: This is not public API, but I couldn't find a better way. The
# `DISTINCT ON` statement has to be at the start of the WHERE clause, but
# all public methods append to the end.
relation.select_values = ['DISTINCT ON (versions.page_uuid) versions.*']
relation.order('versions.capture_time DESC').where(different: true)
end),
foreign_key: 'page_uuid',
class_name: 'Version'

This is about as optimized as I can think to make such a query — I’ve experimented with other approaches, CTEs, subqueries, more indexes, etc., but the fundamental problem is that the DB just has to scan too many rows to find the “latest,” no matter what approach we take here.

The UI does this query in order to show when the page last changed. We could resolve the user-facing issue here by removing that column, but it feels like a pretty important piece of data.

Instead, I think it might be time to denormalize some information about the latest version of the page. We could either:

  • Add a column with the time of the latest version, e.g. latest_version_time, or
  • Add a column with the ID of the latest version, and join on it, i.e. latest_version_uuid

The former would solve the need to show the capture time in the UI and be the most performant, but the latter would be a more flexible (and compatible) approach, since we’d still be returning a full version object in the latest field.

The best place to do this would probably be in Version#sync_page_title, where we are updating a denormalized title on the page. It’s pretty much the same condition, we’d just be updating another column:

def sync_page_title
if title.present?
most_recent_capture_time = page.latest.capture_time
if most_recent_capture_time.nil? || most_recent_capture_time <= capture_time
page.update(title: title)
return title
end
end
nil
end

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-ui that referenced this issue Jun 22, 2021
The "last capture date" column in the list view required loading the latest version for each page, which has *major* performance issues (see edgi-govdata-archiving/web-monitoring-db#858). It can take up to a minute to load! Alleviate these issues for now by just dropping that column and not attempting to load data about the latest version of each page.

While we're at it, I added in tags to this view, since we've got newfound empty space and they are probably relevant here.
Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-ui that referenced this issue Jun 23, 2021
The "last capture date" column in the list view required loading the latest version for each page, which has *major* performance issues (see edgi-govdata-archiving/web-monitoring-db#858). It can take up to a minute to load! Alleviate these issues for now by just dropping that column and not attempting to load data about the latest version of each page.

While we're at it, I added in tags to this view, since we've got newfound empty space and they are probably relevant here.
@danielballan
Copy link
Contributor

  • Denormalizing seems reasonable.
  • Do both version time and foreign key. Having the version time in their is convenient for sorting, filtering.
  • Add an index 🙃

@Mr0grog
Copy link
Member Author

Mr0grog commented Jun 24, 2021

Tested out adding an index on (page_uuid, capture_time DESC) WHERE different = true, but it sadly doesn’t make any difference. The new index gets used, but only in the same way an index of just page_uuid would get used: it picks all the versions with that page_uuid, then scans that heap and sorts. It doesn't seem clever enough to do all that work just using the index. 😞

Whether I do the current DISTINCT ON (page_uuid) query or use a RANK() function (fancy new thing I learned about a few months back), EXPLAIN ANALYZE always starts with this at the core of the plan:

->  Bitmap Heap Scan on versions  (cost=97.34..4449.32 rows=1145 width=1309) (actual time=0.199..0.992 rows=684 loops=1)
        Recheck Cond: ((page_uuid = ANY ('{<LIST_OF_UUIDS>}'::uuid[])) AND different)
        Heap Blocks: exact=515
        ->  Bitmap Index Scan on test_index_page_uuid_capture_time_different  (cost=0.00..97.05 rows=1145 width=0) (actual time=0.146..0.146 rows=684 loops=1)
            Index Cond: (page_uuid = ANY ('{<LIST_OF_UUIDS>}'::uuid[]))

@Mr0grog
Copy link
Member Author

Mr0grog commented Jun 24, 2021

Well, I had a stroke of very twisted brilliance while looking at the EXPLAIN output. If you only select the fields that are in the index, it figures out how to do all the work in one step directly on the index. So you can combine that with a join to get this horrifying but quite speedy approach:

SELECT versions.*
FROM versions
  -- This subquery gets the versions we care about in one step purely by using the new index,
  -- but we need the join because the subquery doesn't have all the other fields we care
  -- about (adding them completely deoptimizes the subquery).
  INNER JOIN (
    SELECT
      DISTINCT ON (page_uuid) page_uuid, capture_time
      FROM "versions" WHERE "versions"."different" = true AND "versions"."page_uuid" IN (
        <PAGE_UUID_LIST>
      )
      ORDER BY versions.page_uuid, versions.capture_time DESC
  ) as latest
  ON
    -- We happen to have an index on capture_time, which makes this join condition fast.
    versions.capture_time = latest.capture_time
    AND versions.page_uuid = latest.page_uuid;

Now the question is how to do that in ActiveRecord. Denormalizing might still be more straightforward, though.

@Mr0grog
Copy link
Member Author

Mr0grog commented Jun 24, 2021

That said, the above does not work for earliest instead of latest. Switching from capture_time DESC to ASC deoptimizes and flips it back to not doing all the work on the index.

Now the question is how to do that in ActiveRecord.

Looks like you can pass a SQL string to joins, so this can be something like:

Version.joins(
  <<-QUERY
    (
      SELECT DISTINCT ON (page_uuid) page_uuid, capture_time
      FROM versions
      WHERE different = true AND page_uuid IN (
        #{page_ids.collect {|id| Version.connection.quote(id)}.join(",")}
      )
      ORDER BY page_uuid, capture_time DESC
    ) as latest
    ON versions.capture_time = latest.capture_time AND versions.page_uuid = latest.page_uuid
  QUERY
)

@stale
Copy link

stale bot commented Jan 8, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants