Denormalize `Page#latest` and `Page#earliest` #858

Mr0grog · 2021-06-18T09:07:57Z

The main list view of the UI calls /api/v0/pages?include_latest=true, and in the production database, this now takes many seconds to complete. Under the hood, the query that gets the latest version of each page in the result set is the culprit:

web-monitoring-db/app/models/page.rb

Lines 35 to 46 in d971f19

    
           has_one :latest, 
        
                   (lambda do 
        
                     # DISTINCT ON requires the first ORDER to be the distinct column(s) 
        
                     relation = order('versions.page_uuid') 
        
                     # HACK: This is not public API, but I couldn't find a better way. The 
        
                     # `DISTINCT ON` statement has to be at the start of the WHERE clause, but 
        
                     # all public methods append to the end. 
        
                     relation.select_values = ['DISTINCT ON (versions.page_uuid) versions.*'] 
        
                     relation.order('versions.capture_time DESC').where(different: true) 
        
                   end), 
        
                   foreign_key: 'page_uuid', 
        
                   class_name: 'Version'

This is about as optimized as I can think to make such a query — I’ve experimented with other approaches, CTEs, subqueries, more indexes, etc., but the fundamental problem is that the DB just has to scan too many rows to find the “latest,” no matter what approach we take here.

The UI does this query in order to show when the page last changed. We could resolve the user-facing issue here by removing that column, but it feels like a pretty important piece of data.

Instead, I think it might be time to denormalize some information about the latest version of the page. We could either:

Add a column with the time of the latest version, e.g. latest_version_time, or
Add a column with the ID of the latest version, and join on it, i.e. latest_version_uuid

The former would solve the need to show the capture time in the UI and be the most performant, but the latter would be a more flexible (and compatible) approach, since we’d still be returning a full version object in the latest field.

The best place to do this would probably be in Version#sync_page_title, where we are updating a denormalized title on the page. It’s pretty much the same condition, we’d just be updating another column:

web-monitoring-db/app/models/version.rb

Lines 158 to 168 in d971f19

    
           def sync_page_title 
        
             if title.present? 
        
               most_recent_capture_time = page.latest.capture_time 
        
               if most_recent_capture_time.nil? || most_recent_capture_time <= capture_time 
        
                 page.update(title: title) 
        
                 return title 
        
               end 
        
             end 
        
             nil 
        
           end

The text was updated successfully, but these errors were encountered:

The "last capture date" column in the list view required loading the latest version for each page, which has *major* performance issues (see edgi-govdata-archiving/web-monitoring-db#858). It can take up to a minute to load! Alleviate these issues for now by just dropping that column and not attempting to load data about the latest version of each page. While we're at it, I added in tags to this view, since we've got newfound empty space and they are probably relevant here.

danielballan · 2021-06-23T19:31:22Z

Denormalizing seems reasonable.
Do both version time and foreign key. Having the version time in their is convenient for sorting, filtering.
Add an index 🙃

Mr0grog · 2021-06-24T03:03:33Z

Tested out adding an index on (page_uuid, capture_time DESC) WHERE different = true, but it sadly doesn’t make any difference. The new index gets used, but only in the same way an index of just page_uuid would get used: it picks all the versions with that page_uuid, then scans that heap and sorts. It doesn't seem clever enough to do all that work just using the index. 😞

Whether I do the current DISTINCT ON (page_uuid) query or use a RANK() function (fancy new thing I learned about a few months back), EXPLAIN ANALYZE always starts with this at the core of the plan:

->  Bitmap Heap Scan on versions  (cost=97.34..4449.32 rows=1145 width=1309) (actual time=0.199..0.992 rows=684 loops=1)
        Recheck Cond: ((page_uuid = ANY ('{<LIST_OF_UUIDS>}'::uuid[])) AND different)
        Heap Blocks: exact=515
        ->  Bitmap Index Scan on test_index_page_uuid_capture_time_different  (cost=0.00..97.05 rows=1145 width=0) (actual time=0.146..0.146 rows=684 loops=1)
            Index Cond: (page_uuid = ANY ('{<LIST_OF_UUIDS>}'::uuid[]))

Mr0grog · 2021-06-24T03:29:57Z

Well, I had a stroke of very twisted brilliance while looking at the EXPLAIN output. If you only select the fields that are in the index, it figures out how to do all the work in one step directly on the index. So you can combine that with a join to get this horrifying but quite speedy approach:

SELECT versions.*
FROM versions
  -- This subquery gets the versions we care about in one step purely by using the new index,
  -- but we need the join because the subquery doesn't have all the other fields we care
  -- about (adding them completely deoptimizes the subquery).
  INNER JOIN (
    SELECT
      DISTINCT ON (page_uuid) page_uuid, capture_time
      FROM "versions" WHERE "versions"."different" = true AND "versions"."page_uuid" IN (
        <PAGE_UUID_LIST>
      )
      ORDER BY versions.page_uuid, versions.capture_time DESC
  ) as latest
  ON
    -- We happen to have an index on capture_time, which makes this join condition fast.
    versions.capture_time = latest.capture_time
    AND versions.page_uuid = latest.page_uuid;

Now the question is how to do that in ActiveRecord. Denormalizing might still be more straightforward, though.

Mr0grog · 2021-06-24T03:48:29Z

That said, the above does not work for earliest instead of latest. Switching from capture_time DESC to ASC deoptimizes and flips it back to not doing all the work on the index.

Now the question is how to do that in ActiveRecord.

Looks like you can pass a SQL string to joins, so this can be something like:

Version.joins(
  <<-QUERY
    (
      SELECT DISTINCT ON (page_uuid) page_uuid, capture_time
      FROM versions
      WHERE different = true AND page_uuid IN (
        #{page_ids.collect {|id| Version.connection.quote(id)}.join(",")}
      )
      ORDER BY page_uuid, capture_time DESC
    ) as latest
    ON versions.capture_time = latest.capture_time AND versions.page_uuid = latest.page_uuid
  QUERY
)

stale · 2022-01-08T22:35:14Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

Mr0grog added the enhancement label Jun 18, 2021

Mr0grog mentioned this issue Jun 22, 2021

Speed up list view by removing "last capture date" edgi-govdata-archiving/web-monitoring-ui#751

Merged

stale bot added the stale label Jan 8, 2022

stale bot closed this as completed Apr 16, 2022

Mr0grog mentioned this issue Jan 17, 2023

Put this project to rest edgi-govdata-archiving/web-monitoring#168

Closed

37 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Denormalize `Page#latest` and `Page#earliest` #858

Denormalize `Page#latest` and `Page#earliest` #858

Mr0grog commented Jun 18, 2021

danielballan commented Jun 23, 2021

Mr0grog commented Jun 24, 2021

Mr0grog commented Jun 24, 2021 •

edited

Loading

Mr0grog commented Jun 24, 2021 •

edited

Loading

stale bot commented Jan 8, 2022

Denormalize Page#latest and Page#earliest #858

Denormalize Page#latest and Page#earliest #858

Comments

Mr0grog commented Jun 18, 2021

danielballan commented Jun 23, 2021

Mr0grog commented Jun 24, 2021

Mr0grog commented Jun 24, 2021 • edited Loading

Mr0grog commented Jun 24, 2021 • edited Loading

stale bot commented Jan 8, 2022

Denormalize `Page#latest` and `Page#earliest` #858

Denormalize `Page#latest` and `Page#earliest` #858

Mr0grog commented Jun 24, 2021 •

edited

Loading

Mr0grog commented Jun 24, 2021 •

edited

Loading