
introduces a faster query to tuple GC #1859

Merged
merged 1 commit into from
Apr 10, 2024

Conversation

vroldanbet
Contributor

we've noticed that even after #1550 was introduced and made the nested GC SELECT query faster, deletions were still causing spikes on Postgres and Aurora instances.

The proposed query appears to be an order of magnitude less costly than the current query on a test PG 15.6 instance with 5M dead tuples and 5M live tuples.

Before

WITH rows AS (SELECT namespace, object_id, relation, userset_namespace, userset_object_id, userset_relation, created_xid, deleted_xid FROM relation_tuple WHERE deleted_xid < '901' LIMIT 1000)
DELETE FROM relation_tuple
WHERE (namespace, object_id, relation, userset_namespace, userset_object_id, userset_relation, created_xid, deleted_xid) IN (SELECT namespace, object_id, relation, userset_namespace, userset_object_id, userset_relation, created_xid, deleted_xid FROM rows);
| -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Delete on relation_tuple  (cost=65.94..8656.38 rows=0 width=0) (actual time=14.375..14.376 rows=0 loops=1)                                                                                                                                                                                                                                                                                                     |
|   ->  Nested Loop  (cost=65.94..8656.38 rows=1000 width=110) (actual time=1.100..14.026 rows=1000 loops=1)                                                                                                                                                                                                                                                                                                     |
|         ->  HashAggregate  (cost=65.38..75.38 rows=1000 width=184) (actual time=1.061..1.319 rows=1000 loops=1)                                                                                                                                                                                                                                                                                                |
|               Group Key: (rows.namespace)::text, (rows.object_id)::text, (rows.relation)::text, (rows.userset_namespace)::text, (rows.userset_object_id)::text, (rows.userset_relation)::text, rows.created_xid, rows.deleted_xid                                                                                                                                                                              |
|               Batches: 1  Memory Usage: 841kB                                                                                                                                                                                                                                                                                                                                                                  |
|               ->  Subquery Scan on rows  (cost=0.43..45.38 rows=1000 width=184) (actual time=0.150..0.647 rows=1000 loops=1)                                                                                                                                                                                                                                                                                   |
|                     ->  Limit  (cost=0.43..35.38 rows=1000 width=80) (actual time=0.143..0.488 rows=1000 loops=1)                                                                                                                                                                                                                                                                                              |
|                           ->  Index Scan using ix_gc_index on relation_tuple relation_tuple_1  (cost=0.43..178606.61 rows=5110010 width=80) (actual time=0.142..0.444 rows=1000 loops=1)                                                                                                                                                                                                                       |
|                                 Index Cond: (deleted_xid < '901'::xid8)                                                                                                                                                                                                                                                                                                                                        |
|         ->  Index Scan using uq_relation_tuple_living_xid on relation_tuple  (cost=0.56..8.57 rows=1 width=86) (actual time=0.012..0.012 rows=1 loops=1000)                                                                                                                                                                                                                                                    |
|               Index Cond: (((namespace)::text = (rows.namespace)::text) AND ((object_id)::text = (rows.object_id)::text) AND ((relation)::text = (rows.relation)::text) AND ((userset_namespace)::text = (rows.userset_namespace)::text) AND ((userset_object_id)::text = (rows.userset_object_id)::text) AND ((userset_relation)::text = (rows.userset_relation)::text) AND (deleted_xid = rows.deleted_xid)) |
|               Filter: (rows.created_xid = created_xid)                                                                                                                                                                                                                                                                                                                                                         |
| Planning Time: 0.674 ms                                                                                                                                                                                                                                                                                                                                                                                        |
| Execution Time: 14.464 ms                                                                                                                                                                                                                                                                                                                                                                                      |

After

WITH rows AS (SELECT ctid FROM relation_tuple WHERE deleted_xid < '901' LIMIT 1000)
DELETE FROM relation_tuple
WHERE ctid IN (SELECT ctid FROM rows);
| --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Delete on relation_tuple  (cost=47.89..854.38 rows=0 width=0) (actual time=3.598..3.598 rows=0 loops=1)                                                                                 |
|   ->  Nested Loop  (cost=47.89..854.38 rows=1000 width=36) (actual time=2.682..3.237 rows=1000 loops=1)                                                                                 |
|         ->  HashAggregate  (cost=47.88..49.88 rows=200 width=36) (actual time=2.679..2.758 rows=1000 loops=1)                                                                           |
|               Group Key: rows.ctid                                                                                                                                                      |
|               Batches: 1  Memory Usage: 209kB                                                                                                                                           |
|               ->  Subquery Scan on rows  (cost=0.43..45.38 rows=1000 width=36) (actual time=0.193..2.503 rows=1000 loops=1)                                                             |
|                     ->  Limit  (cost=0.43..35.38 rows=1000 width=6) (actual time=0.183..2.406 rows=1000 loops=1)                                                                        |
|                           ->  Index Scan using ix_gc_index on relation_tuple relation_tuple_1  (cost=0.43..178606.61 rows=5110010 width=6) (actual time=0.182..2.363 rows=1000 loops=1) |
|                                 Index Cond: (deleted_xid < '901'::xid8)                                                                                                                 |
|         ->  Tid Scan on relation_tuple  (cost=0.00..4.01 rows=1 width=6) (actual time=0.000..0.000 rows=1 loops=1000)                                                                   |
|               TID Cond: (ctid = rows.ctid)                                                                                                                                              |
| Planning Time: 3.335 ms                                                                                                                                                                 |
| Execution Time: 3.706 ms                                                                                                                                                                |

@github-actions github-actions bot added the area/datastore Affects the storage system label Apr 8, 2024
@vroldanbet
Contributor Author

vroldanbet commented Apr 9, 2024

I was investigating whether ctid is stable within a CTE, which led me to this post describing a gotcha with partitioned tables:

There actually is a serious pitfall with using ctid for identifying rows to delete: in partitioned tables, ctid is not unique. It is only unique in a given partition so, if one uses it as a criterion for deletion, this will likely result in deleting the desired row as well as the otherwise totally unrelated rows from other partitions (ask me how I know :-/ ).

I ran the query with tableoid and it seems to work just fine: it simply adds another filter phase, which shouldn't add much overhead unless the same ctid is found in multiple tables. So I think this approach seems fine.
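For reference, the plan below presumably corresponds to a delete of roughly this shape (reconstructed from the plan's group key and filter; the original comment shows only the EXPLAIN output):

```sql
-- Reconstructed sketch: qualify the ctid with tableoid so rows in other
-- partitions that happen to share a ctid are not deleted.
WITH rows AS (SELECT tableoid, ctid FROM relation_tuple WHERE deleted_xid < '901' LIMIT 1000)
DELETE FROM relation_tuple
WHERE (tableoid, ctid) IN (SELECT tableoid, ctid FROM rows);
```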

Delete on relation_tuple  (cost=72.12..879.11 rows=0 width=0) (actual time=1.427..1.428 rows=0 loops=1)
  ->  Nested Loop  (cost=72.12..879.11 rows=500 width=40) (actual time=0.569..1.157 rows=1000 loops=1)
        ->  HashAggregate  (cost=72.11..74.11 rows=200 width=44) (actual time=0.565..0.648 rows=1000 loops=1)
              Group Key: rows.tableoid, rows.ctid
              Batches: 1  Memory Usage: 209kB
              ->  Subquery Scan on rows  (cost=0.00..67.11 rows=1000 width=44) (actual time=0.024..0.407 rows=1000 loops=1)
                    ->  Limit  (cost=0.00..57.11 rows=1000 width=10) (actual time=0.020..0.315 rows=1000 loops=1)
                          ->  Seq Scan on relation_tuple relation_tuple_1  (cost=0.00..229045.70 rows=4010286 width=10) (actual time=0.020..0.272 rows=1000 loops=1)
                                Filter: (deleted_xid < '901'::xid8)
        ->  Tid Scan on relation_tuple  (cost=0.00..4.02 rows=1 width=10) (actual time=0.000..0.000 rows=1 loops=1000)
              TID Cond: (ctid = rows.ctid)
              Filter: (rows.tableoid = tableoid)
Planning Time: 0.106 ms
Execution Time: 1.470 ms

@vroldanbet vroldanbet marked this pull request as ready for review April 10, 2024 10:53
@vroldanbet vroldanbet requested a review from a team April 10, 2024 10:53
@vroldanbet vroldanbet self-assigned this Apr 10, 2024
@josephschorr
Member

@vroldanbet Is there a test we can add for GC along a partitioned table to ensure the use of the table ID fixes this potential issue?

@vroldanbet
Contributor Author

@vroldanbet Is there a test we can add for GC along a partitioned table to ensure the use of the table ID fixes this potential issue?

We cannot control tableoid and ctid; those are system columns that Postgres manages internally.

What the docs say about tableoid:

The OID of the table containing this row. This column is particularly handy for queries that select from partitioned tables or inheritance hierarchies, since without it, it's difficult to tell which individual table a row came from. The tableoid can be joined against the oid column of pg_class to obtain the table name.

What the docs say about ctid:

The physical location of the row version within its table. Note that although the ctid can be used to locate the row version very quickly, a row's ctid will change if it is updated or moved by VACUUM FULL. Therefore ctid is useless as a long-term row identifier. A primary key should be used to identify logical rows.
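The instability the docs describe is easy to observe; this is a hypothetical sketch on a scratch table (not part of the PR), showing that an UPDATE creates a new row version at a new ctid:

```sql
-- Sketch: ctid identifies a row *version*, not a logical row.
CREATE TABLE scratch (v int);
INSERT INTO scratch VALUES (1);
SELECT ctid FROM scratch;   -- e.g. (0,1)
UPDATE scratch SET v = 2;
SELECT ctid FROM scratch;   -- a different tuple location, e.g. (0,2)
```

This is why ctid is only safe here: the GC query selects the ctids and deletes them within a single statement, so there is no window for the rows to move.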

Testing a scenario where two different relationships to be GCed have the same ctid would require us to:

  • add a new type of test suite that partitions the main tables defined in the datastore. Something we should do at some point, but it should probably be tackled in its own PR
  • generate a database state where two ctids are the same across partitions. I'm not sure how likely that is to happen without understanding how it's implemented. The docs say it's "the physical location of the row version within its table", so it does not sound randomly generated. We'd have to create a test with a row marked as deleted in one partition and a row that is alive in another partition, both with the same ctid.

My interpretation of the docs is that (tableoid, ctid) provides a globally unique internal identifier for a row.
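A hypothetical sketch (table and partition names invented, not from the PR) of how such a collision arises: the first row inserted into each partition typically lands at ctid (0,1), while (tableoid, ctid) stays distinct:

```sql
-- Sketch: ctid repeats across partitions; (tableoid, ctid) does not.
CREATE TABLE rt (id int, deleted_xid xid8) PARTITION BY RANGE (id);
CREATE TABLE rt_p1 PARTITION OF rt FOR VALUES FROM (0) TO (100);
CREATE TABLE rt_p2 PARTITION OF rt FOR VALUES FROM (100) TO (200);
INSERT INTO rt VALUES (1, '5'), (101, '5');
-- Both rows can report ctid (0,1), but tableoid::regclass differs:
SELECT tableoid::regclass, ctid, id FROM rt;
```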

Member

@josephschorr josephschorr left a comment

LGTM

@vroldanbet vroldanbet added this pull request to the merge queue Apr 10, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 10, 2024
@vroldanbet vroldanbet added this pull request to the merge queue Apr 10, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 10, 2024
@vroldanbet vroldanbet added this pull request to the merge queue Apr 10, 2024
Merged via the queue into main with commit 095ffd3 Apr 10, 2024
22 checks passed
@vroldanbet vroldanbet deleted the faster-postgres-gc branch April 10, 2024 17:00
@github-actions github-actions bot locked and limited conversation to collaborators Apr 10, 2024