Improving deletes performance by limiting number of records to scan #12007

mateusz-sekara · 2024-02-13T09:00:33Z

Motivation

We've noticed that Postgres has performance issues when running delete queries during reorgs. Although, the reorgs we noticed were very shallow (e.g. up to 10 blocks), Postgres wasn't able to come up with an efficient execution plan. It seems that the large number of logs in the table (~ 12kk in entire evm.logs), caused problems with estimating how many records will be fetched during that query.
Based on different experiments, applying an upper filter helps Postgres a lot. Besides that, we've also tested the brin index, but it didn't help in that case.

State of the environment during tests

=> select count(*) from evm.logs where evm_chain_id = 11155111;
 4377938
=> select count(*) from evm.log_poller_blocks where evm_chain_id = 11155111;
 100160

Deleting 100 logs from `evm.logs`

Lack of an upper limit impacts Postgres' query analyzer in a way in which it can't properly estimate the dataset size. Postgres decides to sequentially scan the entire index for specific evm_chain_id to gather matching logs for deletion. (upper bound limit for tests is just the latest (max) block_number from the db)

Before

=> begin;
explain analyze delete from evm.logs where evm_chain_id = 11155111 and block_number >= 5269551;
rollback;
BEGIN
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Delete on logs  (cost=0.00..855138.32 rows=2721241 width=6) (actual time=2645.049..2645.050 rows=0 loops=1)
   ->  Seq Scan on logs  (cost=0.00..855138.32 rows=2721241 width=6) (actual time=2638.169..2644.723 rows=87 loops=1)
         Filter: ((block_number >= 5269551) AND (evm_chain_id = '11155111'::numeric))
         Rows Removed by Filter: 12961268
 Planning Time: 0.114 ms
 Execution Time: 2645.075 ms
(6 rows)

After

=> begin;
explain analyze delete from evm.logs where evm_chain_id = 11155111 and block_number >= 5269551 and block_number <= 5269651;
rollback;
BEGIN
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Delete on logs  (cost=0.56..153.09 rows=41 width=6) (actual time=0.429..0.429 rows=0 loops=1)
   ->  Index Scan using evm_logs_idx on logs  (cost=0.56..153.09 rows=41 width=6) (actual time=0.049..0.109 rows=87 loops=1)
         Index Cond: ((evm_chain_id = '11155111'::numeric) AND (block_number >= 5269551) AND (block_number <= 5269651))
 Planning Time: 0.119 ms
 Execution Time: 0.450 ms
(5 rows)

After (nested query with upper bound)

begin;
explain analyze delete from evm.logs where evm_chain_id = 11155111 and block_number >= 5269551 and block_number <= ((SELECT MAX(block_number) FROM evm.logs WHERE evm_chain_id = 11155111));
rollback;
BEGIN
                                                                                QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Delete on logs  (cost=1295.57..75840.88 rows=21790 width=6) (actual time=0.482..0.483 rows=0 loops=1)
   InitPlan 2 (returns $1)
     ->  Result  (cost=1.18..1.19 rows=1 width=8) (actual time=0.056..0.056 rows=1 loops=1)
           InitPlan 1 (returns $0)
             ->  Limit  (cost=0.56..1.18 rows=1 width=8) (actual time=0.054..0.054 rows=1 loops=1)
                   ->  Index Only Scan Backward using evm_logs_idx on logs logs_1  (cost=0.56..2681558.66 rows=4358040 width=8) (actual time=0.053..0.053 rows=1 loops=1)
                         Index Cond: ((evm_chain_id = '11155111'::numeric) AND (block_number IS NOT NULL))
                         Heap Fetches: 1
   ->  Bitmap Heap Scan on logs  (cost=1294.38..75839.69 rows=21790 width=6) (actual time=0.125..0.158 rows=87 loops=1)
         Recheck Cond: ((evm_chain_id = '11155111'::numeric) AND (block_number >= 5269551) AND (block_number <= $1))
         Heap Blocks: exact=9
         ->  Bitmap Index Scan on evm_logs_idx  (cost=0.00..1288.93 rows=21790 width=0) (actual time=0.114..0.114 rows=87 loops=1)
               Index Cond: ((evm_chain_id = '11155111'::numeric) AND (block_number >= 5269551) AND (block_number <= $1))
 Planning Time: 0.311 ms
 Execution Time: 0.525 ms
(15 rows)

Deleting 100 blocks from `evm.log_poller_blocks`

There is no big difference for evm.log_poller_block, probably because of the smaller dataset compared to logs. However, based on the query plans, you might spot that with an upper limit, it's easier for Postgres to pick the proper index, and query plan is more efficient. Also, the estimated cost during planning is lower.

Before

=> begin;

explain analyze delete from evm.log_poller_blocks where evm_chain_id = 11155111 and block_number >= 5269551;

rollback;
BEGIN
                                                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Delete on log_poller_blocks  (cost=2417.10..18855.81 rows=94114 width=6) (actual time=0.436..0.436 rows=0 loops=1)
   ->  Bitmap Heap Scan on log_poller_blocks  (cost=2417.10..18855.81 rows=94114 width=6) (actual time=0.068..0.093 rows=101 loops=1)
         Recheck Cond: ((evm_chain_id = '11155111'::numeric) AND (block_number >= 5269551))
         Heap Blocks: exact=2
         ->  Bitmap Index Scan on idx_evm_log_poller_blocks_order_by_block  (cost=0.00..2393.57 rows=94114 width=0) (actual time=0.047..0.048 rows=101 loops=1)
               Index Cond: ((evm_chain_id = '11155111'::numeric) AND (block_number >= 5269551))
 Planning Time: 0.091 ms
 Execution Time: 0.463 ms
(8 rows)

After

=> begin;

explain analyze delete from evm.log_poller_blocks where evm_chain_id = 11155111 and block_number >= 5269551 and block_number <= 5269651;

rollback;
BEGIN
                                                                             QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Delete on log_poller_blocks  (cost=0.43..8.45 rows=1 width=6) (actual time=0.469..0.469 rows=0 loops=1)
   ->  Index Scan using idx_evm_log_poller_blocks_order_by_block on log_poller_blocks  (cost=0.43..8.45 rows=1 width=6) (actual time=0.061..0.095 rows=101 loops=1)
         Index Cond: ((evm_chain_id = '11155111'::numeric) AND (block_number >= 5269551) AND (block_number <= 5269651))
 Planning Time: 0.450 ms
 Execution Time: 0.499 ms
(5 rows)

After (nested query)

begin;
explain analyze delete from evm.log_poller_blocks where evm_chain_id = 11155111 and block_number >= 5269551 and block_number <= (select max(block_number) from evm.log_poller_blocks where evm_chain_id = 11155111);
rollback;

                                                                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Delete on log_poller_blocks  (cost=16.01..1764.06 rows=512 width=6) (actual time=0.507..0.508 rows=0 loops=1)
   InitPlan 2 (returns $1)
     ->  Result  (cost=1.04..1.05 rows=1 width=8) (actual time=0.065..0.066 rows=1 loops=1)
           InitPlan 1 (returns $0)
             ->  Limit  (cost=0.43..1.04 rows=1 width=8) (actual time=0.063..0.064 rows=1 loops=1)
                   ->  Index Only Scan using idx_evm_log_poller_blocks_order_by_block on log_poller_blocks log_poller_blocks_1  (cost=0.43..63122.37 rows=102319 width=8) (actual time=0.062..0.062 rows=1 loops=1)
                         Index Cond: ((evm_chain_id = '11155111'::numeric) AND (block_number IS NOT NULL))
                         Heap Fetches: 1
   ->  Bitmap Heap Scan on log_poller_blocks  (cost=14.96..1763.01 rows=512 width=6) (actual time=0.095..0.121 rows=101 loops=1)
         Recheck Cond: ((evm_chain_id = '11155111'::numeric) AND (block_number >= 5269551) AND (block_number <= $1))
         Heap Blocks: exact=2
         ->  Bitmap Index Scan on idx_evm_log_poller_blocks_order_by_block  (cost=0.00..14.83 rows=512 width=0) (actual time=0.087..0.087 rows=101 loops=1)
               Index Cond: ((evm_chain_id = '11155111'::numeric) AND (block_number >= 5269551) AND (block_number <= $1))
 Planning Time: 0.159 ms
 Execution Time: 0.546 ms
(15 rows)

github-actions · 2024-02-13T09:00:51Z

I see that you haven't updated any CHANGELOG files. Would it make sense to do so?

mateusz-sekara · 2024-02-13T09:51:50Z

core/chains/evm/logpoller/orm.go

 		if err != nil {
 			o.lggr.Warnw("Unable to clear reorged blocks, retrying", "err", err)
 			return err
 		}

-		_, err = tx.NamedExec(`DELETE FROM evm.logs WHERE block_number >= :start_block AND evm_chain_id = :evm_chain_id`, args)
+		_, err = tx.NamedExec(`DELETE FROM evm.logs 


Alternative approach would be to get latest block from logs in nested query (without explicit call to o.SelectLatestBlock(pg.WithQueryer(tx)))

DELETE FROM evm.logs WHERE evm_chain_id = :evm_chain_id AND block_number >= :start_block AND block_number <= (SELECT MAX(block_number) FROM evm.logs WHERE evm_chain_id = :evm_chain_id)

Seems to have similar performance, the only difference is that it's more defensive because it gets latest block from logs, instead of log_poller_block. This will handle all the cases when the block is not persisted for some reason (not sure how real this scenario is)
Previous implementation takes a block from evm.log_poller_blocks

=> begin; explain analyze DELETE FROM evm.logs WHERE evm_chain_id = 11155111 AND block_number >= 5269551 AND block_number <= (SELECT MAX(block_number) FROM evm.logs WHERE evm_chain_id = 11155111); rollback; BEGIN QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Delete on logs (cost=1295.57..75840.88 rows=21790 width=6) (actual time=0.425..0.426 rows=0 loops=1) InitPlan 2 (returns $1) -> Result (cost=1.18..1.19 rows=1 width=8) (actual time=0.031..0.032 rows=1 loops=1) InitPlan 1 (returns $0) -> Limit (cost=0.56..1.18 rows=1 width=8) (actual time=0.030..0.030 rows=1 loops=1) -> Index Only Scan Backward using evm_logs_idx on logs logs_1 (cost=0.56..2681558.66 rows=4358040 width=8) (actual time=0.029..0.029 rows=1 loops=1) Index Cond: ((evm_chain_id = '11155111'::numeric) AND (block_number IS NOT NULL)) Heap Fetches: 1 -> Bitmap Heap Scan on logs (cost=1294.38..75839.69 rows=21790 width=6) (actual time=0.072..0.106 rows=87 loops=1) Recheck Cond: ((evm_chain_id = '11155111'::numeric) AND (block_number >= 5269551) AND (block_number <= $1)) Heap Blocks: exact=9 -> Bitmap Index Scan on evm_logs_idx (cost=0.00..1288.93 rows=21790 width=0) (actual time=0.056..0.057 rows=87 loops=1) Index Cond: ((evm_chain_id = '11155111'::numeric) AND (block_number >= 5269551) AND (block_number <= $1)) Planning Time: 0.242 ms Execution Time: 0.460 ms (15 rows)

Yeah, I think this way is safer... and maybe even more efficient since it's only 1 network request instead of 2.

reductionista · 2024-02-13T19:40:34Z

core/chains/evm/logpoller/orm_test.go

@@ -1435,7 +1435,7 @@ func TestInsertLogsInTx(t *testing.T) {
 	for _, tt := range tests {
 		t.Run(tt.name, func(t *testing.T) {
 			// clean all logs and blocks between test cases
-			defer func() { _ = o.DeleteLogsAndBlocksAfter(0) }()
+			defer func() { _, _ = db.NamedExec("truncate evm.logs", struct{}{}) }()


db.Exec("truncate evm.logs") would be simpler

reductionista · 2024-02-13T19:58:08Z

I'm really surprised that adding the bounds made a difference in the plan, as the planner is supposed to consult the pg_stats table which includes a histogram of values in each table. In principle, it should know what the range of values is in the table... I can't argue with the EXPLAIN ANALYZE though, it clearly makes a difference somehow.

cl-sonarqube-production · 2024-02-14T08:17:29Z

Quality Gate passed

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
100.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube

…#12007) (#506) Please see more details here smartcontractkit/chainlink#12007

* develop: (74 commits) VRF zero confirmation delay (#11947) add toml configs to paths that can cause e2e tests to run in ci (#12001) bump golang.org/x/... (#12042) [chore] Replace clock with specialized lib (#12031) Update style guide (#12041) plugins/cmd/chainlink-mercury: (re)move to chainlink-data-streams repo (#11994) bump go-plugin (#12033) Adds timeout on fuzz script execution (#12024) Add bytes type to abi_type (#12029) AUTO-8804: create chain specific modules for l1 gas calculations (#11896) VRF-878 Gas Optimization V2 Plus (#11982) Improving deletes performance by limiting number of records to scan (#12007) core/web: improve health CLI readabilty (#12021) Handle a 0 exit code from the remote runner instead of always failing (#12015) Add a simple Codec test (#12006) Allow for custom config poller onchain codec (LLO support) (#11957) Update Sonar properties (#11986) golangci-lint: revive: add early-return; fix issues (#12017) Implement NewPluginProvider (EVM) (#11995) Fix lock file version and minor NPM bumps (#11980) ...

…12007) * Improving deletes performance by limiting number of records to scan * Post review fixes - replacing direct latestBlock call with nested query

mateusz-sekara temporarily deployed to sdlc February 13, 2024 09:00 — with GitHub Actions Inactive

mateusz-sekara force-pushed the lp-delete-query-fix branch from f60f1a0 to 00b6016 Compare February 13, 2024 09:01

mateusz-sekara temporarily deployed to sdlc February 13, 2024 09:01 — with GitHub Actions Inactive

mateusz-sekara commented Feb 13, 2024

View reviewed changes

mateusz-sekara force-pushed the lp-delete-query-fix branch from 00b6016 to a9ed8d6 Compare February 13, 2024 10:01

mateusz-sekara temporarily deployed to sdlc February 13, 2024 10:01 — with GitHub Actions Inactive

mateusz-sekara requested review from reductionista and connorwstein February 13, 2024 10:01

Improving deletes performance by limiting number of records to scan

43dd3c1

mateusz-sekara force-pushed the lp-delete-query-fix branch from a9ed8d6 to 43dd3c1 Compare February 13, 2024 10:13

mateusz-sekara temporarily deployed to sdlc February 13, 2024 10:13 — with GitHub Actions Inactive

mateusz-sekara marked this pull request as ready for review February 13, 2024 10:17

mateusz-sekara requested a review from a team as a code owner February 13, 2024 10:17

reductionista reviewed Feb 13, 2024

View reviewed changes

mateusz-sekara temporarily deployed to sdlc February 13, 2024 19:58 — with GitHub Actions Inactive

mateusz-sekara requested a review from reductionista February 13, 2024 19:58

Post review fixes - replacing direct latestBlock call with nested query

e4d83ac

mateusz-sekara force-pushed the lp-delete-query-fix branch from 35dd292 to e4d83ac Compare February 13, 2024 19:58

mateusz-sekara temporarily deployed to sdlc February 13, 2024 19:58 — with GitHub Actions Inactive

reductionista approved these changes Feb 13, 2024

View reviewed changes

mateusz-sekara enabled auto-merge February 14, 2024 08:03

mateusz-sekara disabled auto-merge February 14, 2024 08:03

Merge branch 'develop' into lp-delete-query-fix

727fada

mateusz-sekara temporarily deployed to sdlc February 14, 2024 08:03 — with GitHub Actions Inactive

mateusz-sekara enabled auto-merge February 14, 2024 08:04

mateusz-sekara added this pull request to the merge queue Feb 14, 2024

Merged via the queue into develop with commit a3bd1f7 Feb 14, 2024
93 checks passed

mateusz-sekara deleted the lp-delete-query-fix branch February 14, 2024 08:56

mateusz-sekara mentioned this pull request Feb 14, 2024

Improving deletes performance by limiting number of records to scan (#12007) smartcontractkit/ccip#506

Merged

mateusz-sekara added a commit to smartcontractkit/ccip that referenced this pull request Feb 14, 2024

Improving deletes performance by limiting number of records to scan (…

325f9a8

…#12007) (#506) Please see more details here smartcontractkit/chainlink#12007

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improving deletes performance by limiting number of records to scan #12007

Improving deletes performance by limiting number of records to scan #12007

mateusz-sekara commented Feb 13, 2024 •

edited

Loading

github-actions bot commented Feb 13, 2024

mateusz-sekara Feb 13, 2024 •

edited

Loading

reductionista Feb 13, 2024

reductionista Feb 13, 2024

reductionista commented Feb 13, 2024

cl-sonarqube-production bot commented Feb 14, 2024

Improving deletes performance by limiting number of records to scan #12007

Improving deletes performance by limiting number of records to scan #12007

Conversation

mateusz-sekara commented Feb 13, 2024 • edited Loading

Motivation

State of the environment during tests

Deleting 100 logs from evm.logs

Deleting 100 blocks from evm.log_poller_blocks

github-actions bot commented Feb 13, 2024

mateusz-sekara Feb 13, 2024 • edited Loading

Choose a reason for hiding this comment

reductionista Feb 13, 2024

Choose a reason for hiding this comment

reductionista Feb 13, 2024

Choose a reason for hiding this comment

reductionista commented Feb 13, 2024

cl-sonarqube-production bot commented Feb 14, 2024

Quality Gate passed

mateusz-sekara commented Feb 13, 2024 •

edited

Loading

Deleting 100 logs from `evm.logs`

Deleting 100 blocks from `evm.log_poller_blocks`

mateusz-sekara Feb 13, 2024 •

edited

Loading