
Make LogPoller more robust against local finality violations #12605

Merged (9 commits) Apr 26, 2024

Conversation

@reductionista (Contributor) commented Mar 26, 2024

The main focus of this PR is to help LogPoller detect and prevent "local finality violations". These are cases where the multinode layer fails over from one rpc server to another and the second one is behind in its view of the chain. This can make it look as though there is a finality violation even when no global finality violation occurred on the chain.

Primary change:

  • Check that all results from batchFetchBlocks() are finalized, aside from "latest".
    batchFetchBlocks() will now fetch the "finalized" block along with the rest of each batch,
    and validate that all of the block numbers returned (aside from the special case when "latest" is requested)
    are <= the finalized block number.
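
A minimal, self-contained sketch of the kind of check described above; the type and function names here (block, validateFinalized) are illustrative stand-ins rather than the actual LogPoller code:

```go
package main

import "fmt"

// block is a stand-in for the header data returned by a batch block fetch.
type block struct {
	Number int64
}

// validateFinalized rejects a batch if any returned block (other than one
// explicitly requested as "latest") is newer than the "finalized" block that
// was fetched as part of the same batch.
func validateFinalized(results []block, finalized block, latestRequested bool) error {
	for i, b := range results {
		// In this sketch the "latest" block, if requested, is the last entry.
		if latestRequested && i == len(results)-1 {
			continue
		}
		if b.Number > finalized.Number {
			return fmt.Errorf("block %d is above finalized block %d: possible local finality violation",
				b.Number, finalized.Number)
		}
	}
	return nil
}

func main() {
	finalized := block{Number: 100}
	results := []block{{Number: 98}, {Number: 99}, {Number: 105}}
	if err := validateFinalized(results, finalized, false); err != nil {
		fmt.Println("rejecting batch:", err) // the caller retries on the next poll
	}
}
```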

Also in this PR are a minor bug fix and two refactors for reducing code complexity:

  • Change backfill() to always save the last block of each batch of logs requested, rather than the last block of the logs returned.
    (This only makes a difference if the last block requested has no logs matching the filter, but it will allow us to eliminate
    lastSafeBlockNumber = latestFinalizedBlock - 1 in an upcoming PR in favor of latestFinalizedBlock, which simplifies the overall LogPoller
    implementation. It also gets us a step closer to being able to use "finalized" for the upper range of the final batch request for logs, but that
    comes with some additional complexities which still need to be worked out. See the sketch after this list.)

  • Start Backup LogPoller at lastProcessed.FinalizedBlockNumber instead of lastProcessed.FinalizedBlockNumber - 1. This was a harmless "bug" where it was starting one block too early, and is entirely separate from the previous change.

  • Minor refactor to remove code duplication (condensing 3 replicated code blocks calling getCurrentBlockMaybeHandleReorg down to 2).
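
Here is a small sketch contrasting the old and new behavior of the backfill() change from the first bullet above; logRecord and the two helper functions are hypothetical names used only for illustration:

```go
package main

import "fmt"

// logRecord is a stand-in for a log returned by a FilterLogs request.
type logRecord struct {
	BlockNumber int64
}

// oldLastBlockToSave mimics the previous behavior: save the block of the last
// log returned, which lags behind if the end of the range had no matching logs.
func oldLastBlockToSave(requestedTo int64, logs []logRecord) int64 {
	if len(logs) == 0 {
		return requestedTo
	}
	return logs[len(logs)-1].BlockNumber
}

// newLastBlockToSave mimics the new behavior: always save the last block of
// the requested range, regardless of where the last matching log was found.
func newLastBlockToSave(requestedTo int64, logs []logRecord) int64 {
	return requestedTo
}

func main() {
	logs := []logRecord{{BlockNumber: 41}, {BlockNumber: 45}}
	fmt.Println(oldLastBlockToSave(48, logs)) // 45: last block containing a matching log
	fmt.Println(newLastBlockToSave(48, logs)) // 48: last block of the requested batch
}
```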

github-actions bot (Contributor) commented Mar 26, 2024

I see you updated files related to core. Please run pnpm changeset in the root directory to add a changeset, and include at least one of the following tags in the text:

#nops : For any feature that is NOP facing and needs to be in the official Release Notes for the release.
#added : For any new functionality added.
#changed : For any change to the existing functionality.
#removed : For any functionality/config that is removed.
#updated : For any functionality that is updated.
#deprecation_notice : For any upcoming deprecation functionality.
#breaking_change : For any functionality that requires manual action for the node to boot.
#db_update : For any feature that introduces updates to database schema.
#wip : For any change that is not ready yet and external communication about it should be held off till it is feature complete.
#bugfix : For bug fixes.
#internal : For changesets that need to be excluded from the final changelog.

github-actions bot (Contributor) commented

I see you added a changeset file but it does not contain a tag. Please edit the text to include at least one of the tags listed above.


@mateusz-sekara (Collaborator) left a comment

I'm not sure I follow. How does validating whether blocks are finalized fix the issues with local finality violations? I think I'm missing the bigger picture here.

@reductionista (Contributor, Author) commented Apr 23, 2024

I'm not sure I follow. How does validating whether blocks are finalized fix the issues with local finality violations? I think I'm missing the bigger picture here.

Sorry, I should have provided more explanation on that in the PR description. The flow of PollAndSaveLogs() includes this sequence of calls:

  1. latestBlock, latestFinalizedBlockNumber, err := lp.latestBlocks(lp.ctx)
  2. currentBlock, err = lp.getCurrentBlockMaybeHandleReorg(ctx, currentBlockNumber, currentBlock)
  3. lastSafeBackfillBlock := latestFinalizedBlockNumber - 1
  4. lp.backfill(ctx, currentBlockNumber, lastSafeBackfillBlock)
    4a. gethLogs, err := lp.ec.FilterLogs(ctx, lp.Filter(big.NewInt(from), big.NewInt(to), nil))
    4b. blocks, err := lp.blocksFromLogs(ctx, gethLogs)
    4c. lp.orm.InsertLogsWithBlock(ctx, convertLogs(gethLogs, blocks, lp.lggr, lp.ec.ConfiguredChainID()), blocks[len(blocks)-1])

(omitting some inconsequential steps, these are the main ones)

The scenario where a local finality violation can cause problems is when the primary rpc server returns latestFinalizedBlockNumber in step 1, and then there is a failover to an out-of-sync rpc server somewhere between steps 1 and 4b. In step 4b, blocksFromLogs() calls GetBlocksRange() to fetch, in parallel, all of the blocks associated with the list of logs passed to it. If it's connected to the out-of-sync rpc server at this point, then some or all of the blocks it gets back may be unfinalized; they may not even be part of the canonical chain. If the last block gets saved to the db as a finalized block in step 4c but is actually not finalized, then the db is corrupt and the node can easily get stuck. For example, if the primary rpc server comes back up and the node starts receiving blocks from the canonical chain again, it may not be able to reconcile those blocks with the ones already in the db, which it trusts because it believes they are all "finalized".

The bottom line is, we only ever save finalized blocks to the db... the rpc failover scenario opened a loophole where we could save an unfinalized block for a block number we were previously told was finalized. This change mostly closes that loophole. (The only potential edge case is if the requests sent in a single batch get executed out of order, and one or more of the blocks we're requesting gets re-org'd out and then finalized while the batch is being processed. That seems very unlikely, but we have another fix coming soon at a lower level, the multinode layer, that should cover that scenario too.)

I was originally thinking mostly about fixing things in step 4a, where it requests the logs. If the failover happens between steps 1 and 4a, then it can get back unfinalized logs, which is also a problem. But this fix should take care of both that situation and the one where it fails over between steps 4a and 4b, because if blocksFromLogs returns an error then it won't proceed to step 4c, so neither the unfinalized logs nor the unfinalized block will get saved to the db. It will just retry everything next time around.
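
To make that concrete, here is a minimal Go sketch (hypothetical names, not the actual LogPoller code) of how an error from the block fetch in step 4b prevents step 4c from ever seeing unfinalized data:

```go
package main

import (
	"errors"
	"fmt"
)

var errUnfinalized = errors.New("batch contains a block above the finalized block")

// fetchBlocksForLogs stands in for step 4b (blocksFromLogs / batchFetchBlocks).
// It returns an error if the rpc it is connected to serves unfinalized blocks.
func fetchBlocksForLogs(failedOverToStaleRPC bool) ([]int64, error) {
	if failedOverToStaleRPC {
		return nil, errUnfinalized
	}
	return []int64{98, 99, 100}, nil
}

// persist stands in for step 4c (InsertLogsWithBlock); it only ever runs on
// blocks that passed the finality check.
func persist(blocks []int64) {
	fmt.Println("saving blocks:", blocks)
}

func main() {
	blocks, err := fetchBlocksForLogs(true)
	if err != nil {
		// Nothing is written to the db; the whole range is retried next poll.
		fmt.Println("skipping save, will retry next poll:", err)
		return
	}
	persist(blocks)
}
```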

getCurrentBlockMaybeHandleReorg is called just before the for loop over
unfinalized blocks begins, and at the end of each iteration. Simplify by
moving both calls to the beginning of the for loop.

This fixes 2 bugs in this test on the develop branch, and removes some
unused commented-out code.

First Bug
=========
The first bug was causing a false-positive PASS on the develop branch, which was
obscuring a (very minor) bug in BackupPoller that's been fixed in this PR.

The comment here about what the test was intended to test is still correct:

 // Only the 2nd batch + 1 log from a previous batch should be backfilled, because we perform backfill starting
 // from one block behind the latest finalized block

Contrary to the comment, the code was returning 2 logs from the 1st batch (Data=9 & Data=10), plus 9 of the 10 logs
from the 2nd batch. This was incorrect behavior, but the test was also checking for the same incorrect behavior
(looking for 11 logs with the first one being Data=9) instead of what's described in the comment.

The bug in the production code was that it starts the Backup Poller at Finalized - 1 instead of Finalized.
This is a harmless "bug", just unnecessarily starting a block too early, since there's no reason for backup
logpoller to re-request the same finalized logs that have already been processed.

Now, the code returns the last log from the 1st batch plus all but one of the logs
from the 2nd batch, which is correct. (It can't return the last log
because that goes beyond the last safe block.) So the test checks that
there are 10 logs, with the first one being Data=10 (the last log from the first
batch).

Second Bug
==========
The second bug was passing firstBatchBlock and secondBatchBlock directly
to markBlockAsFinalized() instead of passing firstBatchBlock - 1 and
secondBatchBlock - 1. This was only working because of a bug in the
version of geth we're currently using: when you request the pending
block from simulated geth, it gives you back the current block (one block
prior to pending) instead of the pending block. (For example, in the first case,
even though we wanted block 11, the latest current block, we requested
block 12 and got back block 11.) This has been fixed in the latest
version of geth... so presumably if we don't fix this here, the test
would start failing as soon as we upgrade to the latest version
of geth. It doesn't change any behavior of the test for the present
version of geth, it just makes it more clear that we want block 11, not 12.
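
A tiny sketch of the corrected call pattern, using the block numbers from the example above; markBlockAsFinalized here is just a stand-in for the test helper of the same name:

```go
package main

import "fmt"

// markBlockAsFinalized is a stand-in for the test helper of the same name.
func markBlockAsFinalized(blockNumber int64) {
	fmt.Println("marking block as finalized:", blockNumber)
}

func main() {
	firstBatchBlock := int64(12)
	// Old call, which relied on the simulated-geth off-by-one quirk:
	//   markBlockAsFinalized(firstBatchBlock)
	// New call, explicitly naming block 11, the block we actually want:
	markBlockAsFinalized(firstBatchBlock - 1)
}
```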
Check that all results from batchFetchBlocks() are finalized aside from "latest"

batchFetchBlocks() will now fetch the "finalized" block along with the
rest of each batch, and validate that all of the block numbers (aside from the
special case when "latest" is requested) are <= the finalized block number
returned.

Also, change backfill() to always save the last block of each batch of
logs requested, rather than the last block of the logs returned. This
only makes a difference if the last block requested has no logs matching
the filter, but this change is essential for being able to safely change
lastSafeBlockNumber from latestFinalizedBlock - 1 to latestFinalizedBlock
@@ -554,6 +599,77 @@ func Test_latestBlockAndFinalityDepth(t *testing.T) {
})
}

func Test_FetchBlocks(t *testing.T) {
Collaborator left a comment

👍
