
Make LogPoller more robust against local finality violations #12605

Merged (9 commits) Apr 26, 2024

Conversation

@reductionista (Contributor) commented Mar 26, 2024

The main focus of this PR is to help LogPoller detect and prevent "local finality violations". These are cases where the multinode layer fails over from one rpc server to another and the second one is behind in its view of the chain. This can make it look as though there is a finality violation even when no global finality violation occurred on the chain.

Primary change:

  • Check that all results from batchFetchBlocks() are finalized, aside from "latest".
    batchFetchBlocks() will now fetch the "finalized" block along with the rest of each batch,
    and validate that all of the block numbers returned (aside from the special case when "latest" is requested)
    are <= the finalized block number.
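
A minimal, self-contained sketch of the kind of check described above; the type and function names here (block, validateFinalized) are illustrative stand-ins rather than the actual LogPoller code:

```go
package main

import "fmt"

// block is a stand-in for the header data returned by a batch block fetch.
type block struct {
	Number int64
}

// validateFinalized rejects a batch if any returned block (other than one
// explicitly requested as "latest") is newer than the "finalized" block that
// was fetched as part of the same batch.
func validateFinalized(results []block, finalized block, latestRequested bool) error {
	for i, b := range results {
		// In this sketch the "latest" block, if requested, is the last entry.
		if latestRequested && i == len(results)-1 {
			continue
		}
		if b.Number > finalized.Number {
			return fmt.Errorf("block %d is above finalized block %d: possible local finality violation",
				b.Number, finalized.Number)
		}
	}
	return nil
}

func main() {
	finalized := block{Number: 100}
	results := []block{{Number: 98}, {Number: 99}, {Number: 105}}
	if err := validateFinalized(results, finalized, false); err != nil {
		fmt.Println("rejecting batch:", err) // the caller retries on the next poll
	}
}
```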

Also in this PR are a minor bug fix and two refactors for reducing code complexity:

  • Change backfill() to always save the last block of each batch of logs requested, rather than the last block of the logs returned.
    (This only makes a difference if the last block requested has no logs matching the filter, but it will allow us to eliminate
    lastSafeBlockNumber = latestFinalizedBlock - 1 in an upcoming PR in favor of latestFinalizedBlock, which simplifies the overall LogPoller
    implementation. It also gets us a step closer to being able to use "finalized" for the upper range of the final batch request for logs, but that
    comes with some additional complexities which still need to be worked out. See the sketch after this list.)

  • Start Backup LogPoller at lastProcessed.FinalizedBlockNumber instead of lastProcessed.FinalizedBlockNumber - 1. This was a harmless "bug" where it was starting one block too early, and is entirely separate from the previous change.

  • Minor refactor to remove code duplication (condensing 3 replicated code blocks calling getCurrentBlockMaybeHandleReorg down to 2).
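
Here is a small sketch contrasting the old and new behavior of the backfill() change from the first bullet above; logRecord and the two helper functions are hypothetical names used only for illustration:

```go
package main

import "fmt"

// logRecord is a stand-in for a log returned by a FilterLogs request.
type logRecord struct {
	BlockNumber int64
}

// oldLastBlockToSave mimics the previous behavior: save the block of the last
// log returned, which lags behind if the end of the range had no matching logs.
func oldLastBlockToSave(requestedTo int64, logs []logRecord) int64 {
	if len(logs) == 0 {
		return requestedTo
	}
	return logs[len(logs)-1].BlockNumber
}

// newLastBlockToSave mimics the new behavior: always save the last block of
// the requested range, regardless of where the last matching log was found.
func newLastBlockToSave(requestedTo int64, logs []logRecord) int64 {
	return requestedTo
}

func main() {
	logs := []logRecord{{BlockNumber: 41}, {BlockNumber: 45}}
	fmt.Println(oldLastBlockToSave(48, logs)) // 45: last block containing a matching log
	fmt.Println(newLastBlockToSave(48, logs)) // 48: last block of the requested batch
}
```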

github-actions bot (Contributor) commented Mar 26, 2024

I see you updated files related to core. Please run pnpm changeset in the root directory to add a changeset, and include at least one of the following tags in the text:

#nops : For any feature that is NOP facing and needs to be in the official Release Notes for the release.
#added : For any new functionality added.
#changed : For any change to the existing functionality.
#removed : For any functionality/config that is removed.
#updated : For any functionality that is updated.
#deprecation_notice : For any upcoming deprecation functionality.
#breaking_change : For any functionality that requires manual action for the node to boot.
#db_update : For any feature that introduces updates to database schema.
#wip : For any change that is not ready yet and external communication about it should be held off till it is feature complete.
#bugfix : For bug fixes.
#internal : For changesets that need to be excluded from the final changelog.

github-actions bot (Contributor) commented

I see you added a changeset file but it does not contain a tag. Please edit the text to include at least one of the tags listed above.


@mateusz-sekara (Collaborator) left a comment

I'm not sure I follow. How does validating whether blocks are finalized fix the issues with local finality violations? I think I'm missing the bigger picture here.

@reductionista (Contributor, Author) commented Apr 23, 2024

I'm not sure I follow. How does validating whether blocks are finalized fix the issues with local finality violations? I think I'm missing the bigger picture here.

Sorry, I should have provided more explanation on that in the PR description. The flow of PollAndSaveLogs() includes this sequence of calls:

  1. latestBlock, latestFinalizedBlockNumber, err := lp.latestBlocks(lp.ctx)
  2. currentBlock, err = lp.getCurrentBlockMaybeHandleReorg(ctx, currentBlockNumber, currentBlock)
  3. lastSafeBackfillBlock := latestFinalizedBlockNumber - 1
  4. lp.backfill(ctx, currentBlockNumber, lastSafeBackfillBlock)
    4a. gethLogs, err := lp.ec.FilterLogs(ctx, lp.Filter(big.NewInt(from), big.NewInt(to), nil))
    4b. blocks, err := lp.blocksFromLogs(ctx, gethLogs)
    4c. lp.orm.InsertLogsWithBlock(ctx, convertLogs(gethLogs, blocks, lp.lggr, lp.ec.ConfiguredChainID()), blocks[len(blocks)-1])

(omitting some inconsequential steps, these are the main ones)

The scenario where a local finality violation can cause problems is when the primary rpc server returns latestFinalizedBlockNumber in step 1, and then there is a failover to an out-of-sync rpc server somewhere between steps 1 and 4b. In step 4b, blocksFromLogs() calls GetBlocksRange() to fetch, in parallel, all of the blocks associated with the list of logs passed to it. If it's connected to the out-of-sync rpc server at this point, then some or all of the blocks it gets back may be unfinalized; they may not even be part of the canonical chain. If the last block gets saved to the db as a finalized block in step 4c but is actually not finalized, then the db is corrupt and the node can easily get stuck. For example, if the primary rpc server comes back up and the node starts receiving blocks from the canonical chain again, it may not be able to reconcile those blocks with the ones already in the db, which it trusts because it believes they are all "finalized".

The bottom line is, we only ever save finalized blocks to the db... the rpc failover scenario opened a loophole where we could save an unfinalized block for a block number we were previously told was finalized. This change mostly closes that loophole. (The only potential edge case is if the requests sent in a single batch get executed out of order, and one or more of the blocks we're requesting gets re-org'd out and then finalized while the batch is being processed. That seems very unlikely, but we have another fix coming soon at a lower level, the multinode layer, that should cover that scenario too.)

I was originally thinking mostly about fixing things in step 4a, where it requests the logs. If the failover happens between steps 1 and 4a, then it can get back unfinalized logs, which is also a problem. But this fix should take care of both that situation and the one where it fails over between steps 4a and 4b, because if blocksFromLogs returns an error then it won't proceed to step 4c, so neither the unfinalized logs nor the unfinalized block will get saved to the db. It will just retry everything next time around.
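
To make that concrete, here is a minimal Go sketch (hypothetical names, not the actual LogPoller code) of how an error from the block fetch in step 4b prevents step 4c from ever seeing unfinalized data:

```go
package main

import (
	"errors"
	"fmt"
)

var errUnfinalized = errors.New("batch contains a block above the finalized block")

// fetchBlocksForLogs stands in for step 4b (blocksFromLogs / batchFetchBlocks).
// It returns an error if the rpc it is connected to serves unfinalized blocks.
func fetchBlocksForLogs(failedOverToStaleRPC bool) ([]int64, error) {
	if failedOverToStaleRPC {
		return nil, errUnfinalized
	}
	return []int64{98, 99, 100}, nil
}

// persist stands in for step 4c (InsertLogsWithBlock); it only ever runs on
// blocks that passed the finality check.
func persist(blocks []int64) {
	fmt.Println("saving blocks:", blocks)
}

func main() {
	blocks, err := fetchBlocksForLogs(true)
	if err != nil {
		// Nothing is written to the db; the whole range is retried next poll.
		fmt.Println("skipping save, will retry next poll:", err)
		return
	}
	persist(blocks)
}
```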

getCurrentBlockMaybeHandleReorg is called just before the for loop over
unfinalized blocks begins, and at the end of each iteration. Simplify by
moving both calls to the beginning of the for loop.

This fixes 2 bugs in this test on the develop branch, and removes some
unused commented-out code.

First Bug
=========
The first bug was causing a false-positive PASS on the develop branch, which was
obscuring a (very minor) bug in BackupPoller that's been fixed in this PR.

The comment here about what the test was intended to test is still correct:

 // Only the 2nd batch + 1 log from a previous batch should be backfilled, because we perform backfill starting
 // from one block behind the latest finalized block

Contrary to the comment, the code was returning 2 logs from the 1st batch (Data=9 & Data=10), plus 9 of the 10 logs
from the 2nd batch. This was incorrect behavior, but the test was also checking for the same incorrect behavior
(looking for 11 logs with the first one being Data=9) instead of what's described in the comment.

The bug in the production code was that it starts the Backup Poller at Finalized - 1 instead of Finalized.
This is a harmless "bug", just unnecessarily starting a block too early, since there's no reason for backup
logpoller to re-request the same finalized logs that have already been processed.

Now, the code returns the last log from the 1st batch plus all but one of the logs
from the 2nd batch, which is correct. (It can't return the last log
because that goes beyond the last safe block.) So the test checks that
there are 10 logs, with the first one being Data=10 (the last log from the first
batch).

Second Bug
==========
The second bug was passing firstBatchBlock and secondBatchBlock directly
to markBlockAsFinalized() instead of passing firstBatchBlock - 1 and
secondBatchBlock - 1. This was only working because of a bug in the
version of geth we're currently using: when you request the pending
block from simulated geth, it gives you back the current block (one block
prior to pending) instead of the pending block. (For example, in the first case,
even though we wanted block 11, the latest current block, we requested
block 12 and got back block 11.) This has been fixed in the latest
version of geth... so presumably if we don't fix this here, the test
would start failing as soon as we upgrade to the latest version
of geth. It doesn't change any behavior of the test for the present
version of geth, it just makes it more clear that we want block 11, not 12.
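
A tiny sketch of the corrected call pattern, using the block numbers from the example above; markBlockAsFinalized here is just a stand-in for the test helper of the same name:

```go
package main

import "fmt"

// markBlockAsFinalized is a stand-in for the test helper of the same name.
func markBlockAsFinalized(blockNumber int64) {
	fmt.Println("marking block as finalized:", blockNumber)
}

func main() {
	firstBatchBlock := int64(12)
	// Old call, which relied on the simulated-geth off-by-one quirk:
	//   markBlockAsFinalized(firstBatchBlock)
	// New call, explicitly naming block 11, the block we actually want:
	markBlockAsFinalized(firstBatchBlock - 1)
}
```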
Check that all results from batchFetchBlocks() are finalized aside from "latest"

batchFetchBlocks() will now fetch the "finalized" block along with the
rest of each batch, and validate that all of the block numbers (aside from the
special case when "latest" is requested) are <= the finalized block number
returned.

Also, change backfill() to always save the last block of each batch of
logs requested, rather than the last block of the logs returned. This
only makes a difference if the last block requested has no logs matching
the filter, but this change is essential for being able to safely change
lastSafeBlockNumber from latestFinalizedBlock - 1 to latestFinalizedBlock
@@ -554,6 +599,77 @@ func Test_latestBlockAndFinalityDepth(t *testing.T) {
})
}

func Test_FetchBlocks(t *testing.T) {
Collaborator left a comment

👍
