[BUG] op-node parse L1 batch data error: frameLength is too large: 2838834252
#4937
This error looks wrong to me: the frameLength should never be that large. I think it hit a decoding error and read the wrong data as the frame length. In the verifier this path was recently changed (hence the different error message formatting) due to a finding from the Sherlock bug-bounty contest (#4867), which may be related. But the error path is still the same:
This verifier code path functions as expected: bad frames are dropped and skipped over. The batcher, however, needs to ensure it only outputs valid frames, or the frames will not be accepted. The maximum frame size is configurable:
and is based on the max L1 tx size by default: optimism/op-batcher/batcher/driver.go Line 92 in 3468c6a
which is smaller than the enforced maximum frame size: optimism/op-node/rollup/derive/frame.go Line 14 in 72f24f1
So assuming the settings are left at their defaults, this may indicate a bug (possibly related to, or a duplicate of, the Sherlock contest finding) in the frame encoding when buffering large amounts of data, which corrupts the output (resulting in the ridiculous 2.8+ GB frame-size number). cc @sebastianst flagging a possible op-batcher bug.
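For illustration, here is a minimal sketch of the kind of length check that produces this error. The constant value, field layout, and function name are assumptions for the sketch, not the actual op-node code:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// maxFrameLen mirrors the enforced upper bound referenced above; the exact
// value is an assumption here — see op-node/rollup/derive/frame.go for the
// real constant.
const maxFrameLen = 1_000_000

// readFrameLength sketches the length check: it reads a big-endian uint32 and
// rejects anything above the maximum. If earlier bytes were mis-parsed (e.g.
// extra version bytes at the start of the calldata), arbitrary data lands in
// this field, producing errors like the one in this issue.
func readFrameLength(r io.Reader) (uint32, error) {
	var buf [4]byte
	if _, err := io.ReadFull(r, buf[:]); err != nil {
		return 0, fmt.Errorf("reading frame length: %w", err)
	}
	length := binary.BigEndian.Uint32(buf[:])
	if length > maxFrameLen {
		return 0, fmt.Errorf("frameLength is too large: %d", length)
	}
	return length, nil
}

func main() {
	// Misaligned input: the parser reads four unrelated bytes as the length.
	_, err := readFrameLength(bytes.NewReader([]byte{0xA9, 0x35, 0x2C, 0x4C}))
	fmt.Println(err) // frameLength is too large: 2838834252
}
```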
Thanks @protolambda for the initial analysis! @nolanxyg, could you provide some more context?
Please also give this a try with a more recent version of the monorepo. A few changes have been made to the batcher and node since your specified commit 173181b.
Tracking at CLI-3428
@nolanxyg see my above message, would love to get more info so that we can reproduce 🙏🏻
Sorry 🙏🏻
OK, we will try the recent commit version to see if this issue is fixed |
@nolanxyg Thanks, some op-node and L2 config would also be appreciated. E.g. what's your L2 block time? Note that BSC seems to have a block time of 3 sec. With your batcher configuration, it will have a hard time catching up because of the low target of
Yesterday I added new channel duration limiting functionality to the batcher in #4990. I suggest you test your system with this new batcher and also change the config as suggested in that PR's description, but with a higher max channel duration and sub safety margin. It allows you to set a low channel duration so that channels are not kept open for too long (which gives a better UX at the expense of emptier channels during low L2 tx volume). This could be set to e.g.
Then also increase the target tx size to 100000 and use a realistic value for the approx. compression ratio (0.4). Note also that newer batchers pull the channel timeout from the rollup node, so you can delete that parameter (
tl;dr: Use the batcher from #4990 and try with the following config (convert to env var names)
This is similar to our e2e test config, but has a higher max channel duration and submission safety margin.
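For readers unfamiliar with the new option, a minimal sketch of what a max-channel-duration limit does, under the assumption that it is measured in L1 blocks; the names are illustrative, not the actual op-batcher fields:

```go
package main

import "fmt"

// shouldCloseChannel sketches the duration limit described above: a channel is
// force-closed once it has been open for maxChannelDuration L1 blocks, even if
// it is not yet full, so batches land on L1 with bounded delay during low L2
// tx volume.
func shouldCloseChannel(channelOpenedAtL1Block, currentL1Block, maxChannelDuration uint64) bool {
	return currentL1Block-channelOpenedAtL1Block >= maxChannelDuration
}

func main() {
	// With a max channel duration of 20 and a 3-second L1 block time, a
	// channel stays open for at most roughly one minute.
	fmt.Println(shouldCloseChannel(100, 121, 20)) // true: 21 blocks elapsed
}
```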
Cool, thanks a ton! @sebastianst 👍 Here's
For
Thanks for pointing
We have modified the
Quick feedback on your rollup config:
We'll get back regarding the
Looking forward to your feedback on the new batcher!
Hi @sebastianst, we have tried again with code version https://github.com/ethereum-optimism/optimism/releases/tag/op-proposer%2Fv1.0.0-rc.1, and the
Here's what we did:
No, payload-building timeouts should be generous. If the timeout is too strict, there's a chance the node gets stuck on a valid canonical block that just takes a little too much time to build. On L1, at 30M gas, we generally regard the max building time as around 2 seconds. A 2-second block time is already tight, and the tuned EIP-1559 parameters avoid repeated gas bursts from hitting this 2-second build time over and over, by targeting less gas on average and adjusting the target upwards exponentially. If you fall behind a little due to a large block, then with the right gas-adjustment parameters the next blocks will be lighter, take less time to build, and thus bring the L2 chain back in sync with the wall clock.
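As a rough illustration of the gas-targeting mechanism described above, here is a sketch of the standard EIP-1559 base-fee update rule (generic EIP-1559 with the canonical 1/8 change denominator, not the Optimism-specific parameters):

```go
package main

import "fmt"

// nextBaseFee sketches the standard EIP-1559 update: blocks that use more gas
// than the target push the base fee up (compounding over consecutive full
// blocks), while lighter blocks pull it back down, so a burst of heavy blocks
// is followed by cheaper, faster-to-build blocks that let the chain catch up.
func nextBaseFee(baseFee, gasUsed, gasTarget uint64) uint64 {
	const changeDenominator = 8 // canonical EIP-1559: max ~12.5% change per block
	switch {
	case gasUsed == gasTarget:
		return baseFee
	case gasUsed > gasTarget:
		delta := baseFee * (gasUsed - gasTarget) / gasTarget / changeDenominator
		if delta < 1 {
			delta = 1
		}
		return baseFee + delta
	default:
		delta := baseFee * (gasTarget - gasUsed) / gasTarget / changeDenominator
		return baseFee - delta
	}
}

func main() {
	// A completely full block (2x the target) raises the base fee by 12.5%.
	fmt.Println(nextBaseFee(1_000_000_000, 30_000_000, 15_000_000)) // 1125000000
}
```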
OP_BATCHER_MAX_CHANNEL_DURATION: 20 # 1min, adjust if you want longer
OP_BATCHER_MAX_L1_TX_SIZE_BYTES: 120000
OP_BATCHER_TARGET_L1_TX_SIZE_BYTES: 100000
OP_BATCHER_TARGET_NUM_FRAMES: 1
OP_BATCHER_APPROX_COMPR_RATIO: 0.6 # note, updated from 0.4
OP_BATCHER_SUB_SAFETY_MARGIN: 20
The high target and max tx size make sure that the batcher sends full frames when it's catching up after being offline for some time.
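A back-of-the-envelope check on these numbers, assuming the batcher buffers roughly target frames × target tx size ÷ compression ratio of input data before closing a channel (this formula is an assumption based on the thread, not a quote of the batcher code):

```go
package main

import "fmt"

func main() {
	// Rough estimate of how many bytes of uncompressed batch data the batcher
	// aims to buffer per channel with the config above (assumed formula).
	targetNumFrames := 1
	targetL1TxSizeBytes := 100_000
	approxComprRatio := 0.6

	inputThreshold := float64(targetNumFrames*targetL1TxSizeBytes) / approxComprRatio
	fmt.Printf("approx. input threshold: %.0f bytes (~%.0f KB)\n",
		inputThreshold, inputThreshold/1000)
	// Roughly 167 KB of uncompressed data, which at the assumed 0.6 ratio
	// compresses to about the 100 KB target, under the 120 KB max L1 tx size.
}
```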
@nolanxyg I looked at your logs screenshot. It's conspicuous that the error at 10:43 mentions a frame number of 14, because with your config the batcher should actually only send txs whose data includes 1-2 frames (
I'm curious whether you still get errors with the #4990 batcher. Also, could you enable batcher debug logging (
Nice explanation, understood, thanks! @protolambda 👍 We set |
OK, we'll try this and give feedback soon
Sure
I might want to reproduce this locally. Can you estimate the minimal steps to reproduce this? If I spin up a new network, but don't start the batcher yet, produce a good amount of L2 tx volume, then spin up the batcher after a few hours, would that already trigger it?
To match our steps exactly, I think you could spin up a network and check that everything works properly (the batcher working fine as well, but its balance should be low), produce a good amount of L2 tx volume so the batcher soon runs out of funds, then after a few hours (e.g. overnight) fund the batcher to get it back to work, and finally check whether the network recovered and whether the
Because I suspect that the interruption of submit-tx sending (due to insufficient funds) causes the error
@sebastianst I think this PR fixes the bug, please take a look
To reproduce: change the code in txmgr.go to always return an error from SendTransaction() and print the data in nextTxData; you will find the version byte is repeated over and over.
Thanks! That should fix it indeed! I'll also take a closer look at whether it makes sense to clean this up with a small refactor, because this prepending and appending of version zeros is quite error prone 😅
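For readers following along, a minimal sketch of the failure mode described above, with hypothetical names (the real code lives in op-batcher's tx management): prepending the version byte into shared state on every send retry corrupts the calldata, while an idempotent builder does not:

```go
package main

import "fmt"

const derivationVersion0 = 0 // version byte prepended to batcher tx data

// Buggy variant: the version byte is prepended to the stored frames each time
// the calldata is (re)built, e.g. on every SendTransaction retry. After a few
// retries the calldata starts with several 0x00 bytes, and the verifier then
// reads unrelated bytes as the frame length (hence the huge frameLength error).
type txDataBuggy struct{ frames []byte }

func (t *txDataBuggy) Bytes() []byte {
	t.frames = append([]byte{derivationVersion0}, t.frames...) // mutates state!
	return t.frames
}

// Fixed variant: building the calldata is idempotent; the stored frames are
// never modified, and the version byte is added to a fresh slice on each call.
type txDataFixed struct{ frames []byte }

func (t *txDataFixed) Bytes() []byte {
	out := make([]byte, 0, 1+len(t.frames))
	out = append(out, derivationVersion0)
	return append(out, t.frames...)
}

func main() {
	buggy := &txDataBuggy{frames: []byte{0xAA, 0xBB}}
	for i := 0; i < 3; i++ { // simulate three send retries
		_ = buggy.Bytes()
	}
	fmt.Printf("buggy calldata after retries: %x\n", buggy.Bytes()) // 00000000aabb
	fixed := &txDataFixed{frames: []byte{0xAA, 0xBB}}
	fmt.Printf("fixed calldata (idempotent):  %x\n", fixed.Bytes()) // 00aabb
}
```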
Describe the bug
The op-node encounters an unrecoverable error when parsing batch data submitted by the batcher:
frameLength is too large: 2838834252
which makes the op-node unable to reconcile with L1.
To Reproduce
Steps to reproduce the behavior:
Feb 22, 2023 @ 11:32:26.580 WARN [02-22|03:32:26.580] Failed to parse frames origin=f4c844..525ff1:1589279 err="frameLength is too large: 2838834252"
Expected behavior
The op-node should keep reconciling with L1.
Screenshots
System Specs:
Additional context