[BUG] op-node parse L1 batch data error: frameLength is too large: 2838834252 #4937

Closed
nolanxyg opened this issue Feb 22, 2023 · 22 comments · Fixed by #5050

@nolanxyg
Contributor

Describe the bug

Op-node parsing of the batch data submitted by the batcher hits an unrecoverable error: frameLength is too large: 2838834252. This leaves op-node unable to keep reconciling with L1.

To Reproduce
Steps to reproduce the behavior:

  1. stop batcher for several hours
  2. start batcher
  3. op-node prints an error log: Feb 22, 2023 @ 11:32:26.580 WARN [02-22|03:32:26.580] Failed to parse frames origin=f4c844..525ff1:1589279 err="frameLength is too large: 2838834252"

Expected behavior

Op-node should keep reconciling with L1.

System Specs:

  • OS: Ubuntu
  • Package Version (or commit hash): 173181b

@protolambda
Contributor

This error looks wrong to me: the frameLength should never be that large. I think it hit a decoding error, and read the wrong data as frame-length.

In the verifier this was recently changed (hence also the different error-message formatting) due to a finding from the Sherlock bug-bounty contest; see #4867, which may be related.

But the error path is still the same:

This verifier code-path here functions as expected: bad frames are dropped and skipped over.

The batcher however needs to ensure it only outputs valid frames, or the frames will not be accepted.

The maximum frame size is configurable:

And based on the max L1 tx size by default:

MaxFrameSize: cfg.MaxL1TxSize - 1, // subtract 1 byte for version

Which is smaller than the enforced frame size:

const MaxFrameLen = 1_000_000

So assuming the settings are left at their defaults, this may indicate a bug in the frame encoding (possibly related to, or a duplicate of, the Sherlock contest finding) that corrupts the output when buffering large amounts of data, resulting in the absurd 2.8+ GB frame-size number.
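For reference, the verifier-side rejection mentioned above amounts to a bounds check on the decoded length before anything is allocated. A minimal sketch of such a guard, assuming the big-endian uint32 length field of the frame format (the function name and error wording are illustrative, not the actual op-node code):

package derive

import (
	"encoding/binary"
	"fmt"
	"io"
)

// MaxFrameLen mirrors the enforced limit quoted above.
const MaxFrameLen = 1_000_000

// readFrameLength decodes the length field and refuses absurd values
// (e.g. 2838834252) before trying to read that many bytes.
func readFrameLength(r io.Reader) (uint32, error) {
	var buf [4]byte
	if _, err := io.ReadFull(r, buf[:]); err != nil {
		return 0, err
	}
	frameLength := binary.BigEndian.Uint32(buf[:])
	if frameLength > MaxFrameLen {
		return 0, fmt.Errorf("frameLength is too large: %d", frameLength)
	}
	return frameLength, nil
}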

cc @sebastianst flagging possible op-batcher bug.

@sebastianst
Member

Thanks @protolambda for the initial analysis!

The frameLength of 2838834252 looks suspicious to me because it has the highest uint32 bit set: I think it never was a frameLength to begin with, but rather random data that got decoded as a frame length. The unmarshaling code doesn't do any sanity check before this first check on the max frame length. I bet the frame number and channel ID decoded before it are also random.
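To make that concrete: 2838834252 is 0xA9352C4C, so its top bit is set, which no real frame length (capped at 1,000,000) would ever have. A quick standalone check (illustrative only, not op-node code):

package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	// Four arbitrary bytes read as a big-endian uint32 give the reported "frame length".
	raw := []byte{0xA9, 0x35, 0x2C, 0x4C}
	v := binary.BigEndian.Uint32(raw)
	fmt.Println(v)              // 2838834252
	fmt.Println(v&(1<<31) != 0) // true: the highest uint32 bit is set
}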

@nolanxyg could you provide some more context?

  • What op-batcher and op-node configuration do you use?
    • In particular, what channel configuration (target and max tx size, approx. compr. ratio)
  • What's your L1 here?
  • Is there any tx volume happening during the hours that your batcher is offline?

Please also give this a try with a more recent version of the monorepo. A few changes have been made to the batcher and node since your specified commit 173181b.

@sebastianst
Member

Tracking at CLI-3428

@sebastianst
Member

@nolanxyg see my above message, would love to get more info so that we can reproduce 🙏🏻

@nolanxyg
Contributor Author

Sorry 🙏🏻
@s7v7nislands please help provide more context, thanks

@nolanxyg
Contributor Author

@nolanxyg could you provide some more context?

  • What op-batcher and op-node configuration do you use?

    • In particular, what channel configuration (target and max tx size, approx. compr. ratio)
  • What's your L1 here?

  • Is there any tx volume happening during the hours that your batcher is offline?

Please also give this a try with a more recent version of the monorepo. A few changes have been made to the batcher and node since your specified commit 173181b.

image

  • BSC
  • Yes, we were doing a load test

OK, we will try a more recent commit to see if this issue is fixed.

@sebastianst
Member

sebastianst commented Feb 27, 2023

@nolanxyg Thanks, some op-node and L2 config would also be appreciated. E.g. what's your L2 block time? Note BSC seems to have a block time of 3 sec.

With your batcher configuration, it will have a hard time catching up because of the low target of 624 bytes. That value is/was set in our tests to force eager batch generation. But in realistic scenarios with high L2 volume, it's too low because each channel will only contain ~624 bytes of input data (and even less output data, because it gets compressed) and batcher txs are sent sequentially.

Yesterday I added a new channel duration limiting functionality to the batcher at #4990

I suggest you test your system with this new batcher and also change the config as suggested in this PR's description, but with higher max channel duration and sub safety margin. It allows you to set a low channel duration so that channels are not kept open for too long (which gives a better UX at the expense of emptier channels during low L2 tx volume). This could be set to e.g. MaxChannelDuration = 40 in a realistic setting to ensure that single-frame channels are submitted to L1 after at most 2 minutes (40 * 3 sec (L1 block time) = 2 min). But I don't know your requirements so whether you want to use channel duration limiting at all depends on your use case. My guess is you want to enable it, especially in your testnet.

Then also increase the target tx size to 100000 and use a realistic value for the approx. compression ratio (0.4). Note also that newer batchers pull the channel timeout from the rollup node, so you can delete that parameter (OP_BATCHER_CHANNEL_TIMEOUT).

tl;dr

Use batcher from #4990 and try with the following config (convert to env var names)

		MaxChannelDuration:        40, // 40 * 3 sec (L1 block time) = 2 min
		MaxL1TxSize:               120_000,
		TargetL1TxSize:            100_000,
		TargetNumFrames:           1,
		ApproxComprRatio:          0.4,
		SubSafetyMargin:           20,

This is similar to our e2e test config, but has a higher max channel duration and submission safety margin.
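For convenience, the same values rendered as env vars, using the OP_BATCHER_* names that appear later in this thread (a sketch; double-check the flag names against your batcher version):

    OP_BATCHER_MAX_CHANNEL_DURATION: 40 # 40 * 3 sec (L1 block time) = 2 min
    OP_BATCHER_MAX_L1_TX_SIZE_BYTES: 120000
    OP_BATCHER_TARGET_L1_TX_SIZE_BYTES: 100000
    OP_BATCHER_TARGET_NUM_FRAMES: 1
    OP_BATCHER_APPROX_COMPR_RATIO: 0.4
    OP_BATCHER_SUB_SAFETY_MARGIN: 20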

@nolanxyg
Contributor Author

Cool, Thanks a ton! @sebastianst 👍

Here's rollup.json:

{
  "genesis": {
    "l1": {
      "hash": "0x32a66f95f9e04e3eb3bc71b4b02b26ff1280ce8132aa0d58e9cd9f5bcfe6f2e3",
      "number": 1740287
    },
    "l2": {
      "hash": "0x74207c6f3ede9fa5e0db10aee7a3e13335978fbf07ef816b4c0bb964133ac0fa",
      "number": 0
    },
    "l2_time": 1677489770,
    "system_config": {
      "batcherAddr": "0xd532d4705df2f6422b02aba6a55955f21aab2294",
      "overhead": "0x0000000000000000000000000000000000000000000000000000000000000834",
      "scalar": "0x00000000000000000000000000000000000000000000000000000000000f4240",
      "gasLimit": 100000000
    }
  },
  "block_time": 1,
  "max_sequencer_drift": 300,
  "seq_window_size": 16800,
  "channel_timeout": 40,
  "l1_chain_id": 714,
  "l2_chain_id": 716,
  "batch_inbox_address": "0xff00000000000000000000000000000000000000",
  "deposit_contract_address": "0xe4562148368da2e2f94b2c590c17b550c79eb5c6",
  "l1_system_config_address": "0x9609d0e7084e59f9b779f9976e9c07c039f8a026"
}

For geth-config.toml, I ran geth dumpconfig but got an error, the same as ethereum/go-ethereum#26328, so I can only show the start command:

geth \
                  --datadir="/db" \
                  --verbosity=3 \
                  --http \
                  --http.corsdomain="*" \
                  --http.vhosts="*" \
                  --http.addr=0.0.0.0 \
                  --http.port=8545 \
                  --http.api=eth,engine \
                  --ws \
                  --ws.addr=0.0.0.0 \
                  --ws.port=8546 \
                  --ws.origins="*" \
                  --ws.api=eth,engine \
                  --syncmode=full \
                  --nodiscover \
                  --maxpeers=1 \
                  --networkid=$(CHAIN_ID) \
                  --unlock="$(BLOCK_SIGNER_ADDRESS)" \
                  --mine \
                  --miner.etherbase="$(BLOCK_SIGNER_ADDRESS)" \
                  --miner.gaslimit=25000000 \
                  --password=/db/password \
                  --allow-insecure-unlock \
                  --authrpc.addr="0.0.0.0" \
                  --authrpc.port="8551" \
                  --authrpc.vhosts="*" \
                  --authrpc.jwtsecret=/config/jwt-secret.txt \
                  --gcmode=archive \
                  --verbosity=4 \
                  --pprof \
                  --metrics \
                  --metrics.port 6060 \
                  --metrics.addr 0.0.0.0

Thanks for pointing out the 624 bytes! We also found that it's too small, and we'll optimize the parameters according to your advice. Really appreciated!

We have modified block_time to 1s, and noticed that op-geth's default NewPayloadTimeout value (2s) may be inconsistent with that. Should we modify NewPayloadTimeout to 1s as well?

@sebastianst
Member

sebastianst commented Feb 28, 2023

Quick feedback on your rollup config:

  • a channel timeout of 40 (2 min in your case) seems rather low. It's the value I proposed setting as max channel duration in the batcher, but as a consensus variable I'd set this higher, e.g. at least to 120 (6 min ).
  • why the very high seq window size of 16800?

We'll get back regarding the NewPayloadTimeout soon.

Looking forward to your feedback on the new batcher!

@nolanxyg
Contributor Author

nolanxyg commented Mar 1, 2023

Please also give this a try with a more recent version of the monorepo. A few changes have been made to the batcher and node since your specified commit 173181b.

Hi @sebastianst, we have tried again with code version https://github.com/ethereum-optimism/optimism/releases/tag/op-proposer%2Fv1.0.0-rc.1, and the frameLength is too large error came up again.

Here's what we did:

  1. Set up the whole network (configured with the info mentioned above)
  2. Stop the batcher for several hours (or, in our case, the batcher ran out of money, as shown in the monitoring screenshot below)

image

  3. Start batcher (or fund the batcher, in our case at 03-01 10:43; the batcher soon ran out of money again, so we funded it again at 03-01 11:29)
  4. Check the op-node log: we found the frameLength is too large error at about 03-01 10:43 and 03-01 11:29, which is when the batcher returned to work

image

@nolanxyg
Contributor Author

nolanxyg commented Mar 1, 2023

Quick feedback on your rollup config:

  • a channel timeout of 40 (2 min in your case) seems rather low. It's the value I proposed setting as max channel duration in the batcher, but as a consensus variable I'd set this higher, e.g. at least to 120 (6 min ).
  • why the very high seq window size of 16800?

We'll get back regarding the NewPayloadTimeout soon.

Looking forward to your feedback on the new batcher!

  • Got it. How about setting OP_BATCHER_CHANNEL_TIMEOUT to 40*12/3 and channel_timeout in rollup.json to 120*12/3? (Conversion worked out below.)
    • 12 is the Ethereum block time, and 3 is the BSC block time
  • The seq_window_size value in the Ethereum environment is 3600; we considered that the BSC block time is 3s, so 3600*12/3 = 14400 ≈ 16800
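For clarity, the block-time conversion in the bullets above works out as follows (plain arithmetic only, not a recommendation):

package main

import "fmt"

func main() {
	const ethBlockTime, bscBlockTime = 12, 3 // seconds

	// Rescaling Ethereum-tuned block counts so the wall-clock duration stays
	// roughly the same on a 3-second L1 (BSC):
	fmt.Println(40 * ethBlockTime / bscBlockTime)   // 160   batcher channel timeout (160 * 3s = 8 min)
	fmt.Println(120 * ethBlockTime / bscBlockTime)  // 480   rollup.json channel_timeout (480 * 3s = 24 min)
	fmt.Println(3600 * ethBlockTime / bscBlockTime) // 14400 seq_window_size (12 h; rounded up to 16800 = 14 h here)
}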

@protolambda
Contributor

We have modified block_time to 1s, and noticed that op-geth's default NewPayloadTimeout value (2s) may be inconsistent with that. Should we modify NewPayloadTimeout to 1s as well?

No, payload-building timeouts should be generous. If it's too strict, there's a chance it gets stuck on a valid canonical block that takes a little too much time to build. On L1, at 30M gas, we generally regard the max building time as around 2 seconds. A 2-second block time is already tight, and the tuned EIP-1559 parameters avoid too many repeated gas bursts from repeatedly hitting this 2-second build time by targeting less gas on average and adjusting it upwards exponentially.

If you fall behind a little bit due to a large block, then with the right gas-adjustment parameters the next blocks will be lighter, take less time to build, and thus get the L2 chain back in sync with the wall clock.

@sebastianst
Member

  • Got it. How about setting OP_BATCHER_CHANNEL_TIMEOUT to 40*12/3 and channel_timeout in rollup.json to 120*12/3?
    • 12 is the Ethereum block time, and 3 is the BSC block time
  • The seq_window_size value in the Ethereum environment is 3600 (https://github.com/ethereum-optimism/optimism/blob/3177f8850ed94c7935032155f94dd2629d809f96/op-node/chaincfg/chains.go#L62); we considered that the BSC block time is 3s, so 3600*12/3 = 14400 ≈ 16800
  • The new batcher queries the channel timeout from its rollup node, so this parameter (OP_BATCHER_CHANNEL_TIMEOUT) is obsolete.
  • A channel timeout of 480 in your rollup.json sounds reasonable then. 16800 for the SWS (a bit above 3600*12/3 = 14400) should be fine too.
  • Did you try the batcher from op-batcher: Add channel duration limiting #4990 (commit c8f16b5) yet? Try it with this config
      OP_BATCHER_MAX_CHANNEL_DURATION: 20 # 1min, adjust if you want longer
      OP_BATCHER_MAX_L1_TX_SIZE_BYTES: 120000
      OP_BATCHER_TARGET_L1_TX_SIZE_BYTES: 100000
      OP_BATCHER_TARGET_NUM_FRAMES: 1
      OP_BATCHER_APPROX_COMPR_RATIO: 0.6 # note, updated from 0.4
      OP_BATCHER_SUB_SAFETY_MARGIN: 20

The high target and max tx size make sure that the batcher sends full frames when it's catching up after being offline for some time.

@sebastianst
Member

sebastianst commented Mar 1, 2023

@nolanxyg I looked at your logs screenshot. It's conspicuous that the error at 10:43 mentions a frame number of 14, because with your config (OP_BATCHER_TARGET_NUM_FRAMES: 1) the batcher should only send txs whose data includes 1-2 frames. So it looks to me like the frame data is total garbage. But then it seems like a miracle that parsing gets as far as "frame 14" and that the 13 "frames" before it unmarshal without error. Also, the first byte of the tx data must be correctly set to 0 = DerivationVersion0.
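For background on why that first byte matters: batcher tx data is the derivation version byte followed by a sequence of serialized frames, so a single stray or repeated byte shifts every later field and the decoded frame number and length become random. A rough sketch of such a walk, assuming the frame layout from the derivation spec (16-byte channel ID, uint16 frame number, uint32 length, frame data, 1-byte is_last); illustrative only, not the actual op-node parser:

package derive

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// DerivationVersion0 is the expected first byte of batcher tx data.
const DerivationVersion0 = 0

// countFrames walks the version byte and then each frame. Any mis-alignment at
// the front makes every subsequent header field, including the frame length,
// effectively random.
func countFrames(data []byte) (int, error) {
	if len(data) == 0 || data[0] != DerivationVersion0 {
		return 0, fmt.Errorf("unexpected derivation version byte")
	}
	r := bytes.NewReader(data[1:])
	count := 0
	for r.Len() > 0 {
		var header [16 + 2 + 4]byte // channel_id | frame_number | frame_length
		if _, err := io.ReadFull(r, header[:]); err != nil {
			return count, err
		}
		frameLength := binary.BigEndian.Uint32(header[18:22])
		// skip frame_data plus the trailing is_last byte
		if _, err := r.Seek(int64(frameLength)+1, io.SeekCurrent); err != nil {
			return count, err
		}
		count++
	}
	return count, nil
}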

I'm curious if you still get errors with the #4990 batcher. Also, could you enable batcher debug logging (OP_BATCHER_LOG_LEVEL: "debug") and upload the logs near before and after the time when the node detects broken batcher tx data?

@nolanxyg
Contributor Author

nolanxyg commented Mar 2, 2023

No, payload-building timeouts should be generous. If it's too strict, there's a chance it gets stuck on a valid canonical block that takes a little too much time to build. On L1, at 30M gas, we generally regard the max building time as around 2 seconds. A 2-second block time is already tight, and the tuned EIP-1559 parameters avoid too many repeated gas bursts from repeatedly hitting this 2-second build time by targeting less gas on average and adjusting it upwards exponentially.

If you fall behind a little bit due to a large block, then with the right gas-adjustment parameters the next blocks will be lighter, take less time to build, and thus get the L2 chain back in sync with the wall clock.

Nice explanation, understood, thanks! @protolambda 👍

We set l2_block_time to 1s considering there's no consensus on L2 for now, so we tried an aggressive value. Does a 1s block time have potential risks? (If so, we'll change back to 2s.)

@nolanxyg
Contributor Author

nolanxyg commented Mar 2, 2023

Did you try the batcher from #4990 (commit c8f16b5) yet? Try it with this config

     OP_BATCHER_MAX_CHANNEL_DURATION: 20 # 1min, adjust if you want longer
     OP_BATCHER_MAX_L1_TX_SIZE_BYTES: 120000
     OP_BATCHER_TARGET_L1_TX_SIZE_BYTES: 100000
     OP_BATCHER_TARGET_NUM_FRAMES: 1
     OP_BATCHER_APPROX_COMPR_RATIO: 0.6 # note, updated from 0.4
     OP_BATCHER_SUB_SAFETY_MARGIN: 20

OK, we'll try this and give feedback soon

Also, could you enable batcher debug logging (OP_BATCHER_LOG_LEVEL: "debug") and upload the logs near before and after the time when the node detects broken batcher tx data?

Sure

@sebastianst
Member

@nolanxyg #4990 just got merged into develop, so you can use the batcher from develop now.

@sebastianst
Member

I might want to reproduce this locally. Can you estimate the minimal steps to reproduce this? If I spin up a new network, but don't start the batcher yet, produce a good amount of L2 tx volume, then spin up the batcher after a few hours, would that already trigger it?

@nolanxyg
Contributor Author

nolanxyg commented Mar 2, 2023

To follow the same steps as ours, I think you could spin up a network and verify that everything works properly (including the batcher, but with a low batcher balance), produce a good amount of L2 tx volume so the batcher soon runs out of money, then after a few hours (e.g. overnight) fund the batcher to get it back to work, and finally check whether the network recovers and whether the frame error log shows up.

Because I suspect that the interruption of submit-tx sending (due to insufficient funds) causes the error:

ERROR[02-26|09:11:42.718] unable to publish transaction            service=batcher txHash=9aadf5..d71a8f nonce=55244 gasTipCap=20,000,000,000 gasFeeCap=20,000,000,000 err="insufficient funds for gas * price + value"

@s7v7nislands
Contributor

@sebastianst I think the PR fixes the bug, please take a look.

@s7v7nislands
Contributor

I might want to reproduce this locally. Can you estimate the minimal steps to reproduce this? If I spin up a new network, but don't start the batcher yet, produce a good amount of L2 tx volume, then spin up the batcher after a few hours, would that already trigger it?

You can change the code in txmgr.go to always return an error in SendTransaction() and print the data in nextTxData; you will find the version byte repeated over and over.
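In other words, the failure pattern is the version byte being prepended again each time the same buffered data is retried. A minimal sketch of that pattern (the nextTxData name is taken from the comment above, but the body and example are purely illustrative, not the actual op-batcher/txmgr code):

package main

import "fmt"

// nextTxData mimics the buggy pattern: the version byte is prepended to the
// same buffer every time the data is re-queued after a failed SendTransaction,
// instead of only once when the tx data is first assembled.
func nextTxData(pending []byte) []byte {
	const derivationVersion0 = 0x00
	return append([]byte{derivationVersion0}, pending...)
}

func main() {
	data := []byte{0xAA, 0xBB} // stands in for real frame bytes
	for attempt := 0; attempt < 3; attempt++ {
		data = nextTxData(data) // each failed send re-wraps the same data
	}
	fmt.Printf("% x\n", data) // 00 00 00 aa bb -- version byte repeated
}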

@sebastianst
Member

Thanks! That should fix it indeed! I'll also take a closer look at whether it makes sense to clean this up with a small refactor, because this prepending and appending of version zeros is quite error prone 😅
