[BUG] op-node parse L1 batch data error: frameLength is too large: 2838834252 #4937

Closed
nolanxyg opened this issue Feb 22, 2023 · 22 comments · Fixed by #5050

@nolanxyg
Contributor

Describe the bug

Op-node parsing of the batch data submitted by the batcher hits an unrecoverable error: frameLength is too large: 2838834252. This leaves op-node unable to keep reconciling with L1.

To Reproduce
Steps to reproduce the behavior:

  1. stop batcher for several hours
  2. start batcher
  3. op-node prints an error log: Feb 22, 2023 @ 11:32:26.580 WARN [02-22|03:32:26.580] Failed to parse frames origin=f4c844..525ff1:1589279 err="frameLength is too large: 2838834252"

Expected behavior

Op-node should keep reconciling with L1.

System Specs:

  • OS: Ubuntu
  • Package Version (or commit hash): 173181b

@protolambda
Contributor

This error looks wrong to me: the frameLength should never be that large. I think it hit a decoding error, and read the wrong data as frame-length.

In the verifier this was recently changed (hence also the different error-message formatting) due to a finding from the Sherlock bug-bounty contest; see #4867, which may be related.

But the error path is still the same:

This verifier code-path here functions as expected: bad frames are dropped and skipped over.

The batcher however needs to ensure it only outputs valid frames, or the frames will not be accepted.

The maximum frame size is configurable:

And based on the max L1 tx size by default:

MaxFrameSize: cfg.MaxL1TxSize - 1, // subtract 1 byte for version

Which is smaller than the enforced frame size:

const MaxFrameLen = 1_000_000

So assuming the settings are left at their defaults, this may indicate a bug in the frame encoding (possibly related to, or a duplicate of, the Sherlock contest finding) that corrupts the output when buffering large amounts of data, resulting in the absurd 2.8+ GB frame-size number.
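For reference, the verifier-side rejection mentioned above amounts to a bounds check on the decoded length before anything is allocated. A minimal sketch of such a guard, assuming the big-endian uint32 length field of the frame format (the function name and error wording are illustrative, not the actual op-node code):

package derive

import (
	"encoding/binary"
	"fmt"
	"io"
)

// MaxFrameLen mirrors the enforced limit quoted above.
const MaxFrameLen = 1_000_000

// readFrameLength decodes the length field and refuses absurd values
// (e.g. 2838834252) before trying to read that many bytes.
func readFrameLength(r io.Reader) (uint32, error) {
	var buf [4]byte
	if _, err := io.ReadFull(r, buf[:]); err != nil {
		return 0, err
	}
	frameLength := binary.BigEndian.Uint32(buf[:])
	if frameLength > MaxFrameLen {
		return 0, fmt.Errorf("frameLength is too large: %d", frameLength)
	}
	return frameLength, nil
}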

cc @sebastianst flagging possible op-batcher bug.

@sebastianst
Member

Thanks @protolambda for the initial analysis!

The frameLength of 2838834252 looks suspicious to me because it has the highest uint32 bit set: I think it never was a frameLength to begin with, but rather random data that got decoded as a frame length. The unmarshaling code doesn't do any sanity check before this first check on the max frame length. I bet the frame number and channel ID decoded before it are also random.
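To make that concrete: 2838834252 is 0xA9352C4C, so its top bit is set, which no real frame length (capped at 1,000,000) would ever have. A quick standalone check (illustrative only, not op-node code):

package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	// Four arbitrary bytes read as a big-endian uint32 give the reported "frame length".
	raw := []byte{0xA9, 0x35, 0x2C, 0x4C}
	v := binary.BigEndian.Uint32(raw)
	fmt.Println(v)              // 2838834252
	fmt.Println(v&(1<<31) != 0) // true: the highest uint32 bit is set
}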

@nolanxyg could you provide some more context?

  • What op-batcher and op-node configuration do you use?
    • In particular, what channel configuration (target and max tx size, approx. compr. ratio)
  • What's your L1 here?
  • Is there any tx volume happening during the hours that your batcher is offline?

Please also give this a try with a more recent version of the monorepo. A few changes have been made to the batcher and node since your specified commit 173181b.

@sebastianst
Member

Tracking at CLI-3428

@sebastianst
Member

@nolanxyg see my above message, would love to get more info so that we can reproduce 🙏🏻

@nolanxyg
Contributor Author

Sorry 🙏🏻
@s7v7nislands please help provide more context, thanks

@nolanxyg
Contributor Author

@nolanxyg could you provide some more context?

  • What op-batcher and op-node configuration do you use?

    • In particular, what channel configuration (target and max tx size, approx. compr. ratio)
  • What's your L1 here?

  • Is there any tx volume happening during the hours that your batcher is offline?

Please also give this a try with a more recent version of the monorepo. A few changes have been made to the batcher and node since your specified commit 173181b.

image

  • BSC
  • Yes, we were doing a load test

OK, we will try a more recent commit to see if this issue is fixed.

@sebastianst
Member

sebastianst commented Feb 27, 2023

@nolanxyg Thanks, some op-node and L2 config would also be appreciated. E.g. what's your L2 block time? Note BSC seems to have a block time of 3 sec.

With your batcher configuration, it will have a hard time catching up because of the low target of 624 bytes. That value is/was set in our tests to force eager batch generation. But in realistic scenarios with high L2 volume, it's too low because each channel will only contain ~624 bytes of input data (and even less output data, because it gets compressed) and batcher txs are sent sequentially.

Yesterday I added a new channel duration limiting functionality to the batcher at #4990

I suggest you test your system with this new batcher and also change the config as suggested in this PR's description, but with higher max channel duration and sub safety margin. It allows you to set a low channel duration so that channels are not kept open for too long (which gives a better UX at the expense of emptier channels during low L2 tx volume). This could be set to e.g. MaxChannelDuration = 40 in a realistic setting to ensure that single-frame channels are submitted to L1 after at most 2 minutes (40 * 3 sec (L1 block time) = 2 min). But I don't know your requirements so whether you want to use channel duration limiting at all depends on your use case. My guess is you want to enable it, especially in your testnet.

Then also increase the target tx size to 100000 and use a realistic value for the approx. compression ratio (0.4). Note also that newer batchers pull the channel timeout from the rollup node, so you can delete that parameter (OP_BATCHER_CHANNEL_TIMEOUT).

tl;dr

Use batcher from #4990 and try with the following config (convert to env var names)

		MaxChannelDuration:        40, // 40 * 3 sec (L1 block time) = 2 min
		MaxL1TxSize:               120_000,
		TargetL1TxSize:            100_000,
		TargetNumFrames:           1,
		ApproxComprRatio:          0.4,
		SubSafetyMargin:           20,

This is similar to our e2e test config, but has a higher max channel duration and submission safety margin.
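For convenience, the same values rendered as env vars, using the OP_BATCHER_* names that appear later in this thread (a sketch; double-check the flag names against your batcher version):

    OP_BATCHER_MAX_CHANNEL_DURATION: 40 # 40 * 3 sec (L1 block time) = 2 min
    OP_BATCHER_MAX_L1_TX_SIZE_BYTES: 120000
    OP_BATCHER_TARGET_L1_TX_SIZE_BYTES: 100000
    OP_BATCHER_TARGET_NUM_FRAMES: 1
    OP_BATCHER_APPROX_COMPR_RATIO: 0.4
    OP_BATCHER_SUB_SAFETY_MARGIN: 20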

@nolanxyg
Contributor Author

Cool, Thanks a ton! @sebastianst 👍

Here's rollup.json:

{
  "genesis": {
    "l1": {
      "hash": "0x32a66f95f9e04e3eb3bc71b4b02b26ff1280ce8132aa0d58e9cd9f5bcfe6f2e3",
      "number": 1740287
    },
    "l2": {
      "hash": "0x74207c6f3ede9fa5e0db10aee7a3e13335978fbf07ef816b4c0bb964133ac0fa",
      "number": 0
    },
    "l2_time": 1677489770,
    "system_config": {
      "batcherAddr": "0xd532d4705df2f6422b02aba6a55955f21aab2294",
      "overhead": "0x0000000000000000000000000000000000000000000000000000000000000834",
      "scalar": "0x00000000000000000000000000000000000000000000000000000000000f4240",
      "gasLimit": 100000000
    }
  },
  "block_time": 1,
  "max_sequencer_drift": 300,
  "seq_window_size": 16800,
  "channel_timeout": 40,
  "l1_chain_id": 714,
  "l2_chain_id": 716,
  "batch_inbox_address": "0xff00000000000000000000000000000000000000",
  "deposit_contract_address": "0xe4562148368da2e2f94b2c590c17b550c79eb5c6",
  "l1_system_config_address": "0x9609d0e7084e59f9b779f9976e9c07c039f8a026"
}

For geth-config.toml, I ran geth dumpconfig but got an error, the same as ethereum/go-ethereum#26328, so I can only show the start command:

geth \
                  --datadir="/db" \
                  --verbosity=3 \
                  --http \
                  --http.corsdomain="*" \
                  --http.vhosts="*" \
                  --http.addr=0.0.0.0 \
                  --http.port=8545 \
                  --http.api=eth,engine \
                  --ws \
                  --ws.addr=0.0.0.0 \
                  --ws.port=8546 \
                  --ws.origins="*" \
                  --ws.api=eth,engine \
                  --syncmode=full \
                  --nodiscover \
                  --maxpeers=1 \
                  --networkid=$(CHAIN_ID) \
                  --unlock="$(BLOCK_SIGNER_ADDRESS)" \
                  --mine \
                  --miner.etherbase="$(BLOCK_SIGNER_ADDRESS)" \
                  --miner.gaslimit=25000000 \
                  --password=/db/password \
                  --allow-insecure-unlock \
                  --authrpc.addr="0.0.0.0" \
                  --authrpc.port="8551" \
                  --authrpc.vhosts="*" \
                  --authrpc.jwtsecret=/config/jwt-secret.txt \
                  --gcmode=archive \
                  --verbosity=4 \
                  --pprof \
                  --metrics \
                  --metrics.port 6060 \
                  --metrics.addr 0.0.0.0

Thanks for pointing out the 624 bytes! We also found that it's too small, and we'll optimize the parameters according to your advice. Really appreciated!

We have modified block_time to 1s, and noticed that op-geth's default NewPayloadTimeout value (2s) may be inconsistent with that. Should we modify NewPayloadTimeout to 1s as well?

@sebastianst
Member

sebastianst commented Feb 28, 2023

Quick feedback on your rollup config:

  • a channel timeout of 40 (2 min in your case) seems rather low. It's the value I proposed setting as max channel duration in the batcher, but as a consensus variable I'd set this higher, e.g. at least to 120 (6 min ).
  • why the very high seq window size of 16800?

We'll get back regarding the NewPayloadTimeout soon.

Looking forward to your feedback on the new batcher!

@nolanxyg
Contributor Author

nolanxyg commented Mar 1, 2023

Please also give this a try with a more recent version of the monorepo. A few changes have been made to the batcher and node since your specified commit 173181b.

Hi @sebastianst, we have tried again with code version https://github.com/ethereum-optimism/optimism/releases/tag/op-proposer%2Fv1.0.0-rc.1, and the frameLength is too large error came up again.

Here's what we did:

  1. Set up the whole network (configured with the info mentioned above)
  2. Stop the batcher for several hours (or, in our case, the batcher ran out of money, as shown in the monitoring screenshot below)

image

  3. Start batcher (or fund the batcher, in our case at 03-01 10:43; the batcher soon ran out of money again, so we funded it again at 03-01 11:29)
  4. Check the op-node log: we found the frameLength is too large error at about 03-01 10:43 and 03-01 11:29, which is when the batcher returned to work

image

@nolanxyg
Contributor Author

nolanxyg commented Mar 1, 2023

Quick feedback on your rollup config:

  • a channel timeout of 40 (2 min in your case) seems rather low. It's the value I proposed setting as max channel duration in the batcher, but as a consensus variable I'd set this higher, e.g. at least to 120 (6 min ).
  • why the very high seq window size of 16800?

We'll get back regarding the NewPayloadTimeout soon.

Looking forward to your feedback on the new batcher!

  • Got it. How about setting OP_BATCHER_CHANNEL_TIMEOUT to 40*12/3 and channel_timeout in rollup.json to 120*12/3? (Conversion worked out below.)
    • 12 is the Ethereum block time, and 3 is the BSC block time
  • The seq_window_size value in the Ethereum environment is 3600; we considered that the BSC block time is 3s, so 3600*12/3 = 14400 ≈ 16800
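For clarity, the block-time conversion in the bullets above works out as follows (plain arithmetic only, not a recommendation):

package main

import "fmt"

func main() {
	const ethBlockTime, bscBlockTime = 12, 3 // seconds

	// Rescaling Ethereum-tuned block counts so the wall-clock duration stays
	// roughly the same on a 3-second L1 (BSC):
	fmt.Println(40 * ethBlockTime / bscBlockTime)   // 160   batcher channel timeout (160 * 3s = 8 min)
	fmt.Println(120 * ethBlockTime / bscBlockTime)  // 480   rollup.json channel_timeout (480 * 3s = 24 min)
	fmt.Println(3600 * ethBlockTime / bscBlockTime) // 14400 seq_window_size (12 h; rounded up to 16800 = 14 h here)
}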

@protolambda
Contributor

We have modified block_time to 1s, and noticed that op-geth's default NewPayloadTimeout value (2s) may be inconsistent with that. Should we modify NewPayloadTimeout to 1s as well?

No, payload-building timeouts should be generous. If it's too strict, there's a chance it gets stuck on a valid canonical block that takes a little too much time to build. On L1, at 30M gas, we generally regard the max building time as around 2 seconds. A 2-second block time is already tight, and the tuned EIP-1559 parameters avoid too many repeated gas bursts from repeatedly hitting this 2-second build time by targeting less gas on average and adjusting it upwards exponentially.

If you fall behind a little bit due to a large block, then with the right gas-adjustment parameters the next blocks will be lighter, take less time to build, and thus get the L2 chain back in sync with the wall clock.

@sebastianst
Member

  • Got it. How about setting OP_BATCHER_CHANNEL_TIMEOUT to 40*12/3 and channel_timeout in rollup.json to 120*12/3?
    • 12 is the Ethereum block time, and 3 is the BSC block time
  • The seq_window_size value in the Ethereum environment is 3600 (https://github.com/ethereum-optimism/optimism/blob/3177f8850ed94c7935032155f94dd2629d809f96/op-node/chaincfg/chains.go#L62); we considered that the BSC block time is 3s, so 3600*12/3 = 14400 ≈ 16800
  • The new batcher queries the channel timeout from its rollup node, so this parameter (OP_BATCHER_CHANNEL_TIMEOUT) is obsolete.
  • A channel timeout of 480 in your rollup.json sounds reasonable then. 16800 for the SWS (a bit above 3600*12/3 = 14400) should be fine too.
  • Did you try the batcher from op-batcher: Add channel duration limiting #4990 (commit c8f16b5) yet? Try it with this config
      OP_BATCHER_MAX_CHANNEL_DURATION: 20 # 1min, adjust if you want longer
      OP_BATCHER_MAX_L1_TX_SIZE_BYTES: 120000
      OP_BATCHER_TARGET_L1_TX_SIZE_BYTES: 100000
      OP_BATCHER_TARGET_NUM_FRAMES: 1
      OP_BATCHER_APPROX_COMPR_RATIO: 0.6 # note, updated from 0.4
      OP_BATCHER_SUB_SAFETY_MARGIN: 20

The high target and max tx size make sure that the batcher sends full frames when it's catching up after being offline for some time.

@sebastianst
Member

sebastianst commented Mar 1, 2023

@nolanxyg I looked at your logs screenshot. It's conspicuous that the error at 10:43 mentions a frame number of 14, because with your config (OP_BATCHER_TARGET_NUM_FRAMES: 1) the batcher should only send txs whose data includes 1-2 frames. So it looks to me like the frame data is total garbage. But then it seems like a miracle that parsing gets as far as "frame 14" and that the 13 "frames" before it unmarshal without error. Also, the first byte of the tx data must be correctly set to 0 = DerivationVersion0.
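For background on why that first byte matters: batcher tx data is the derivation version byte followed by a sequence of serialized frames, so a single stray or repeated byte shifts every later field and the decoded frame number and length become random. A rough sketch of such a walk, assuming the frame layout from the derivation spec (16-byte channel ID, uint16 frame number, uint32 length, frame data, 1-byte is_last); illustrative only, not the actual op-node parser:

package derive

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// DerivationVersion0 is the expected first byte of batcher tx data.
const DerivationVersion0 = 0

// countFrames walks the version byte and then each frame. Any mis-alignment at
// the front makes every subsequent header field, including the frame length,
// effectively random.
func countFrames(data []byte) (int, error) {
	if len(data) == 0 || data[0] != DerivationVersion0 {
		return 0, fmt.Errorf("unexpected derivation version byte")
	}
	r := bytes.NewReader(data[1:])
	count := 0
	for r.Len() > 0 {
		var header [16 + 2 + 4]byte // channel_id | frame_number | frame_length
		if _, err := io.ReadFull(r, header[:]); err != nil {
			return count, err
		}
		frameLength := binary.BigEndian.Uint32(header[18:22])
		// skip frame_data plus the trailing is_last byte
		if _, err := r.Seek(int64(frameLength)+1, io.SeekCurrent); err != nil {
			return count, err
		}
		count++
	}
	return count, nil
}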

I'm curious if you still get errors with the #4990 batcher. Also, could you enable batcher debug logging (OP_BATCHER_LOG_LEVEL: "debug") and upload the logs near before and after the time when the node detects broken batcher tx data?

@nolanxyg
Contributor Author

nolanxyg commented Mar 2, 2023

No, payload-building timeouts should be generous. If it's too strict, there's a chance it gets stuck on a valid canonical block that takes a little too much time to build. On L1, at 30M gas, we generally regard the max building time as around 2 seconds. A 2-second block time is already tight, and the tuned EIP-1559 parameters avoid too many repeated gas bursts from repeatedly hitting this 2-second build time by targeting less gas on average and adjusting it upwards exponentially.

If you fall behind a little bit due to a large block, then with the right gas-adjustment parameters the next blocks will be lighter, take less time to build, and thus get the L2 chain back in sync with the wall clock.

Nice explanation, understood, thanks! @protolambda 👍

We set l2_block_time to 1s considering there's no consensus on L2 for now, so we tried an aggressive value. Does a 1s block time have potential risks? (If so, we'll change back to 2s.)

@nolanxyg
Contributor Author

nolanxyg commented Mar 2, 2023

Did you try the batcher from #4990 (commit c8f16b5) yet? Try it with this config

     OP_BATCHER_MAX_CHANNEL_DURATION: 20 # 1min, adjust if you want longer
     OP_BATCHER_MAX_L1_TX_SIZE_BYTES: 120000
     OP_BATCHER_TARGET_L1_TX_SIZE_BYTES: 100000
     OP_BATCHER_TARGET_NUM_FRAMES: 1
     OP_BATCHER_APPROX_COMPR_RATIO: 0.6 # note, updated from 0.4
     OP_BATCHER_SUB_SAFETY_MARGIN: 20

OK, we'll try this and give feedback soon

Also, could you enable batcher debug logging (OP_BATCHER_LOG_LEVEL: "debug") and upload the logs near before and after the time when the node detects broken batcher tx data?

Sure

@sebastianst
Member

@nolanxyg #4990 just got merged into develop, so you can use the batcher from develop now.

@sebastianst
Member

I might want to reproduce this locally. Can you estimate the minimal steps to reproduce this? If I spin up a new network, but don't start the batcher yet, produce a good amount of L2 tx volume, then spin up the batcher after a few hours, would that already trigger it?

@nolanxyg
Contributor Author

nolanxyg commented Mar 2, 2023

To follow the same steps as ours, I think you could spin up a network and verify that everything works properly (including the batcher, but with a low batcher balance), produce a good amount of L2 tx volume so the batcher soon runs out of money, then after a few hours (e.g. overnight) fund the batcher to get it back to work, and finally check whether the network recovers and whether the frame error log shows up.

Because I suspect that the interruption of submit-tx sending (due to insufficient funds) causes the error:

ERROR[02-26|09:11:42.718] unable to publish transaction            service=batcher txHash=9aadf5..d71a8f nonce=55244 gasTipCap=20,000,000,000 gasFeeCap=20,000,000,000 err="insufficient funds for gas * price + value"

@s7v7nislands
Contributor

@sebastianst I think the PR fixes the bug, please take a look.

@s7v7nislands
Contributor

I might want to reproduce this locally. Can you estimate the minimal steps to reproduce this? If I spin up a new network, but don't start the batcher yet, produce a good amount of L2 tx volume, then spin up the batcher after a few hours, would that already trigger it?

You can change the code in txmgr.go to always return an error in SendTransaction() and print the data in nextTxData; you will find the version byte repeated over and over.
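In other words, the failure pattern is the version byte being prepended again each time the same buffered data is retried. A minimal sketch of that pattern (the nextTxData name is taken from the comment above, but the body and example are purely illustrative, not the actual op-batcher/txmgr code):

package main

import "fmt"

// nextTxData mimics the buggy pattern: the version byte is prepended to the
// same buffer every time the data is re-queued after a failed SendTransaction,
// instead of only once when the tx data is first assembled.
func nextTxData(pending []byte) []byte {
	const derivationVersion0 = 0x00
	return append([]byte{derivationVersion0}, pending...)
}

func main() {
	data := []byte{0xAA, 0xBB} // stands in for real frame bytes
	for attempt := 0; attempt < 3; attempt++ {
		data = nextTxData(data) // each failed send re-wraps the same data
	}
	fmt.Printf("% x\n", data) // 00 00 00 aa bb -- version byte repeated
}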

@sebastianst
Member

Thanks! That should fix it indeed! I'll also take a closer look at whether it makes sense to clean this up with a small refactor, because this prepending and appending of version zeros is quite error prone 😅
