Lodestar stopped keeping up with head on mainnet #6861

Open
concentricspheres opened this issue Jun 6, 2024 · 32 comments
Labels: meta-bug (Issues that identify a bug and require a fix), meta-investigate (Issues found that require further investigation and may not have a specific resolution/fix), prio-high (Resolve issues as soon as possible)

Comments

@concentricspheres

Describe the bug

Lodestar stopped keeping up with head on mainnet after block ~ 19972650

Expected behavior

To keep up with head

Steps to reproduce

No response

Additional context

consensus.log.gz

Operating system

Linux

Lodestar version or commit hash

v1.18.1/eefb711

concentricspheres added the meta-bug label on Jun 6, 2024
@nflaig
Member

nflaig commented Jun 6, 2024

Two observations from the logs

  1. Seems like your EL was still syncing before Lodestar started to fall behind:
Jun-06 15:41:41.000[] info: Synced - slot: 9238706 - head: 0x4901…b19e - exec-block: syncing(20033746 0x9174…) - finalized: 0x6724…7027:288707 - peers: 66
Jun-06 15:41:29.000[] info: Synced - slot: 9238705 - head: 0xf8f2…6ddd - exec-block: syncing(20033745 0x538c…) - finalized: 0x6724…7027:288707 - peers: 62
Jun-06 15:41:17.000[] info: Synced - slot: 9238704 - head: 0x165e…6111 - exec-block: syncing(20033744 0x40c5…) - finalized: 0x6724…7027:288707 - peers: 67
Jun-06 15:41:05.000[] info: Synced - slot: 9238703 - head: 0x1a85…16df - exec-block: syncing(20033743 0x52c6…) - finalized: 0x6724…7027:288707 - peers: 68
  2. Somehow the validator client is still submitting attestations even though your node was no longer synced; any chance you have multiple beacon nodes running?

From the logs it's hard to tell why it started to fall behind. It's not a general issue for this block, as my node is still synced; it might require debug logs to figure out what's going on.

If your node is still not back to head, you might wanna force a checkpoint sync by passing

--checkpointSyncUrl <url> --forceCheckpointSync
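
For example, a full command could look something like this (a sketch only; the URL is just a placeholder for whatever trusted beacon node or checkpoint provider you use):

lodestar beacon --network mainnet --checkpointSyncUrl <trusted-beacon-or-checkpoint-url> --forceCheckpointSync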

nflaig added the meta-investigate label on Jun 6, 2024
@yorickdowne

yorickdowne commented Jun 6, 2024

Hey @nflaig . We've had this happen on two nodes in our Lido clusters. Yes, we run multiple CL:EL, the VC is a Vouch.

Lodestar started syncing after we restarted it. We didn't want to wait so we checkpoint synced to get back on track.

So, this is definitely a thing, it happened on different blocks on two nodes.

@claravanstaden
Contributor

We are running Lodestar 1.16.0 and having the exact same issue. We will upgrade soon, but it looks like this is still an issue on newer versions. This issue is fairly devastating for our use case as it increases bridging latency. It also occurs silently, with no error or warning in the logs.

For now, we have a script that compares Lodestar's finalized slot with the finalized slot on Beaconcha.in and restarts the node if it falls behind too far, but this obviously needs to be resolved at the root cause.
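
For reference, a minimal sketch of this kind of watchdog is below. It is not the actual script described above; it assumes Lodestar's REST API on localhost:9596, the standard finality_checkpoints endpoint, a Beaconcha.in response that exposes data.epoch, an arbitrary 2-epoch threshold and a placeholder docker restart command, all of which would need adapting to your own setup.

// watchdog.ts - sketch of a finalized-epoch watchdog (Node 18+, global fetch)
import { execSync } from "node:child_process";

const MAX_EPOCH_LAG = 2; // arbitrary threshold; the thread discusses values between 1 and 4

async function lodestarFinalizedEpoch(): Promise<number> {
  // Standard beacon API; querying by head so the checkpoint reflects the node's latest view
  const res = await fetch("http://localhost:9596/eth/v1/beacon/states/head/finality_checkpoints");
  const body = await res.json();
  return Number(body.data.finalized.epoch);
}

async function beaconchainFinalizedEpoch(): Promise<number> {
  // Response shape assumed for Beaconcha.in's /api/v1/epoch/finalized endpoint
  const res = await fetch("https://beaconcha.in/api/v1/epoch/finalized");
  const body = await res.json();
  return Number(body.data.epoch);
}

async function main(): Promise<void> {
  const [local, remote] = await Promise.all([lodestarFinalizedEpoch(), beaconchainFinalizedEpoch()]);
  console.log(`local finalized epoch=${local}, beaconcha.in finalized epoch=${remote}`);
  if (remote - local > MAX_EPOCH_LAG) {
    // Placeholder restart command; adjust to your deployment (docker, systemd, ...)
    execSync("docker restart lodestar");
  }
}

main().catch(console.error);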

philknows added the prio-high label on Jun 18, 2024
@philknows
Member

philknows commented Jun 18, 2024

Hey all, when this happens next time, would you be able to capture and send the relevant debug logs for us to take a look at? Please also send us the hardware that you're running on to help augment those logs. We need to compile more information here on where to start digging since it seems intermittent and we don't have a clear indicator yet on what might be causing this. Any data would be helpful! Thanks.

@yorickdowne

We have one node, lido3-c, with debug logs enabled. We're waiting for that one in particular to show the issue so we can get you debug logs.

We have since updated Lodestar to 1.19.0 as well, so it's possible we won't see the issue again.

@nflaig
Member

nflaig commented Jun 18, 2024

We have one node, lido3-c, with debug logs enabled.

The default log level for the file logger is debug; unless you have explicitly changed that, all nodes will have those logs available.

@yorickdowne

Ah, OK. Thanks. I see beacon-timestamp.log in the datadir. We'll get you those the next time we see a failure.

@claravanstaden
Contributor

Thanks @philknows, I'll do the same. My file logging was set to info but I have changed it to debug.

@yorickdowne

Possibly an adjacent issue: one of our Lodestars would not catch up; it reported Syncing but kept falling further behind. A restart fixed it.

Logs attached.

beacon-2024-06-21.log.gz

@wemeetagain
Member

@yorickdowne it looks like your EL may have been down during that time?
During the Syncing part of the logs we're seeing lots of this error:

Batch process error id=Finalized, startEpoch=292046, status=Processing, code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=UNAVAILABLE, errorMessage=Request to http://execution:8551 failed, reason: connect ECONNREFUSED 172.18.0.7:8551

which happens when we try to verify execution payloads of blocks that we're syncing.

Somewhat unrelated, one thing we're doing that could be improved: we throw away the batches of blocks we've downloaded when that error happens, and then have to re-download them again and again. This isn't helpful, as we keep getting rate-limited, which causes us to really churn through peers.

@yorickdowne

Yes, I believe the EL had an issue. When restarting the stack, the EL did not shut down cleanly and was reaped by Docker after the 5m timeout.

@claravanstaden
Contributor

Hey all, when this happens next time, would you be able to capture and send the relevant debug logs for us to take a look at? Please also send us the hardware that you're running on to help augment those logs. We need to compile more information here on where to start digging since it seems intermittent and we don't have a clear indicator yet on what might be causing this. Any data would be helpful! Thanks.

@nflaig It happened again a few times in the last few days. I am attaching today's log. We have a script that checks if Lodestar falls more than an epoch behind and then restarts the node. The node was restarted at 06:21, so I would assume the issue is just before that.

beacon-2024-06-26.log.gz

@claravanstaden
Contributor

I have upgraded to Lodestar 1.19.0. Will see if the issue still occurs.

@nflaig
Member

nflaig commented Jun 29, 2024

@claravanstaden Thanks for the logs

The node was restarted at 06:21, so I would assume the issue is just before that.

Looking at the logs around the time the node was restarted, I don't see any sync issues; the last log right before the restart is

Synced - slot: 9379860 - head: 0x8c74…355d - exec-block: syncing(20173993 0x7ddd…) - finalized: 0xdf2b…200b:293118 - peers: 12

The imported chain head seems to be the canonical block as well

New chain head slot=9379861, root=0xa301b49fd3c17df04f098b633b475d95a01d92a9906a1288d634e3bd73dc6b8a, delaySec=2.385999917984009

how is your script determining that the node is behind by one epoch?

Another observation from the logs is that the node is really struggling with keeping a high peer count.

@nflaig
Member

nflaig commented Jun 29, 2024

@yorickdowne The EL was down for quite some time, and Lodestar does not import blocks in that case, causing it to fall behind. This behavior seems consistent with other clients. What would be concerning is if the node can't get back to head after the EL is online again. I tried this a few times, but the node is always able to get back to head quite quickly.

The logs are a bit confusing though, as they print -45000000000 seconds left - -8.25e-9 slots/s, which does not make much sense. What we rather wanna log is something like Syncing - ? left - 0.00 slots/s, indicating that the node is not syncing.

And we are pretty silent about the EL being offline after the first error; we might wanna warn about this at least every epoch, which would make it clearer why it's not keeping up with head.

@yorickdowne

I think you have it: The issue here was that Lodestar did not catch back up when the EL was back online.

Because it’s so intermittent, it’s hard to track down. We haven’t seen it in a while - it’s entirely possible that “whatever it was” is now either resolved or much less likely since 1.19

@claravanstaden
Contributor

@nflaig I have seen the same issue with Lodestar 1.19.0. Adding some logs here, with restart timestamps:

Jun 28 12:48 beacon-2024-06-28.log.gz
Jun 29 16:12 beacon-2024-06-29.log.gz
Jun 30 19:24 beacon-2024-06-30.log.gz

how is your script determining that the node is behind by one epoch?

I compare it with the finalized epoch on BeaconCha.in. I have changed the script to restart if it is more than 2 epochs behind. Maybe that will give a clearer indication. Perhaps we'll see fewer restarts then too. I can't increase the latency more, otherwise it will influence bridge transfer times.

Another observation from the logs is that the node is really struggling with keeping a high peer count.

Interesting, maybe that could be it?

@nflaig
Member

nflaig commented Jul 1, 2024

Adding some logs here, with restart timestamps:

I have looked at all three logs, and right before the restart the node was synced. The only time I can see the node syncing is right after the restart, which is expected.

How do you determine the finalized epoch in Lodestar? Maybe there is a bug somewhere else, but based on the logs the node is definitely not falling behind; the head is correct before the restart.

Might be useful to track which epoch is flagged by your tooling to restart Lodestar, then we can look at that epoch specifically.

Interesting, maybe that could be it?

No, the sync is still fine and the node is keeping up. I'm just not seeing that on any of my nodes; peer count is usually between 100-105.

I am not sure how much impact it has on peering if the EL is reporting syncing. But if you don't run the node with attached validators, it's not that important to have stable peers as long as the node is running fine and keeping up.

@claravanstaden
Contributor

@nflaig thanks for the feedback. I'll keep an eye on it and let you know if I see it restarting after falling behind 2 epochs. Maybe the 1 epoch threshold was too aggressive and it was restarting without there being a problem.

How do you determine the finalized epoch in Lodestar? Maybe there is a bug somewhere else, but based on the logs the node is definitely not falling behind; the head is correct before the restart.

I am using this endpoint: curl http://localhost:9596/eth/v1/beacon/light_client/finality_update - using data.finalized_header.beacon.slot

Might be useful to track which epoch is flagged by your tooling to restart Lodestar, then we can look at that epoch specifically.

Will note this down if it happens again.

I am not sure how much impact it has on peering if the EL is reporting syncing.

I am running without an EL node: --execution.engineMock --eth1 false

@claravanstaden
Contributor

@nflaig it happened again last night, with the changed setting to allow 2 epochs difference. Please let me know if you can see any issue.

1 July 2024 22:48
beacon-2024-07-01.log.gz

@nflaig
Member

nflaig commented Jul 3, 2024

@claravanstaden I still need to review our code more but my guess right now is that it's an issue with the light client server.

Looking at the timestamps of your restarts, this shows an interesting pattern:

Jun 28 12:48 beacon-2024-06-28.log.gz
Jun 29 16:12 beacon-2024-06-29.log.gz
Jun 30 19:24 beacon-2024-06-30.log.gz
Jul 1 22:48 beacon-2024-07-01.log.gz

Looks like the restart happens every ~27 hours, which is the sync committee period (256 epochs × 32 slots × 12 s ≈ 27.3 hours). My suspicion is that we are not updating the finality update in the light client server when the sync committee changes.

Possibly related to that, there are a lot of attestedData not available logs, which are from the light client server as well.

with the changed setting to allow 2 epochs difference

If you set it to 3 epochs it would not happen, I think. But anyway, restarting Lodestar is pointless here; if you wait 3 epochs it will report the correct finality update again from the new sync committee period.

@claravanstaden
Contributor

@nflaig without the script, the actual block syncing of Lodestar would fall behind until it was restarted again. Sometimes it would be days behind. So from what I have seen, it's not just the light client server though. I could also change the script to use the standard beacon API instead of the light client endpoints. Let me do that.

@nflaig
Member

nflaig commented Jul 3, 2024

without the script, the actual block syncing of Lodestar would fall behind until it was restarted again

This is not clear from the logs; the node was synced to head when it was restarted. It would be best to use the head slot from /eth/v1/node/syncing to determine if the node is syncing or not, and you could check head_slot against an external source like beaconcha.in.
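
A minimal sketch of such a check, using only the node's own API plus wall-clock time rather than an external source; the port, the 2-slot tolerance and the mainnet timing constants (genesis 1606824023, 12-second slots) are assumptions to adjust:

// syncCheck.ts - sketch of a self-contained sync check (Node 18+, global fetch)
const GENESIS_TIME = 1606824023; // mainnet genesis timestamp in seconds
const SECONDS_PER_SLOT = 12;

async function checkSync(): Promise<void> {
  const res = await fetch("http://localhost:9596/eth/v1/node/syncing");
  const { data } = await res.json();
  const headSlot = Number(data.head_slot);
  const wallClockSlot = Math.floor((Date.now() / 1000 - GENESIS_TIME) / SECONDS_PER_SLOT);

  if (data.is_syncing || wallClockSlot - headSlot > 2) {
    console.warn(`node behind: head_slot=${headSlot}, wall-clock slot=${wallClockSlot}, is_syncing=${data.is_syncing}`);
  } else {
    console.log(`node in sync at slot ${headSlot}`);
  }
}

checkSync().catch(console.error);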

Looking at https://lodestar-mainnet.chainsafe.io/eth/v1/beacon/light_client/finality_update from our mainnet node, this seems to be correct, and that node has been running for weeks.

Sometimes it would be days behind. So from what I have seen, it's not just the light client server though.

I am a bit skeptical about this. I have been running multiple Lodestar nodes myself for years, and the only time so far any of my nodes fell behind was when my EL database got corrupted and the EL went offline. Not saying it's impossible, but from my experience it's unlikely. Maybe it's related to the engine mock (although I don't see the correlation), but for use cases outside of development or testing, I would recommend running an execution client.

I have opened an issue to track the light client server problem: #6932

@claravanstaden
Contributor

@nflaig I have changed the script to use the main beacon API endpoint. It hasn't restarted the service in a week now, so I'd say it is safe to close this issue, thank you. It looks like the issue was resolved in Lodestar 1.19.0. Thanks again for the assistance!

@nflaig
Member

nflaig commented Jul 29, 2024

Thanks for testing, let's close this and reopen another issue if needed

nflaig closed this as completed on Jul 29, 2024
@claravanstaden
Contributor

Hey @nflaig, I came across this issue again yesterday. I changed our restart script to compare https://beaconcha.in/api/v1/epoch/finalized with http://localhost:9596/eth/v1/beacon/states/finalized/finality_checkpoints (using data.finalized.epoch). The script only restarts the node if the diff is more than 4 epochs.

When the script restarted Lodestar, this was the state:
Beaconcha.in Finalized Epoch: 300835
Lodestar Finalized Epoch: 300829

Please let me know if you see anything in the logs. Lodestar was restarted on Jul 30 13:24.

beacon-2024-07-30.log.gz

@nflaig
Member

nflaig commented Jul 31, 2024

I haven't looked into the logs yet, but if you wanna use GET /eth/v1/beacon/states/{state_id}/finality_checkpoints to determine the currently finalized epoch, you need to query by head and not the finalized state id, or rather check for a 2 epoch diff.

Regarding the 4 diff, it depends on when you queried it and when Lodestar vs. Beaconcha.in updated the finalized epoch. It should happen roughly at the same time but might not be exactly the same.

I don't think it's a good approach in general to compare against external data; Lodestar itself already knows if it's synced or not, so rather use GET /eth/v1/node/syncing, checking is_syncing=true, or you can use the head_slot. For example, we have a Grafana alert that fires if the head slot hasn't advanced for 5 minutes.
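
A rough script equivalent of that alert could look like the sketch below; the endpoint is the standard /eth/v1/node/syncing route, while the port, polling interval and 5-minute threshold are assumptions mirroring the alert described above:

// headStallWatch.ts - sketch of a "head slot has not advanced" check (Node 18+, global fetch)
const POLL_INTERVAL_MS = 60_000;
const STALL_THRESHOLD_MS = 5 * 60_000; // warn if the head has not advanced for 5 minutes

let lastSlot = -1;
let lastAdvance = Date.now();

async function poll(): Promise<void> {
  const res = await fetch("http://localhost:9596/eth/v1/node/syncing");
  const { data } = await res.json();
  const headSlot = Number(data.head_slot);
  if (headSlot > lastSlot) {
    lastSlot = headSlot;
    lastAdvance = Date.now();
  } else if (Date.now() - lastAdvance > STALL_THRESHOLD_MS) {
    console.warn(`head slot stuck at ${lastSlot} for over ${STALL_THRESHOLD_MS / 60_000} minutes`);
  }
}

setInterval(() => poll().catch(console.error), POLL_INTERVAL_MS);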

@claravanstaden
Contributor

I haven't looked into the logs yet, but if you wanna use GET /eth/v1/beacon/states/{state_id}/finality_checkpoints to determine the currently finalized epoch, you need to query by head and not the finalized state id, or rather check for a 2 epoch diff.

This might be true, but 99% of the time these two values are the same or the difference is 1 epoch.

Regarding the 4 diff, it depends on when you queried it and when Lodestar vs. Beaconcha.in updated the finalized epoch. It should happen roughly at the same time but might not be exactly the same.

6 epochs difference indicates a major gap, though, not just a timing issue.

I don't think it's a good approach in general to compare against external data; Lodestar itself already knows if it's synced or not, so rather use GET /eth/v1/node/syncing, checking is_syncing=true, or you can use the head_slot. For example, we have a Grafana alert that fires if the head slot hasn't advanced for 5 minutes.

I disagree with this because the network should have the same view of what is the current finalized block. I can change this script to keep track of the head slot, but please let me know if you see anything in the logs that could have caused this.

@nflaig
Member

nflaig commented Jul 31, 2024

6 epochs difference indicates a major gap, though, not just a timing issue.

Ah right, it's 6 epochs. Based on the logs, your node had 2-3 peers most of the time and then at some point dropped down to 0 and was no longer able to sync, so that explains the disparity.

I disagree with this because the network should have the same view of what is the current finalized block

A node usually has ~100 peers, so it has a pretty good view of the network state based on data from other peers.

It's a bit hard to tell why your node is struggling so much to keep stable peers; there could be many reasons, like low bandwidth. What's notable from the logs is that it has 0 inbound peers. Did you open port 9000 (udp/tcp)? Usually a healthy node will have mostly inbound peers after a while.

It would be good to collect some metrics to get a broader view of how the node is performing over time.

@claravanstaden
Contributor

@nflaig Thanks for getting back to me. OK yeah, I can look at port 9000, but I think what concerns me is why Lodestar does not find new peers to connect to, and why a restart fixes it.

@nflaig
Member

nflaig commented Jul 31, 2024

I think what concerns me is why Lodestar does not find new peers to connect to, and why a restart fixes it.

Not sure what's going on, but I would also suggest running a proper EL client; otherwise your node will forward any block even if it's invalid, which will get it banned by other peers. On restart, it gets assigned a new peer id, which could solve this, but I am not deep enough into peer scoring / networking to give you proper reasoning on why this is happening.

@philknows
Member

We'll keep using this thread to debug this. Sorry, I thought there was another new issue opened regarding this.

philknows reopened this on Aug 2, 2024