Lodestar stopped keeping up with head on mainnet #6861
Two observations from the logs:
Jun-06 15:41:41.000[] info: Synced - slot: 9238706 - head: 0x4901…b19e - exec-block: syncing(20033746 0x9174…) - finalized: 0x6724…7027:288707 - peers: 66
Jun-06 15:41:29.000[] info: Synced - slot: 9238705 - head: 0xf8f2…6ddd - exec-block: syncing(20033745 0x538c…) - finalized: 0x6724…7027:288707 - peers: 62
Jun-06 15:41:17.000[] info: Synced - slot: 9238704 - head: 0x165e…6111 - exec-block: syncing(20033744 0x40c5…) - finalized: 0x6724…7027:288707 - peers: 67
Jun-06 15:41:05.000[] info: Synced - slot: 9238703 - head: 0x1a85…16df - exec-block: syncing(20033743 0x52c6…) - finalized: 0x6724…7027:288707 - peers: 68
From the logs, it's hard to tell why it started to fall behind. It's not a general issue for this block as my node is still synced; it might require debug logs to figure out what's going on. If your node is still not back to head, you might wanna force a checkpoint sync by passing --checkpointSyncUrl <url> --forceCheckpointSync
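For reference, a forced checkpoint sync restart might look roughly like the sketch below. Only the two flags come from the comment above; the checkpoint URL is an example placeholder.

```bash
# Sketch only: restart the beacon node with a forced checkpoint sync.
# The checkpoint URL is an example; use any beacon API provider you trust.
./lodestar beacon \
  --network mainnet \
  --checkpointSyncUrl https://beaconstate.info \
  --forceCheckpointSync
```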
Hey @nflaig. We've had this happen on two nodes in our Lido clusters. Yes, we run multiple CL:EL pairs; the VC is Vouch. Lodestar started syncing after we restarted it. We didn't want to wait, so we checkpoint synced to get back on track. So this is definitely a thing; it happened on different blocks on two nodes.
We are running Lodestar 1.16.0 and having the exact same issue. We will upgrade soon, but it looks like this is still an issue on newer versions. This issue is fairly devastating for our use case as it increases bridging latency. It also occurs silently, with no error or warning in the logs. For now, we have a script that compares Lodestar's finalized slot with the finalized slot on Beaconcha.in and restarts the node if it falls too far behind, but this obviously needs to be resolved at the root cause.
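A minimal sketch of that kind of watchdog, assuming the standard beacon API on Lodestar's default port 9596, the public Beaconcha.in epoch endpoint, `jq` being installed, and a systemd unit named `lodestar-beacon`; the actual script may differ:

```bash
#!/usr/bin/env bash
# Watchdog sketch: restart Lodestar if its finalized epoch lags Beaconcha.in.
set -euo pipefail

LOCAL_EPOCH=$(curl -s http://localhost:9596/eth/v1/beacon/states/finalized/finality_checkpoints \
  | jq -r '.data.finalized.epoch')
REMOTE_EPOCH=$(curl -s https://beaconcha.in/api/v1/epoch/finalized \
  | jq -r '.data.epoch')

THRESHOLD=2  # allowed lag in epochs before restarting
if (( REMOTE_EPOCH - LOCAL_EPOCH > THRESHOLD )); then
  echo "Lodestar finalized epoch ${LOCAL_EPOCH} lags remote ${REMOTE_EPOCH}, restarting"
  systemctl restart lodestar-beacon  # assumed unit name
fi
```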
Hey all, when this happens next time, would you be able to capture and send the relevant debug logs for us to take a look at? Please also send us the hardware that you're running on to help augment those logs. We need to compile more information here on where to start digging since it seems intermittent and we don't have a clear indicator yet on what might be causing this. Any data would be helpful! Thanks.
We have one node, lido3-c, with debug logs enabled. We're waiting for that one in particular to show the issue so we can get you debug logs. We have since updated Lodestar to 1.19.0 as well, so it's possible we won't see the issue again.
the default log level to file is |
Ah, OK. Thanks. I see beacon-timestamp.log in the datadir. We'll get you those the next time we see a failure.
Thanks @philknows, I'll do the same. My file logging was set to
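If it helps, raising the file log verbosity is done via CLI flags; the flag names below are from memory of the Lodestar options and should be double-checked against the docs.

```bash
# Assumed flag names; verify against `lodestar beacon --help`.
./lodestar beacon \
  --network mainnet \
  --logFile /data/lodestar/beacon.log \
  --logFileLevel debug
```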
Possibly adjacent issue: one of our Lodestars would not catch up. Logs attached.
@yorickdowne it looks like your EL may have been down during that time?
That error happens when we try to verify execution payloads of blocks that we're syncing. Somewhat unrelated, one thing we're doing that could be improved: we throw away the batches of blocks we've downloaded when that error happens, and then we have to re-download the batches again and again. This isn't helpful, as we then keep getting rate-limited, which causes us to really churn through peers.
Yes, I believe the EL had an issue. When restarting the stack, the EL did not shut down cleanly and was reaped by Docker after the 5m timeout.
@nflaig It happened again a few times in the last few days. I am attaching today's log. We have a script that checks if Lodestar falls more than an epoch behind and then restarts the node. The node was restarted at 06:21, so I would assume the issue is just before that.
I have upgraded to Lodestar 1.19.0. Will see if the issue still occurs. |
@claravanstaden Thanks for the logs
Looking at the logs around the time the node was restarted, I don't see any sync issues; the last log right before is
The imported chain head seems to be the canonical block as well
How is your script determining that the node is behind by one epoch? Another observation from the logs is that the node is really struggling to keep a high peer count.
@yorickdowne The EL was down for quite some time, and Lodestar does not import blocks in that case, which causes it to fall behind. This behavior seems consistent with other clients. What would be concerning is if the node can't get back to head after the EL is online again; I tried this a few times, but the node is always able to get back to head quite quickly. The logs are a bit confusing, though, as it keeps printing Synced. And we are pretty silent about the EL being offline after the first error; we might wanna warn about this at least every epoch, which would make it clearer why it's not keeping up with head.
I think you have it: The issue here was that Lodestar did not catch back up when the EL was back online. Because it’s so intermittent, it’s hard to track down. We haven’t seen it in a while - it’s entirely possible that “whatever it was” is now either resolved or much less likely since 1.19 |
@nflaig I have seen the same issue with Lodestar 1.19.0. Adding some logs here, with restart timestamps: Jun 28 12:48 beacon-2024-06-28.log.gz
I compare it with the finalized epoch on Beaconcha.in. I have changed the script to restart if it is more than 2 epochs behind. Maybe that will give a clearer indication, and perhaps we'll see fewer restarts too. I can't increase the latency more, otherwise it will influence bridge transfer times.
Interesting, maybe that could be it?
I have looked at all three logs, and right before the restart the node was synced; the only time I can see the node syncing is right after the restart, which is expected. How do you determine what the finalized epoch is in Lodestar? Maybe there is a bug somewhere else, but the node is definitely not falling behind based on the logs; the head is correct before the restart. It might be useful to track which epoch is flagged by your tooling to restart Lodestar, then we can look at that epoch specifically.
No, the sync is still fine and the node is keeping up. I'm just not seeing that on any of my nodes; peer count is usually between 100-105. I am not sure how much impact it has if the EL is reporting syncing on peering. But if you don't run the node with attached validators, it's not that important to have stable peers as long as the node is running fine and keeping up.
@nflaig thanks for the feedback. I'll keep an eye on it and let you know if I see it restarting after falling behind 2 epochs. Maybe the 1 epoch was too aggressive and it was restarting without there being a problem.
I am using this endpoint:
Will note this down if it happens again.
I am running without an EL node:
@nflaig it happened again last night, with the setting changed to allow a 2-epoch difference. Please let me know if you can see any issue. 1 July 2024 22:48
@claravanstaden I still need to review our code more but my guess right now is that it's an issue with the light client server. Looking at the timestamps of your restarts, this shows an interesting pattern
Looks like the restart happens every ~27 hours, which is the sync committee period. My suspicion is that we are not updating the finalized update in the light client server when the sync committee changes. Possibly related to that, there are a lot of
If you set it to 3 epochs it would not happen, I think. But anyways, restarting Lodestar is pointless here; if you wait 3 epochs, it will report the correct finality update again from the new sync committee period.
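For context on the ~27 hour figure: a mainnet sync committee period is 256 epochs, and with 32 slots per epoch at 12 seconds per slot that works out as follows.

```bash
# Sync committee period length on mainnet:
# 256 epochs * 32 slots/epoch * 12 s/slot = 98304 s ≈ 27.3 hours
echo $(( 256 * 32 * 12 ))          # 98304 seconds
echo "scale=1; 98304 / 3600" | bc  # ~27.3 hours
```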
@nflaig Without the script, the actual block syncing of Lodestar would fall behind until it was restarted again. Sometimes it would be days behind. So from what I have seen, it's not just the light client server. I could also change the script to use the standard beacon API instead of the light client endpoints. Let me do that.
This is not clear from the logs; the node was synced to head when it was restarted. It would be best to use the head slot from the node's own API. Looking at https://lodestar-mainnet.chainsafe.io/eth/v1/beacon/light_client/finality_update from our mainnet node, this seems to be correct, and the node has been running for weeks.
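For what it's worth, the same finality update can be queried against a local node too; a minimal check, assuming the standard light client API response shape and the default REST port:

```bash
# Print the finalized slot carried by the light client finality update
# (assumed response shape: .data.finalized_header.beacon.slot).
curl -s http://localhost:9596/eth/v1/beacon/light_client/finality_update \
  | jq '.data.finalized_header.beacon.slot'
```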
I am a bit skeptical about this. I have been running multiple Lodestar nodes myself for years, and the only time so far any of my nodes fell behind was when my EL database got corrupted and the EL went offline. Not saying it's impossible, but from my experience it's unlikely. Maybe it's related to the engine mock (although I don't see the correlation), but for use cases outside of development or testing, I would recommend running an execution client. I have opened an issue to track the light client server problem: #6932
@nflaig I have changed the script to use the main beacon API endpoint. It hasn't restarted the service in a week now, so I'd say it is safe to close this issue, thank you. It looks like the issue was resolved in Lodestar 1.19.0. Thanks again for the assistance!
Thanks for testing. Let's close this and reopen another issue if needed.
Hey @nflaig, I came across this issue again yesterday. I changed our restart script to compare https://beaconcha.in/api/v1/epoch/finalized with http://localhost:9596/eth/v1/beacon/states/finalized/finality_checkpoints (using …). When the script restarted Lodestar, this was the state: … Please let me know if you see anything in the logs. Lodestar was restarted on Jul 30 13:24.
I haven't looked into the logs yet, but if you wanna use … Regarding the diff of 4 epochs, it depends on when you queried it and when Lodestar vs. Beaconcha.in updated the finalized epoch. It should happen roughly at the same time but might not be exactly the same. I don't think comparing against external data is a good approach in general; Lodestar itself already knows whether it's synced, so rather use …
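The endpoint being suggested is cut off above; assuming it is the standard beacon node syncing status endpoint (an assumption on my part), a purely local check could look like this:

```bash
# Assumed endpoint: standard beacon node syncing status.
curl -s http://localhost:9596/eth/v1/node/syncing | jq .

# Sketch of gating a restart on sync_distance (slots behind the peers' head).
DISTANCE=$(curl -s http://localhost:9596/eth/v1/node/syncing | jq -r '.data.sync_distance')
if (( DISTANCE > 64 )); then  # 64 slots = 2 epochs
  echo "node is ${DISTANCE} slots behind head"
fi
```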
This might be true but for 99% of the time, these two values are the same or the difference is 1 epoch.
A 6-epoch difference indicates a major gap, though, not just a timing issue.
I disagree with this because the network should have the same view of what the current finalized block is. I can change this script to keep track of the head slot, but please let me know if you see anything in the logs that could have caused this.
Ah right, it's 6 epochs. Based on the logs, your node had 2-3 peers most of the time and then at some point dropped down to 0 and was no longer able to sync, so that explains the disparity.
A node usually has ~100 peers, so it has a pretty good view of the network state based on data from other peers. It's a bit hard to tell why your node is struggling so much to keep stable peers; it could be many reasons, like low bandwidth. What's notable from the logs is that it has 0 inbound peers. Did you open port 9000 (udp/tcp)? Usually a healthy node will have mostly inbound peers after a while. It would be good to collect some metrics to get a broader view of how the node is performing over time.
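For reference, a quick way to open and verify the default p2p port, assuming a ufw-based host (substitute your own firewall rules or Docker port mapping):

```bash
# Open Lodestar's default p2p port for both TCP and UDP (ufw assumed).
sudo ufw allow 9000/tcp
sudo ufw allow 9000/udp
# Confirm the beacon node is actually listening on it.
ss -lntu | grep 9000
```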
@nflaig Thanks for getting back to me. OK, yeah, I can look at port 9000, but I think what concerns me is why Lodestar does not find new peers to connect to, and why a restart fixes it.
Not sure what's going on, but I would also suggest running a proper EL client; otherwise your node will forward any block, even if it's invalid, which will get it banned by other peers. On restart, it gets assigned a new peer id, which could solve this, but I am not deep enough into peer scoring / networking to give you proper reasoning on why this is happening.
We'll keep using this thread to debug this. Sorry, I thought there was another new issue opened regarding this.
Describe the bug
Lodestar stopped keeping up with head on mainnet after block ~19972650
Expected behavior
To keep up with head
Steps to reproduce
No response
Additional context
consensus.log.gz
Operating system
Linux
Lodestar version or commit hash
v1.18.1/eefb711