v1.10: 'solana-replay-stage' panicked at 'slot=XXX must exist in ProgressMap', core/src/progress_map.rs:576:32 #26069
What is the expected behavior if the slot is not in ProgressMap?
Testnet is currently throwing a whole bunch of similar errors. From a brief examination, it seems that this error only occurs when the validator has crashed for a different reason beforehand. One of the testnet validators I maintain (on version 1.10.27) was also affected: it crashed 4 times, 2x with the QUIC panic and 2x with this ProgressMap panic. The crash might corrupt the ledger somehow, so that the validator can't resume cleanly? I had to delete the ledger and start from a trusted snapshot to recover. I have retained the ledger + logs, in case you need anything. Because I was asked on Discord by @steviez: normally I get 1-2.

Log excerpts: the full log is ~30GB in size, so here are some excerpts from my attempts to debug.
Thank you for posting @tlambertz - we haven't seen this on any of our nodes (AFAIK) and have been trying to get more details. I think the QUIC panics are understood and others have corrective PRs (and backports) on the way to 1.10 (#26272 and #26073); these two will likely prompt a release shortly after they land.
❤️ thank you! I'm trying to figure out if we can harden against the corruption / prevent the scenario that got us to the panic in the first place. On the other hand, if you're willing and could easily share the 30 GB log file, I'd be happy to download a copy so I don't have to bug you.
@godmodegalactus - In this scenario, we think the slot should absolutely be in the ProgressMap. The fact that it isn't means our assumptions about state aren't valid / some state is corrupted, hence the hard panic.
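For readers following along, here is a minimal sketch of the kind of invariant-style lookup that produces this panic; `ForkProgress` and the method shown are illustrative placeholders, not the actual `core/src/progress_map.rs` code:

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the per-slot replay state the validator keeps.
struct ForkProgress {
    is_dead: bool,
}

// Hypothetical stand-in for ProgressMap: slot number -> replay progress.
struct ProgressMap {
    map: HashMap<u64, ForkProgress>,
}

impl ProgressMap {
    // The lookup treats the slot's presence as an invariant: if replay has
    // reached this slot, an entry must already have been inserted. A missing
    // entry means internal state is inconsistent, so the code aborts loudly
    // instead of limping along with corrupted state.
    fn is_dead(&self, slot: u64) -> bool {
        self.map
            .get(&slot)
            .unwrap_or_else(|| panic!("slot={slot} must exist in ProgressMap"))
            .is_dead
    }
}

fn main() {
    let progress = ProgressMap { map: HashMap::new() };
    // Panics with the same style of message seen in the issue title.
    let _ = progress.is_dead(138775147);
}
```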
Do you have a nice way to share huge files? I've uploaded it to a Hetzner storage box and added your SSH keys (from https://github.com/steviez.keys); you should be able to pull the log with:

The full ledger is uploading right now, but that'll take at least 2h, likely 5-10h. When it's done, you can pull with:
Feel free to add other SSH keys if you need to (unfortunately there is no file editor on that storage box, but you can push a new authorized_keys file: https://docs.hetzner.com/robot/storage-box/backup-space-ssh-keys).
I saw that in @tlambertz's logs, this line appears immediately before the crash:
While these are the lines that mention the slot the validator got stuck on:
This slot, 138775147, is a descendant of 138774900. So I believe what happened here is that the PoH start slot got purged from the progress map. But I am a little bit confused how this can happen, because the dump_then_repair function usually checks whether the slot is an ancestor of the current PoH slot and skips purging if that is the case. Perhaps the ancestors got corrupted?
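To make the suspected guard concrete, here is a hedged sketch of the check described above; the names are purely illustrative, not the actual `dump_then_repair` implementation:

```rust
use std::collections::{HashMap, HashSet};

// Before dumping/purging a slot that mismatches the cluster, check whether it
// is an ancestor of the slot PoH is currently building on, and skip it if so.
fn should_purge(
    candidate_slot: u64,
    poh_slot: u64,
    // For each slot, the set of its ancestor slots.
    ancestors: &HashMap<u64, HashSet<u64>>,
) -> bool {
    let is_ancestor_of_poh = ancestors
        .get(&poh_slot)
        .map(|a| a.contains(&candidate_slot))
        .unwrap_or(false);
    // Purging an ancestor of the active PoH slot would strand the fork the
    // node is building on (the scenario suspected in this crash), so skip it.
    !is_ancestor_of_poh
}

fn main() {
    let mut ancestors = HashMap::new();
    ancestors.insert(138775147u64, HashSet::from([138774900u64]));
    // 138774900 is an ancestor of PoH slot 138775147, so it must not be purged.
    assert!(!should_purge(138774900, 138775147, &ancestors));
}
```

If the ancestors map itself were corrupted, the guard above would wrongly report "not an ancestor" and allow the purge, which matches the confusion in this comment.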
Ah, the log lines above were not all the lines mentioning the crashing slot.
Could you also please add me?
I have gone through the logs and included relevant pieces below; I focused on the first 2 panics, as the validator was in a bad state after this point. The log statements show that the node was unable to properly align with cluster consensus after the panic. I will be downloading the ledger that tlambertz provided and attempting to reproduce the deviant behavior / compare against a snapshot from another source, such as our warehouse node. Summary of the logs below:
I can recreate the invalid hash with:

A few notes:
A few updates:
From the two above, it seems that there is some state that isn't being captured in the snapshot, or that the re-establishment of state from the snapshot isn't working correctly. After brainstorming with @brooksprumo, I'm now going to try to directly compare banks: one from the original full snapshot with replay, and one from loading the snapshot directly.
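As a rough illustration of that plan, here is a minimal sketch of a field-by-field bank comparison; `BankSummary` and its fields are hypothetical placeholders, not the real runtime `Bank` API:

```rust
// Summaries of the same slot reconstructed two ways: full snapshot + replay
// vs. loading the later snapshot directly.
#[derive(Debug, PartialEq, Clone)]
struct BankSummary {
    slot: u64,
    bank_hash: String,
    accounts_delta_hash: String,
}

// Report which fields diverge, to narrow down what state is missing from
// (or wrongly restored out of) the snapshot.
fn diff_banks(from_replay: &BankSummary, from_snapshot: &BankSummary) -> Vec<String> {
    let mut diffs = Vec::new();
    if from_replay.bank_hash != from_snapshot.bank_hash {
        diffs.push(format!(
            "bank_hash: replay={} snapshot={}",
            from_replay.bank_hash, from_snapshot.bank_hash
        ));
    }
    if from_replay.accounts_delta_hash != from_snapshot.accounts_delta_hash {
        diffs.push(format!(
            "accounts_delta_hash: replay={} snapshot={}",
            from_replay.accounts_delta_hash, from_snapshot.accounts_delta_hash
        ));
    }
    diffs
}

fn main() {
    let replayed = BankSummary {
        slot: 138773288,
        bank_hash: "AAAA".into(),
        accounts_delta_hash: "BBBB".into(),
    };
    let loaded = BankSummary { bank_hash: "CCCC".into(), ..replayed.clone() };
    for d in diff_banks(&replayed, &loaded) {
        println!("mismatch -> {d}");
    }
}
```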
sort of: |
I figured this out this past weekend; a PR is on the way shortly after this comment. The issue is related to lamports_per_signature not being written into snapshots by v1.10 nodes. This is exactly what happened with the incremental snapshot slot (138773288) that tlambertz provided me, as some logging shows:
So, with this in mind, we can fully explain the sequence of events that tlambertz observed:
The lamports_per_signature PR was backported to v1.10, but only the read aspect of it. That is, snapshots that contained the field (i.e. from a v1.11 node) could be read. However, the write aspect (storing the value in created snapshots) wasn't backported. Backporting that aspect is the fix here.
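Here is a minimal sketch of the read/write asymmetry being described, assuming a made-up `FeeStateLike` struct and a hand-rolled byte layout rather than the real snapshot format:

```rust
// The reader tolerates an optional trailing field (so snapshots written by
// newer nodes can be loaded), but the old writer never emits it, so a
// non-default lamports_per_signature is silently dropped when this node
// writes its own snapshot. All names and layouts here are illustrative.
#[derive(Debug, PartialEq)]
struct FeeStateLike {
    target_lamports_per_signature: u64,
    // Newer field: the congestion-adjusted value persisted by newer nodes.
    lamports_per_signature: u64,
}

fn read_u64(bytes: &[u8]) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&bytes[..8]);
    u64::from_le_bytes(buf)
}

// "Read aspect" (was backported): accept the trailing field if present.
fn read(bytes: &[u8]) -> FeeStateLike {
    FeeStateLike {
        target_lamports_per_signature: read_u64(&bytes[0..8]),
        lamports_per_signature: if bytes.len() >= 16 { read_u64(&bytes[8..16]) } else { 0 },
    }
}

// "Write aspect" before the fix: the new field is never written, so it is
// reset to the default across a snapshot round trip.
fn write_old(state: &FeeStateLike) -> Vec<u8> {
    state.target_lamports_per_signature.to_le_bytes().to_vec()
}

// "Write aspect" after the fix: persist the field so read(write(x)) == x.
fn write_fixed(state: &FeeStateLike) -> Vec<u8> {
    let mut out = state.target_lamports_per_signature.to_le_bytes().to_vec();
    out.extend_from_slice(&state.lamports_per_signature.to_le_bytes());
    out
}

fn main() {
    let state = FeeStateLike { target_lamports_per_signature: 10_000, lamports_per_signature: 12_500 };
    assert_ne!(read(&write_old(&state)), state);   // elevated fee value lost on round trip
    assert_eq!(read(&write_fixed(&state)), state); // round-trips once the writer is fixed
}
```

In this framing, the fix is simply making the writer emit the same optional trailing field the reader already understands, so the congestion-adjusted fee value survives the snapshot round trip and the recomputed bank hash matches the cluster.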
wao! mb hit congestion fees? |
Technically, all of the actual debugging and reproduction was done on a testnet snapshot + ledger, and I can only definitively say that we hit the elevated value on testnet. The initial issue report that was transcribed from Discord failed in the same way, but it is possible that something else could have caused the consensus deviation ... not enough logs provided to know with certainty |
bench-tps hits it on testnet pretty often. Uncovered a similar bug for fee_calculator storage in durable nonce accounts.
The following panic was reported on v1.10.25 (src:d64f6808; feat:965221688) by a mainnet validator: