
v1.10: 'solana-replay-stage' panicked at 'slot=XXX must exist in ProgressMap', core/src/progress_map.rs:576:32 #26069

Closed
mvines opened this issue Jun 19, 2022 · 17 comments · Fixed by #26394

@mvines (Member) commented Jun 19, 2022

The following panic was reported on v1.10.25 (src:d64f6808; feat:965221688) by a mainnet validator:

Jun 19 14:55:13 mainnet solana-validator[2238461]: [2022-06-19T14:55:13.036263075Z WARN  solana_core::replay_stage] purging slot 138158376
Jun 19 14:55:13 mainnet solana-validator[2238461]: [2022-06-19T14:55:13.065577144Z WARN  solana_core::replay_stage] purging descendant: 138158378 with slot_id 508786, of slot 138158376
Jun 19 14:55:13 mainnet solana-validator[2238461]: [2022-06-19T14:55:13.067850205Z WARN  solana_core::replay_stage] purging descendant: 138158377 with slot_id 508785, of slot 138158376
Jun 19 14:55:13 mainnet solana-validator[2238461]: [2022-06-19T14:55:13.069361795Z WARN  solana_core::replay_stage] purging descendant: 138158379 with slot_id 508788, of slot 138158376
Jun 19 14:55:13 mainnet solana-validator[2238461]: [2022-06-19T14:55:13.072372591Z WARN  solana_core::replay_stage] purging descendant: 138158376 with slot_id 508783, of slot 138158376
Jun 19 14:55:13 mainnet solana-validator[2238461]: [2022-06-19T14:55:13.073289748Z WARN  solana_core::replay_stage] Notifying repair service to repair duplicate slot: 138158376
Jun 19 14:55:13 mainnet solana-validator[2238461]: thread 'solana-replay-stage' panicked at 'slot=138158379 must exist in ProgressMap', core/src/progress_map.rs:576:32
Jun 19 14:55:13 mainnet solana-validator[2238461]: stack backtrace:
Jun 19 14:55:13 mainnet solana-validator[2238461]:    0: rust_begin_unwind
Jun 19 14:55:13 mainnet solana-validator[2238461]:              at ./rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:498:5
Jun 19 14:55:13 mainnet solana-validator[2238461]:    1: core::panicking::panic_fmt
Jun 19 14:55:13 mainnet solana-validator[2238461]:              at ./rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/panicking.rs:116:14
Jun 19 14:55:13 mainnet solana-validator[2238461]:    2: solana_core::progress_map::ProgressMap::get_leader_propagation_slot_must_exist
Jun 19 14:55:13 mainnet solana-validator[2238461]:    3: solana_core::replay_stage::ReplayStage::retransmit_latest_unpropagated_leader_slot
Jun 19 14:55:13 mainnet solana-validator[2238461]:    4: solana_core::replay_stage::ReplayStage::new::{{closure}}
Jun 19 14:55:13 mainnet solana-validator[2238461]: note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Jun 19 14:55:13 mainnet solana-validator[2238461]: [2022-06-19T14:55:13.099064364Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="solana-replay-stage" one=1i message="panicked at 'slot=138158379 must exist in ProgressMap', core/src/progress_map.rs:576:32" location="core/src/progress_map.rs:576:32" version="\"1.10.25 (src:d64f6808; feat:965221688)\""
Jun 19 14:55:26 mainnet systemd[1]: solana.service: Main process exited, code=killed, status=11/SEGV
@godmodegalactus (Contributor)

What is the expected behavior if the slot is not in ProgressMap?

@tlambertz

Testnet is currently throwing a whole bunch of similar errors.
Cluster Telemetry (https://metrics.solana.com:3000/d/monitor-beta/cluster-telemetry-beta?orgId=1&refresh=5m&var-datasource=InfluxDB-testnet&var-testnet=tds&var-hostid=All&viewPanel=28&from=now-90d&to=now CSV export) shows them for many validators.

[screenshot: Cluster Telemetry chart showing the panic reported across many testnet validators]

From a brief examination it seems that this error only occurs when the validator crashed for a different reason beforehand.

One of the testnet validators I maintain (on version 1.10.27) was also affected.

It crashed 4 times: 2x with the QUIC panic, 2x with this ProgressMap panic.

The crash might corrupt the ledger somehow, so that the validator can't resume cleanly?
After the 4 crashes, my validator stopped advancing roots. It was seeing new optimistic confirmations though.
It also kept forking, but always based on a stuck root.
The log shows an unusual amount of dropped vote VoteStateUpdate spam and fills at ~500 MB/min.

I had to delete the ledger and start from a trusted snapshot to recover. I have retained the ledger + logs in case you need anything.

Because I was asked on Discord by @steviez: normally I get 1-2 shred_insert_is_full logs per second. I also got one for the root the validator got stuck on, as well as for the two slots the ProgressMap panic occurred on.

Log excerpts

The full log is ~30GB in size, so here are some excerpts from my attempts to debug.
Let me know if you need anything else.

14:17:04 latest slot_stats_tracking_complete slot=138773557i is_rooted=true
14:17:33 latest: datapoint: tower-observed slot=138774295i root=138773759i
14:17:33 latest: datapoint: tower-vote latest=138774303i root=138773761i
14:17:34 latest optimistic slot log: `datapoint: optimistic_slot slot=138774303i`
14:17:36 latest new fork:138774312 parent:138774303 root:138773761
14:17:36 first quic panic

-- validator restart --

-- after this point, there are NO advancing "datapoint: tower-observed" logs, even over multiple restarts.
   Stuck on slot=138773288i root=138772909i

14:18:45 `dropped vote VoteStateUpdate {[...]} failed to match hash [..] [..]` spam starts
14:20:21 first ProgressMap panic, slot 138774303 (the latest pre-crash optimistic slot)

-- validator restart --
14:28:26 latest optimistic slot log: "datapoint: optimistic_slot slot=138775147i"
14:28:29 second quic panic

-- validator restart --
14:31:14 "bank frozen: 138775147"
14:31:15 "Starting validator with working bank slot 138775147"
14:31:44 "thread 'solana-replay-stage' panicked at 'slot=138775147 must exist in ProgressMap', core/src/progress_map.rs:576:32"

-- validator restart --
lots of optimistic confirmations, but no roots.
there are attempts to fork, but the root always stays at 138773761
the validator keeps freezing banks, and last_blockhash matches the cluster, e.g.
[2022-06-27T15:11:43.670636187Z INFO  solana_runtime::bank] bank frozen: 138778673 hash: HruSEcqsJtXqCqc3NDg9ZinnnXHSHANZz4d5DJ5D5Kmb accounts_delta: 68ggacFjKwwqxnEV9FoW3nqdHBhYJzCp8hfWP8BzFKuy signature_count: 3231 last_blockhash: 9n6dhiFW9QAKtNoktAzgBLq8YNCkXBHAUpQGdJyqne11 capitalization: 1503173458770019583

17046921:thread 'tokio-runtime-worker' panicked at 'called `Option::unwrap()` on a `None` value', streamer/src/nonblocking/quic.rs:420:65
17046942:[2022-06-27T14:17:36.766382216Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="tokio-runtime-worker" one=1i message="panicked at 'called `Option::unwrap()` on a `None` value', streamer/src/nonblocking/quic.rs:420:65" location="streamer/src/nonblocking/quic.rs:420:65" version="\"1.10.27 (src:8a9f5962; feat:1425680972)\""

18504307:thread 'solana-replay-stage' panicked at 'slot=138774303 must exist in ProgressMap', core/src/progress_map.rs:576:32
18504317:[2022-06-27T14:20:21.776152518Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="solana-replay-stage" one=1i message="panicked at 'slot=138774303 must exist in ProgressMap', core/src/progress_map.rs:576:32" location="core/src/progress_map.rs:576:32" version="\"1.10.27 (src:8a9f5962; feat:1425680972)\""

20136751:thread 'tokio-runtime-worker' panicked at 'called `Option::unwrap()` on a `None` value', streamer/src/nonblocking/quic.rs:420:65
20136772:[2022-06-27T14:28:29.587161091Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="tokio-runtime-worker" one=1i message="panicked at 'called `Option::unwrap()` on a `None` value', streamer/src/nonblocking/quic.rs:420:65" location="streamer/src/nonblocking/quic.rs:420:65" version="\"1.10.27 (src:8a9f5962; feat:1425680972)\""

22307659:thread 'solana-replay-stage' panicked at 'slot=138775147 must exist in ProgressMap', core/src/progress_map.rs:576:32
22307669:[2022-06-27T14:31:45.446569801Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="solana-replay-stage" one=1i message="panicked at 'slot=138775147 must exist in ProgressMap', core/src/progress_map.rs:576:32" location="core/src/progress_map.rs:576:32" version="\"1.10.27 (src:8a9f5962; feat:1425680972)\""

@steviez (Contributor) commented Jun 27, 2022

Thank you for posting @tlambertz - we haven't seen this on any of our nodes (AFAIK) and have been trying to get more details.

I think the QUIC panics are understood and others have corrective PRs (and backports) on the way to 1.10 (#26272 and #26073); these two will likely prompt a release shortly after they land.

I have retained ledger + logs, in case you need anything.

❤️ thank you! I'm trying to figure out if we can harden against the corruption / prevent the scenario that got us to the must exist in ProgressMap error. Panics obviously aren't good, but it'd be better if we could recover from these situations without manual intervention like wiping the ledger. I need to poke around the code a little more, but then I might ask some followup questions about things in / not in the log.

On the other hand, if you're willing and could easily share the 30 GB log file, I'd be happy to download a copy so I don't have to bug you.

@steviez (Contributor) commented Jun 27, 2022

What is the expected behavior if the slot is not in ProgressMap?

@godmodegalactus - In this scenario, we think the slot should absolutely be in the ProgressMap. The fact that it isn't means our assumptions about state aren't valid / some state is corrupted, hence the hard panic.
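For context on the question above, here is a minimal sketch of the "must exist" lookup pattern, using hypothetical simplified types rather than the real solana_core::progress_map API: the caller asserts the invariant that the slot is tracked, so a missing entry is treated as corrupted state and becomes a hard panic instead of a recoverable Option.

```rust
use std::collections::HashMap;

type Slot = u64;

// Hypothetical, simplified stand-ins for the real ForkProgress / ProgressMap.
struct ForkProgress {
    propagated_slot: Option<Slot>,
}

struct ProgressMap {
    map: HashMap<Slot, ForkProgress>,
}

impl ProgressMap {
    // Fallible lookup: returns None when the slot simply isn't tracked.
    fn get_leader_propagation_slot(&self, slot: Slot) -> Option<Option<Slot>> {
        self.map.get(&slot).map(|p| p.propagated_slot)
    }

    // Invariant-asserting lookup: a missing slot is treated as a bug, which
    // is what produces the "slot=XXX must exist in ProgressMap" message.
    fn get_leader_propagation_slot_must_exist(&self, slot: Slot) -> Option<Slot> {
        self.get_leader_propagation_slot(slot)
            .unwrap_or_else(|| panic!("slot={} must exist in ProgressMap", slot))
    }
}
```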

@tlambertz

Do you have a nice way to share huge files? I've uploaded it to a Hetzner storage box and added your SSH keys (from https://github.com/steviez.keys).

You should be able to pull the log with rsync --progress -e 'ssh -p23' --recursive u308574@u308574.your-storagebox.de:ledger/validator-log.log log.log

The full ledger is uploading right now, but that'll take at least 2h, likely 5-10h. When it's done you can pull with

rsync --progress -e 'ssh -p23' --recursive u308574@u308574.your-storagebox.de:ledger ledger

Feel free to add other SSH keys if you need to (unfortunately there is no file editor on that storage box, but you can push a new authorized_keys file: https://docs.hetzner.com/robot/storage-box/backup-space-ssh-keys)

@bennofs (Contributor) commented Jun 28, 2022

I saw that in @tlambertz's logs, this line appears immediately before the crash:

[2022-06-27T14:31:45.409639624Z WARN  solana_core::replay_stage] purging slot 138774900  

While these are the lines that mention the slot the validator got stuck on:

14:28:26 "datapoint: optimistic_slot slot=138775147i"
14:28:29 validator crash on quic
-- validator restart --
14:31:14 "bank frozen: 138775147"
14:31:15 "Starting validator with working bank slot 138775147"
14:31:44 "thread 'solana-replay-stage' panicked at 'slot=138775147 must exist in ProgressMap', core/src/progress_map.rs:576:32"

This slot, 138775147, is a descendant of 138774900. So I believe what happened here is that the PoH start slot got purged from the progress map. But I am a little bit confused about how this can happen, because usually the dump_then_repair function checks whether the slot is an ancestor of the current PoH slot and skips purging if that is the case. Perhaps the ancestors got corrupted?
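For reference, a rough sketch of the guard described here, with a hypothetical should_purge helper rather than the real dump_then_repair signature: purging a duplicate slot is meant to be skipped when that slot is an ancestor of the slot PoH is currently building on, so a stale or corrupted ancestors map could let a still-needed slot be purged.

```rust
use std::collections::{HashMap, HashSet};

type Slot = u64;

// Hypothetical helper: decide whether a duplicate slot may be purged.
// `ancestors` maps each slot to the set of its ancestor slots.
fn should_purge(
    duplicate_slot: Slot,
    poh_slot: Slot,
    ancestors: &HashMap<Slot, HashSet<Slot>>,
) -> bool {
    let is_ancestor_of_poh_slot = ancestors
        .get(&poh_slot)
        .map(|set| set.contains(&duplicate_slot))
        .unwrap_or(false);
    // If this check wrongly passes (e.g. because the ancestors map is stale),
    // a slot the working bank still depends on gets purged, and a later
    // ProgressMap lookup for it panics.
    !is_ancestor_of_poh_slot
}
```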

@bennofs (Contributor) commented Jun 28, 2022

Ah, the log lines above were not all the lines mentioning the crashing slot.
The slot definitely got purged from the progress map before:

[2022-06-27T14:31:44.877646234Z WARN  solana_core::replay_stage] purging descendant: 138775147 with slot_id 1044, of slot 138774901
thread 'solana-replay-stage' panicked at 'slot=138775147 must exist in ProgressMap', core/src/progress_map.rs:576:32

@godmodegalactus (Contributor)

(quoting @tlambertz's file-sharing instructions above)

Could you also please add me?

@steviez (Contributor) commented Jun 28, 2022

I have gone through the logs and included relevant pieces below; I focused on the first 2 panics, as the validator was in a bad state after this point. The log statements show that the node was unable to properly align with cluster consensus after the panic. I will be downloading the ledger that tlambertz provided and attempting to reproduce the deviant behavior / compare against a snapshot from another source such as our warehouse node.

Summary of logs below:

  • The validator hits the QUIC unwrap panic - the node looked to be healthy up to this point
  • Upon restart from the panic, the bank hash for the incremental snapshot slot matches the result prior to the panic
  • The bank hash for the slot immediately after the incremental snapshot slot deviates from the result prior to the panic
  • From that point, all subsequent bank hashes deviate from the replay results prior to the panic
  • The validator eventually hits the second panic, restarts, and shows the same deviant behavior as after the first restart
#### First frozen bank in log - only listed to show how far back logs go
[2022-06-27T07:00:05.644267906Z INFO  solana_runtime::bank] bank frozen: 138732234 hash: h3Qrhsmw42zsaWWWLC4FCx6WTpQ2DvQwrCPcwz8WaYk accounts_delta: FkwiT2GkGPdLzAjzZx4AgbFbnYTL131FbCHbY4d7mosZ signature_count: 5412 last_blockhash: 4YdD5uosZByyauAqaTUuaqiQJ6kcffGDaErWf5BjnFvJ capitalization: 1502142775316697072

#### Frozen bank for where latest incremental snapshot taken
[2022-06-27T14:06:13.494071424Z INFO  solana_runtime::bank] bank frozen: 138773288 hash: GKGWXijGzAeHJesRmM3FWy1sqJEHBwQpmjRRxRaWuG9N accounts_delta: 9i1wPaWVco82VQycCaWJAmJ75Kbr3D7jPrrSwBVrVKWd signature_count: 4049 last_blockhash: 34Sr7xQNucjcgbAACnunH2rvjzLFWZ8VdxT65UUmuTWk capitalization: 1503173472978444308
[2022-06-27T14:06:13.494088316Z INFO  solana_runtime::bank] accounts hash slot: 138773288 stats: BankHashStats { num_updated_accounts: 7937, num_removed_accounts: 0, num_lamports_stored: 584568066570500347, total_data_len: 14621377, num_executable_accounts: 0 }

#### First frozen bank after incremental snapshot taken
[2022-06-27T14:06:14.855312155Z INFO  solana_runtime::bank] bank frozen: 138773289 hash: GU4bM6ogSs1WCXLty2ZZMv4oUycmkCoXN3iKZXqxSU4A accounts_delta: Fjd9fwS3duPsMJnL9C2bgyrHrD9ZjVWpw6W5oU9QHhmn signature_count: 3427 last_blockhash: DvSuxyuA8LaTBK8XJv2a4Dsck7Lt3gr5LWWWPdVm1BrL capitalization: 1503173472970177932
[2022-06-27T14:06:14.855343755Z INFO  solana_runtime::bank] accounts hash slot: 138773289 stats: BankHashStats { num_updated_accounts: 5109, num_removed_accounts: 0, num_lamports_stored: 298656174786929275, total_data_len: 6397015, num_executable_accounts: 0 }
[2022-06-27T14:06:16.124748643Z INFO  solana_core::cluster_slot_state_verifier] check_slot_agrees_with_cluster()
            slot: 138773289,
            root: 138772911,
            slot_state_update: DuplicateConfirmed(DuplicateConfirmedState { duplicate_confirmed_hash: GU4bM6ogSs1WCXLty2ZZMv4oUycmkCoXN3iKZXqxSU4A, bank_status: Frozen(GU4bM6ogSs1WCXLty2ZZMv4oUycmkCoXN3iKZXqxSU4A) })

#### Last frozen bank that was rooted
[2022-06-27T14:11:58.133203233Z INFO  solana_runtime::bank] bank frozen: 138773761 hash: 6PJmHZDWRvJ2oMhTC2jqgufrqzTHepes6T2UDtA5CRwr accounts_delta: AJaWXN6d4hApciGtypKK8szTMY57E2jcFjnjtk9BFnRk signature_count: 2446 last_blockhash: A4hXWVCnAWAA91LytgaCgVF5yNf3VSDkExcmVixm9Vsa capitalization: 1503173471040218399
[2022-06-27T14:11:58.133217921Z INFO  solana_runtime::bank] accounts hash slot: 138773761 stats: BankHashStats { num_updated_accounts: 4071, num_removed_accounts: 0, num_lamports_stored: 380487911859993379, total_data_len: 6175916, num_executable_accounts: 0 }

#### Last root
[2022-06-27T14:11:59.495309479Z INFO  solana_core::cluster_slot_state_verifier] check_slot_agrees_with_cluster()
            slot: 138773761,
            root: 138773261,
            slot_state_update: DuplicateConfirmed(DuplicateConfirmedState { duplicate_confirmed_hash: 6PJmHZDWRvJ2oMhTC2jqgufrqzTHepes6T2UDtA5CRwr, bank_status: Frozen(6PJmHZDWRvJ2oMhTC2jqgufrqzTHepes6T2UDtA5CRwr) })
[2022-06-27T14:12:05.613526343Z INFO  solana_core::cluster_slot_state_verifier] check_slot_agrees_with_cluster()
            slot: 138773761,
            root: 138773262,
            slot_state_update: DuplicateConfirmed(DuplicateConfirmedState { duplicate_confirmed_hash: 6PJmHZDWRvJ2oMhTC2jqgufrqzTHepes6T2UDtA5CRwr, bank_status: Frozen(6PJmHZDWRvJ2oMhTC2jqgufrqzTHepes6T2UDtA5CRwr) })
[2022-06-27T14:17:33.628042366Z INFO  solana_core::replay_stage] new root 138773761

#### Incremental snapshot created
[2022-06-27T14:12:26.906032243Z INFO  solana_runtime::snapshot_utils] Successfully created "/home/sol/validator-ledger/incremental-snapshot-138726774-138773288-agCF5RhnoN76u7pHYS761NooAAifWuuy7L4SG7eeZ8a.tar.zst". slot: 138773288, elapsed ms: 5388, size=219743473

#### Last frozen bank that agrees with cluster
[2022-06-27T14:17:34.656480347Z INFO  solana_core::cluster_slot_state_verifier] check_slot_agrees_with_cluster()
            slot: 138774303,
            root: 138773761,
            slot_state_update: DuplicateConfirmed(DuplicateConfirmedState { duplicate_confirmed_hash: HywpywCHEU3Jo4tcbC2xoAay9DTwkbZZG7y5sUw6AgvW, bank_status: Frozen(HywpywCHEU3Jo4tcbC2xoAay9DTwkbZZG7y5sUw6AgvW) })

######## Panic #1
[2022-06-27T14:17:36.766382216Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="tokio-runtime-worker" one=1i message="panicked at 'called `Option::unwrap()` on a `None` value', streamer/src/nonblocking/quic.rs:420:65" location="streamer/src/nonblocking/quic.rs:420:65" version="\"1.10.27 (src:8a9f5962; feat:1425680972)\""

#### Restart #1 - loading from full & incremental
[2022-06-27T14:18:27.365756060Z INFO  solana_runtime::snapshot_utils] Loading bank from full snapshot: /home/sol/validator-ledger/snapshot-138726774-58YDKPX6MBzQBG79vpfbNd5xqDD77r5nT7PgiGdwvvpo.tar.zst, and incremental snapshot: Some("/home/sol/validator-ledger/incremental-snapshot-138726774-138773288-agCF5RhnoN76u7pHYS761NooAAifWuuy7L4SG7eeZ8a.tar.zst")

#### Re-freeze of incremental snapshot slot
#### All hashes match
[2022-06-27T14:18:44.634951057Z INFO  solana_runtime::bank] bank frozen: 138773288 hash: GKGWXijGzAeHJesRmM3FWy1sqJEHBwQpmjRRxRaWuG9N accounts_delta: 9i1wPaWVco82VQycCaWJAmJ75Kbr3D7jPrrSwBVrVKWd signature_count: 4049 last_blockhash: 34Sr7xQNucjcgbAACnunH2rvjzLFWZ8VdxT65UUmuTWk capitalization: 1503173472978444308
[2022-06-27T14:18:44.634967889Z INFO  solana_runtime::bank] accounts hash slot: 138773288 stats: BankHashStats { num_updated_accounts: 7937, num_removed_accounts: 0, num_lamports_stored: 584568066570500347, total_data_len: 14621377, num_executable_accounts: 0 }

#### Re-freeze of first slot after incremental snapshot slot
#### Bank hash deviates / accounts_delta deviates / last_blockhash matches
[2022-06-27T14:18:45.098747233Z INFO  solana_runtime::bank] bank frozen: 138773289 hash: 5XShLx2o1G9nCEQsQnS7JBTZUoPwmiCNTsmCaEncQqED accounts_delta: AcoUgfiNBzauNvufeXV6YbyfPYrdGTMMJ3Y1Vha4oRz signature_count: 3427 last_blockhash: DvSuxyuA8LaTBK8XJv2a4Dsck7Lt3gr5LWWWPdVm1BrL capitalization: 1503173472970177932
[2022-06-27T14:18:45.098762422Z INFO  solana_runtime::bank] accounts hash slot: 138773289 stats: BankHashStats { num_updated_accounts: 5109, num_removed_accounts: 0, num_lamports_stored: 298656174786929275, total_data_len: 6397015, num_executable_accounts: 0 }

#### Re-freeze of latest root before panic #1
#### Bank hash deviates / accounts_delta deviates / last_blockhash matches
[2022-06-27T14:19:26.130767847Z INFO  solana_runtime::bank] bank frozen: 138773761 hash: 811FgHr7VJH1pW7MCx35DCheQEi9Vgu49YFmXsovASrK accounts_delta: 5gns6ZkekRGVmJgbAvjpPaQigkY5K7oUGScxww9ifrq8 signature_count: 2446 last_blockhash: A4hXWVCnAWAA91LytgaCgVF5yNf3VSDkExcmVixm9Vsa capitalization: 1503173471039150517
[2022-06-27T14:19:26.130782655Z INFO  solana_runtime::bank] accounts hash slot: 138773761 stats: BankHashStats { num_updated_accounts: 2458, num_removed_accounts: 0, num_lamports_stored: 315214412910007789, total_data_len: 157813, num_executable_accounts: 0 }

######## Panic #2
[2022-06-27T14:20:21.776152518Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="solana-replay-stage" one=1i message="panicked at 'slot=138774303 must exist in ProgressMap', core/src/progress_map.rs:576:32" location="core/src/progress_map.rs:576:32" version="\"1.10.27 (src:8a9f5962; feat:1425680972)\""

#### Restart #2 - hashes match results from Restart #1, which, again, deviate from the validator's results prior to Panic #1

@steviez (Contributor) commented Jun 30, 2022

I can recreate the invalid hash with solana-ledger-tool verify. Ledger blockhashes match, so I think we can rule that out. I'm currently assembling additional ledger data to do a longer replay and try to evaluate the legitimacy of the snapshot(s).

A few notes:

  • accounts_delta differs between the initial run and the subsequent run despite all of the fields in BankHashStats being the same
  • I noticed the full snapshot and incremental snapshot are from different epochs (333 and 334, respectively). We've had some deviant behavior across epoch boundaries before, so that's definitely something I'm keeping in the foreground

@steviez steviez self-assigned this Jun 30, 2022
@steviez (Contributor) commented Jun 30, 2022

A few updates:

  • I tried replaying from the full snapshot only (no incremental, and thus replayed across the epoch boundary); this produces the correct hash for slot 138773289
  • I tried creating a full snapshot at the same slot that the incremental snapshot is at (138773288). This produces the correct hash for 138773288, but the incorrect hash for slot 138773289.

From the two above, it seems that there is some state that isn't being captured in the snapshot, or that the re-establishment of state from the snapshot isn't working correctly. After brainstorming with @brooksprumo, I'm now going to try to directly compare banks: one from the original full snapshot with replay, and one from loading the snapshot directly. Bank implements == and the Debug trait, so I should be able to check equivalence and/or get a printout of what might be differing.
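A possible shape for that comparison, as a generic sketch rather than the actual harness: anything that implements PartialEq and Debug (as the comment notes Bank does) can be checked for equivalence and dumped for an offline diff. The dump_if_different name and the output file paths here are illustrative.

```rust
use std::fmt::Debug;

// Generic sketch: compare a replay-derived value against a snapshot-loaded
// one and, if they differ, write both Debug representations to disk so the
// deviating field can be found with an ordinary diff tool.
fn dump_if_different<T: PartialEq + Debug>(
    label: &str,
    replayed: &T,
    loaded: &T,
) -> std::io::Result<()> {
    if replayed == loaded {
        println!("{label}: identical");
    } else {
        std::fs::write(format!("{label}-replayed.txt"), format!("{replayed:#?}"))?;
        std::fs::write(format!("{label}-loaded.txt"), format!("{loaded:#?}"))?;
        println!("{label}: MISMATCH, wrote both Debug dumps");
    }
    Ok(())
}
```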

@behzadnouri (Contributor)

@steviez (Contributor) commented Jul 4, 2022

I figured this out this past weekend; a PR is on the way shortly after this comment. The issue is related to the lamports_per_signature field; see #25364 for more details on this topic. Essentially, we were not previously storing this field in snapshots, so if the slot the snapshot was taken at happened to have a non-default value, that information would be lost to the node unpacking the snapshot, and it would then deviate from consensus on subsequent slots.

This is exactly what happened with the incremental snapshot slot (138773288) that tlambertz provided me, as some logging shows:

Fee rate governor for slot 138773288: FeeRateGovernor { lamports_per_signature: 6000, target_lamports_per_signature: 10000, target_signatures_per_slot: 20000, min_lamports_per_signature: 5000, max_lamports_per_signature: 100000, burn_percent: 50 }
Fee rate governor for slot 138773289: FeeRateGovernor { lamports_per_signature: 5500, target_lamports_per_signature: 10000, target_signatures_per_slot: 20000, min_lamports_per_signature: 5000, max_lamports_per_signature: 100000, burn_percent: 50 }

So, with this in mind, we can fully explain the sequence of events that tlambertz observed:

  • Node dies with a panic (this has since been fixed)
  • Node restarts at slot 138773288 (the slot of the incremental snapshot)
  • The snapshot doesn't contain the lamports_per_signature field, so the default is used
  • The default value (which is incorrect for this slot) causes deviation from consensus
  • The node eventually notices that its bank hashes mismatch the cluster's, starts purging, and ends up at the ProgressMap panic

The lamports_per_signature PR was backported to v1.10, but only the read aspect of it. That is, snapshots that contained the field (i.e. from a v1.11 node) could be read. However, the write aspect (storing the value in created snapshots) wasn't backported. Backporting that aspect is the fix here.
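To illustrate the failure mode only (this is deliberately not the real Solana snapshot format; the FeeStateSnapshot struct and the use of serde/JSON here are purely hypothetical): a field that older writers omit is re-created with its default on load, so the restored node silently disagrees with peers that kept the live value.

```rust
use serde::{Deserialize, Serialize};

// Hypothetical, drastically simplified snapshot record for illustration.
#[derive(Serialize, Deserialize, Debug, PartialEq)]
struct FeeStateSnapshot {
    slot: u64,
    // Older writers never stored this field, so on load it falls back to
    // the default instead of the live value (6000 in the log above).
    #[serde(default)]
    lamports_per_signature: u64,
}

fn main() {
    let live = FeeStateSnapshot { slot: 138773288, lamports_per_signature: 6000 };

    // Simulate an old-format snapshot that dropped the field.
    let old_format = r#"{"slot":138773288}"#;
    let restored: FeeStateSnapshot = serde_json::from_str(old_format).unwrap();

    // The restored state no longer matches what the live cluster had, so
    // fee calculations (and hence bank hashes) diverge from here on.
    assert_ne!(live, restored);
    println!("restored with default: {restored:?}");
}
```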

@t-nelson (Contributor) commented Jul 6, 2022

Fee rate governor for slot 138773288: FeeRateGovernor { lamports_per_signature: 6000, target_lamports_per_signature: 10000, target_signatures_per_slot: 20000, min_lamports_per_signature: 5000, max_lamports_per_signature: 100000, burn_percent: 50 }
Fee rate governor for slot 138773289: FeeRateGovernor { lamports_per_signature: 5500, target_lamports_per_signature: 10000, target_signatures_per_slot: 20000, min_lamports_per_signature: 5000, max_lamports_per_signature: 100000, burn_percent: 50 }

wow! mainnet-beta hit congestion fees?

@steviez (Contributor) commented Jul 6, 2022

wow! mainnet-beta hit congestion fees?

Technically, all of the actual debugging and reproduction was done on a testnet snapshot + ledger, and I can only definitively say that we hit the elevated value on testnet.

The initial issue report that was transcribed from Discord failed in the same way, but it is possible that something else caused the consensus deviation ... not enough logs were provided to know with certainty.

@t-nelson (Contributor) commented Jul 6, 2022

bench-tps hits it on testnet pretty often. This uncovered a similar bug for fee_calculator storage in durable nonce accounts.

@steviez (Contributor) commented Jul 7, 2022

v1.11 and master weren't affected; the fix will first be included in v1.10.30.

@steviez steviez closed this as completed Jul 7, 2022