Fastboot fails if the node crashes while archiving a full snapshot #35367

Closed
brooksprumo opened this issue Feb 29, 2024 · 1 comment · Fixed by anza-xyz/agave#343

brooksprumo commented Feb 29, 2024

Problem

If a node crashes while archiving a full snapshot, and it has already created more (incremental) bank snapshots based on that full snapshot, then fastboot will likely fail with an error message like:

incremental snapshot requires accounts hash and capitalization from the full snapshot it is based on
Problem Details

Here's an example based on an error message that @jstarry sent me, after he added the patch from #35353:

[2024-02-29T01:51:48.833574440Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="solAcctHashVer" one=1i message="panicked at core/src/accounts_hash_verifier.rs:328:21:
   0: rust_begin_unwind
             at ./rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:595:5
   1: core::panicking::panic_fmt
             at ./rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panicking.rs:67:14
   2: solana_core::accounts_hash_verifier::AccountsHashVerifier::process_accounts_package
    incremental snapshot requires accounts hash and capitalization from the full snapshot it is based on 
    package: AccountsPackage { kind: Snapshot(IncrementalSnapshot(251086076)), slot: 251099346, block_height: 231733000, .. } 
    accounts hashes: {251098305: (AccountsHash(BNauzhxdBL7ZjVdQYwA5sFX6U4ZFnX8x1z2faXhCx5vy), 570710273002510414)} 
    incremental accounts hashes: {251099045: (IncrementalAccountsHash(HTXWvmVKKHB8RYwHfjdgr1oX1rmchDQC6kuBH2jeAo18), 20095246361845682)} 
    full snapshot archives: [FullSnapshotArchiveInfo(SnapshotArchiveInfo { path: \"/mnt/snapshots/snapshot-251086076-7ZCZ8PiKRTxjgmaQTXKGaPoEYA688deZ4x8r6FHUS5qQ.tar.zst\", slot: 251086076, hash: SnapshotHash(7ZCZ8PiKRTxjgmaQTXKGaPoEYA688deZ4x8r6FHUS5qQ), archive_format: TarZstd }), FullSnapshotArchiveInfo(SnapshotArchiveInfo { path: \"/mnt/snapshots/snapshot-251073753-325bXiW6oATv8i6LezYPR2hNu4P3JtrCECQLZBiJ9J22.tar.zst\", slot: 251073753, hash: SnapshotHash(325bXiW6oATv8i6LezYPR2hNu4P3JtrCECQLZBiJ9J22), archive_format: TarZstd })] 
    bank snapshots: [BankSnapshotInfo { slot: 251099346, snapshot_type: Pre, snapshot_dir: \"/mnt/incremental-snapshots/snapshot/251099346\", snapshot_version: V1_2_0 }, BankSnapshotInfo { slot: 251099045, snapshot_type: Post, snapshot_dir: \"/mnt/incremental-snapshots/snapshot/251099045\", snapshot_version: V1_2_0 }]" location="core/src/accounts_hash_verifier.rs:328:21" version="1.17.22 (src:2c5aa387; feat:3580551090, client:JitoLabs)"

Here's a walkthrough of how the error occurs:

  • SPS (the SnapshotPackagerService) is in the middle of archiving a full snapshot (slot 251098305 in this case).
  • AHV (the AccountsHashVerifier) processes the next incremental snapshot successfully, so the corresponding bank snapshot is created, and its expected full snapshot slot is 251098305.
  • A few more incremental bank snapshots may be created too.
  • The node crashes before the full snapshot finishes being archived.
  • At the next startup, fastboot grabs the latest bank snapshot (slot 251099346), which contains the accounts hash for the full snapshot at slot 251098305.
  • When ABS (the AccountsBackgroundService) starts up, it is told the slot of the last full snapshot archive, which is 251086076.
  • The first new snapshot request sent to ABS will be an incremental snapshot. ABS believes the last full snapshot was for slot 251086076, so it packages up the new snapshot with that old full snapshot slot and sends it over to AHV.
  • AHV sees the incremental snapshot request and, in effect, says: "You asked me to make an incremental snapshot, so I need to know about the full snapshot it is based on. Please tell me that full snapshot's slot and accounts hash." The request says its full snapshot is 251086076, but AccountsDb only knows about 251098305.
  • And then the panic is triggered (a sketch of the failing lookup follows this list).
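
To make the failure concrete, here is a minimal sketch of the lookup AHV effectively performs, assuming AccountsDb keeps its full accounts hashes in a map keyed by slot. The types and function below are illustrative stand-ins, not the actual agave code:

```rust
use std::collections::HashMap;

// Simplified stand-ins for the real AccountsDb types; illustrative only.
type Slot = u64;
type AccountsHash = String;
type Capitalization = u64;

/// Sketch of the failing check: to build an incremental snapshot based on
/// `base_slot`, AccountsDb must already hold the accounts hash and
/// capitalization for that exact full-snapshot slot.
fn get_base_accounts_hash(
    full_accounts_hashes: &HashMap<Slot, (AccountsHash, Capitalization)>,
    base_slot: Slot,
) -> (AccountsHash, Capitalization) {
    full_accounts_hashes.get(&base_slot).cloned().expect(
        "incremental snapshot requires accounts hash and capitalization \
         from the full snapshot it is based on",
    )
}

fn main() {
    // AccountsDb only knows about the *new* full snapshot slot (251098305)...
    let mut hashes = HashMap::new();
    hashes.insert(251098305, ("BNauzhxd...".to_string(), 570710273002510414u64));

    // ...but ABS packaged the request against the *old* archive slot
    // (251086076), so the lookup fails and the thread panics, as in the log.
    let _ = get_base_accounts_hash(&hashes, 251086076);
}
```
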
More info

Here are more logs from Justin's machine:

[2024-02-29T01:33:55.752948313Z INFO  solana_runtime::snapshot_bank_utils] Creating bank snapshot for slot 251098305, path: /mnt/incremental-snapshots/snapshot/251098305/251098305.pre
[2024-02-29T01:34:06.521515316Z INFO  solana_runtime::snapshot_bank_utils] bank serialize took 3.9s for slot 251098305 at /mnt/incremental-snapshots/snapshot/251098305/251098305.pre
[2024-02-29T01:34:06.522263848Z INFO  solana_runtime::snapshot_package] Package snapshot for bank 251098305 has 421257 account storage entries (snapshot kind: FullSnapshot)
[2024-02-29T01:34:06.522275888Z INFO  solana_runtime::accounts_background_service] Took bank snapshot. accounts package kind: Snapshot(FullSnapshot), slot: 251098305, bank hash: DrAkL3ceUCtYC5TEGx5orqkLn3FXiPQ5JxwWK8jgn1vT
[2024-02-29T01:34:06.859756460Z INFO  solana_core::accounts_hash_verifier] handling accounts package: AccountsPackage { kind: Snapshot(FullSnapshot), slot: 251098305, block_height: 231732000, .. }
[2024-02-29T01:34:51.320797198Z INFO  solana_runtime::snapshot_package] Package snapshot for bank 251098409 has 421253 account storage entries (snapshot kind: IncrementalSnapshot(251098305))
[2024-02-29T01:34:51.320814578Z INFO  solana_runtime::accounts_background_service] Took bank snapshot. accounts package kind: Snapshot(IncrementalSnapshot(251098305)), slot: 251098409, bank hash: 7RiJFiJqvkZ4YchCo7SxtdjrwN4F5zrKXz4Ngy5NwpUE
[2024-02-29T01:34:53.575408055Z INFO  solana_accounts_db::accounts_db] calculate_accounts_hash_from_storages: slot: 251098305, Full(AccountsHash(BNauzhxdBL7ZjVdQYwA5sFX6U4ZFnX8x1z2faXhCx5vy)), capitalization: 570710273002510414
[2024-02-29T01:35:06.154583143Z INFO  solana_metrics::metrics] datapoint: fastboot slot=251098305i num_storages_total=421257i num_storages_kept_alive=15i
[2024-02-29T01:35:06.154587592Z INFO  solana_core::accounts_hash_verifier] handling accounts package: AccountsPackage { kind: Snapshot(IncrementalSnapshot(251098305)), slot: 251098409, block_height: 231732100, .. }
[2024-02-29T01:35:06.233502620Z INFO  solana_core::snapshot_packager_service] handling snapshot package: SnapshotPackage { type: FullSnapshot, slot: 251098305, block_height: 231732000, .. }
[2024-02-29T01:35:06.233526780Z INFO  solana_runtime::snapshot_utils] Generating snapshot archive for slot 251098305
[2024-02-29T01:35:27.307496396Z INFO  solana_runtime::snapshot_package] Package snapshot for bank 251098509 has 421247 account storage entries (snapshot kind: IncrementalSnapshot(251098305))
[2024-02-29T01:35:27.307507596Z INFO  solana_runtime::accounts_background_service] Took bank snapshot. accounts package kind: Snapshot(IncrementalSnapshot(251098305)), slot: 251098509, bank hash: HggkzD851JtZK6JxpHuadqRayB5yUh1zU5Dn2ZKu7xLG
[2024-02-29T01:35:27.592021341Z INFO  solana_core::accounts_hash_verifier] handling accounts package: AccountsPackage { kind: Snapshot(IncrementalSnapshot(251098305)), slot: 251098509, block_height: 231732200, .. }
[2024-02-29T01:36:08.734285801Z INFO  solana_runtime::snapshot_package] Package snapshot for bank 251098613 has 421251 account storage entries (snapshot kind: IncrementalSnapshot(251098305))
[2024-02-29T01:36:08.734305441Z INFO  solana_runtime::accounts_background_service] Took bank snapshot. accounts package kind: Snapshot(IncrementalSnapshot(251098305)), slot: 251098613, bank hash: 3FRAp6Yn6EKibVTXDhEmVxsppvSSRFXP3PAZidj5UFgV
[2024-02-29T01:36:08.737970571Z INFO  solana_core::accounts_hash_verifier] handling accounts package: AccountsPackage { kind: Snapshot(IncrementalSnapshot(251098305)), slot: 251098613, block_height: 231732300, .. }
[2024-02-29T01:36:53.860137018Z INFO  solana_runtime::snapshot_package] Package snapshot for bank 251098713 has 421255 account storage entries (snapshot kind: IncrementalSnapshot(251098305))
[2024-02-29T01:36:53.860152468Z INFO  solana_runtime::accounts_background_service] Took bank snapshot. accounts package kind: Snapshot(IncrementalSnapshot(251098305)), slot: 251098713, bank hash: B5ciX3tdiSdAZVoHz6vQ2THRGdhuhfBiBPY7ZPjvR97p

In particular:

solana_core::snapshot_packager_service] handling snapshot package: SnapshotPackage { type: FullSnapshot, slot: 251098305, block_height: 231732000, .. }
solana_runtime::snapshot_utils] Generating snapshot archive for slot 251098305

This confirms that:

  • The node was archiving a full snapshot for slot 251098305 when it crashed
  • The node had created multiple incremental bank snapshots beyond slot 251098305

Proposed Solution

Yikes! So we need some way to identify whether the fastboot bank snapshot matches the actual full snapshot archives on disk. If it doesn't, that bank snapshot should be purged.

Unfortunately, we cannot use older bank snapshots, because their account storage files have likely been recycled/shrunk. So we need to fall back to using a snapshot archive.
(Edit: The recycler has now been removed, so in theory we could use older bank snapshots. This needs testing first.)

Option 1:
When taking a bank snapshot, add a new file that records whether it is full or incremental, along with the important slots. Then, at load time, fastboot can see whether the bank snapshot is an incremental and what its base slot is. If there is no full snapshot archive for that base slot, then we cannot use this bank snapshot, so delete it.

If this check is done before we decide whether to fastboot, then the node should correctly restart from a snapshot archive. A sketch follows.
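
For illustration, here is a hedged sketch of Option 1, assuming a hypothetical snapshot_kind marker file written into the bank snapshot directory. The file name, format, and helper functions are made up for this sketch and are not the real agave layout:

```rust
use std::fs;
use std::path::Path;

type Slot = u64;

/// Hypothetical marker describing what a bank snapshot depends on.
enum SnapshotKindMarker {
    Full,
    Incremental { base_slot: Slot },
}

/// At snapshot-creation time, write the marker next to the bank snapshot.
fn write_marker(bank_snapshot_dir: &Path, marker: &SnapshotKindMarker) -> std::io::Result<()> {
    let contents = match marker {
        SnapshotKindMarker::Full => "full".to_string(),
        SnapshotKindMarker::Incremental { base_slot } => format!("incremental {base_slot}"),
    };
    fs::write(bank_snapshot_dir.join("snapshot_kind"), contents)
}

/// At load time, before deciding to fastboot, check the marker against the
/// full snapshot archives actually on disk; purge the bank snapshot if its
/// base full snapshot archive is missing.
fn can_fastboot(bank_snapshot_dir: &Path, full_archive_slots: &[Slot]) -> std::io::Result<bool> {
    let contents = fs::read_to_string(bank_snapshot_dir.join("snapshot_kind"))?;
    let usable = match contents.split_whitespace().collect::<Vec<_>>().as_slice() {
        ["full"] => true,
        ["incremental", base] => base
            .parse::<Slot>()
            .map(|base_slot| full_archive_slots.contains(&base_slot))
            .unwrap_or(false),
        _ => false,
    };
    if !usable {
        // Base archive is gone (or the marker is unreadable): this bank
        // snapshot cannot be used for fastboot, so delete it; startup then
        // falls back to loading from a snapshot archive.
        fs::remove_dir_all(bank_snapshot_dir)?;
    }
    Ok(usable)
}
```

Because the check runs before the fastboot decision, a purged bank snapshot simply means startup falls back to the newest valid snapshot archive, as described above.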

Option 2:
Similar to Option 1, we add the same new file to record the important slots. But instead, at load time, if there is no full snapshot archive for the given base slot, we immediately generate a new full snapshot archive on the next snapshot request. This may require more code changes and may increase disk I/O, but it starts up from a more recent slot than Option 1. If there is another crash before the new full snapshot archive is made, we will likely end up in the same scenario again.
(h/t to @apfitzge for this possible solution)
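
Correspondingly, a small sketch of Option 2's startup-time decision, again with made-up names; the real change would live in the ABS/snapshot-request path:

```rust
type Slot = u64;

/// Hypothetical startup-time decision for Option 2: keep the newer bank
/// snapshot, but if its full-snapshot base has no archive on disk, force the
/// very next snapshot request to be a full snapshot instead of an incremental.
fn should_force_full_snapshot(
    fastboot_base_slot: Option<Slot>, // base slot recorded by the marker file
    full_archive_slots: &[Slot],      // full snapshot archives found on disk
) -> bool {
    match fastboot_base_slot {
        // Incremental bank snapshot whose base archive never finished: the
        // missing archive must be regenerated before any incremental is valid.
        Some(base) => !full_archive_slots.contains(&base),
        // Full bank snapshot (or no marker): nothing to force.
        None => false,
    }
}

fn main() {
    // Matches the crash above: base 251098305 was never archived; only the
    // older archives at 251086076 and 251073753 exist, so a full snapshot
    // would be forced next.
    assert!(should_force_full_snapshot(Some(251098305), &[251086076, 251073753]));
}
```
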

Work-arounds

There are some work-arounds available already; they boil down to loading from a snapshot archive instead of local state.

  1. Use --use-snapshot-archives-at-startup always to force loading from a snapshot archive.
  2. Delete the (non-archive) snapshots directory (often ledger/snapshots/), which is where the local state lives. With this directory gone, startup will fall back to loading from a snapshot archive.

Now that #35350 has merged, the problematic local state will be removed automatically, so a subsequent restart should work without any manual intervention.

@brooksprumo commented:

An additional note: currently the system cannot recover from this, meaning subsequent reboots will keep hitting this panic (or the other one about invalid append vecs: #35190).

Luckily, #35350 will recover: after one failure, the next reboot will successfully fall back to loading from a snapshot archive.
