Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Checks if bank snapshot is loadable before fastbooting #343

Merged

Conversation

brooksprumo
Copy link

Problem

Please refer to solana-labs#35367

Summary of Changes

Before loading a bank snapshot for fastboot, ensure it can be used! This means that if we do fastboot the bank snapshot, it has the correct other stuff on disk (aka the right full snapshot archive) so that subsequent snapshots will succeed, and we won't hit the error described in the problem.

Fixes solana-labs#35367

@brooksprumo brooksprumo self-assigned this Mar 20, 2024
@codecov-commenter
Copy link

codecov-commenter commented Mar 20, 2024

Codecov Report

Attention: Patch coverage is 68.13187% with 58 lines in your changes are missing coverage. Please review.

Project coverage is 81.9%. Comparing base (0e932c7) to head (796f5d1).
Report is 38 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff            @@
##           master     #343    +/-   ##
========================================
  Coverage    81.9%    81.9%            
========================================
  Files         837      837            
  Lines      226830   226979   +149     
========================================
+ Hits       185856   186010   +154     
+ Misses      40974    40969     -5     

@brooksprumo brooksprumo marked this pull request as ready for review March 20, 2024 21:13
_ => accounts_package.slot,
};
snapshot_utils::write_fastboot_file(&snapshot_info.bank_snapshot_dir, fastboot_slot)
.unwrap();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unwrap?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, missed this one. Let me fix that.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Fixed in 3c2e77a.

let Some(bank_snapshot) = latest_bank_snapshot else {
// If we do *not* have a local snapshot to fastboot, then it must be because there was
// *not* a loadable local snapshot AND we shall *not* startup from a snapshot archive.
assert_eq!(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's going on here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, a little hairy.

If we get inside this else of the let-else, then we are in a condition where (1) we're supposed to fastboot, and (2) there is no local state (i.e. a bank snapshot) to fastboot from. This should only happen if we've specified to never use a snapshot archive. So we assert that here.

If instead we used either when-newest or always for when to load from snapshot archives, then we should not be in this else of the outer if-else. Instead we should've set will_startup_from_snapshot_archives to true, since latest_bank_snapshot would've been None.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Jeff here, it seems odd to me. We are basically assigning a bool, then having an assert we're assigning it correctly?

Line 250:

    let will_startup_from_snapshot_archives = match process_options.use_snapshot_archives_at_startup
    {
        UseSnapshotArchivesAtStartup::Always => true,
        UseSnapshotArchivesAtStartup::Never => false,
        UseSnapshotArchivesAtStartup::WhenNewest => latest_bank_snapshot
            .as_ref()
            .map(|bank_snapshot| latest_snapshot_archive_slot > bank_snapshot.slot)
            .unwrap_or(true),
    };
    let bank = if will_startup_from_snapshot_archives {
        // ...stuff
    } else {
            // fastboot from local state
            let Some(bank_snapshot) = latest_bank_snapshot else {
            // If we do *not* have a local snapshot to fastboot, then it must be because there was
            // *not* a loadable local snapshot AND we shall *not* startup from a snapshot archive.
            assert_eq!(
                process_options.use_snapshot_archives_at_startup,
                UseSnapshotArchivesAtStartup::Never,
            );
            return Err(BankForksUtilsError::NoBankSnapshotDirectory {
                flag: use_snapshot_archives_at_startup::cli::LONG_ARG.to_string(),
                value: UseSnapshotArchivesAtStartup::Never.to_string(),
            });
        };
    }

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Effectively yes 😅 The assert is there to ensure we don't refactor something and cause the returned error message to be wrong/misleading.

(I do like your suggestion here #343 (review) too, so likely will go in that direction.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been refactored. Please refer to #343 (comment).

Copy link

@apfitzge apfitzge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense but some of the naming could be more explicit.

I think we might do a small refactor to make some of this more clear. We have this odd pattern where we are getting highest loadable snapshot, doing match on how we start up, then an if based on the result of the previous match.

Can we do something like this?

    let fastboot_snapshot = match process_options.use_snapshot_archives_at_startup {
        UseSnapshotArchivesAtStartup::Always => None,
        UseSnapshotArchivesAtStartup::Never => {
            let fastboot_snapshot =
                snapshot_utils::get_highest_loadable_bank_snapshot(snapshot_config);
            if fastboot_snapshot.is_none() {
                return Err(BankForksUtilsError::NoBankSnapshotDirectory {
                    flag: use_snapshot_archives_at_startup::cli::LONG_ARG.to_string(),
                    value: UseSnapshotArchivesAtStartup::Never.to_string(),
                });
            }
            fastboot_snapshot
        }
        UseSnapshotArchivesAtStartup::WhenNewest => {
            snapshot_utils::get_highest_loadable_bank_snapshot(snapshot_config)
                .filter(|bank_snapshot| bank_snapshot.slot >= latest_snapshot_archive_slot)
        }
    };

    if let Some(fastboot_snapshot) = fastboot_snapshot {
        // fast boot code
    } else {
        // slow boot code
    }

The get_highest_loadable_bank_snapshot call is in multiple places, but we treat it differently because the behavior is different for those cases.
The error handling is more upfront if we cannot find it for Never case, and don't have this (match, if, match) thing going on.
At least to me something like this is more clear, wdyt?

@@ -303,6 +311,20 @@ impl AccountsHashVerifier {
&accounts_hash_for_reserialize,
bank_incremental_snapshot_persistence.as_ref(),
);

// now write the fastboot file after reserializing so this bank snapshot is loadable
let fastboot_slot = match accounts_package.package_kind {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we rename this? It's more like "fastboot_most_recent_fullsnapshot_slot" or "fastboot_base_slot".

Same with the "fastboot_file". It seems like it should be "fastboot_base_slot_file".
Current naming made me think it contained the slot we were fastbooting from.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion; will do.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the fastboot file, do you want the function renamed, the comment reworded, or the file itself renamed?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renaming fastboot_slot has been fixed in 2e351f3.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for not being more clear. I was hoping to rename the file and whole "concept" of this file.
We're using it for fastboot, sure, but the file itself contains the slot of the full snapshot this snapshot is based on. Seems like that's potentially useful info even without fastboot, so giving it a more clear name everywhere (file name, function names, variable names) would make it more obvious what information is available there.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Yeah, that makes sense. The file could be "full_snapshot_archive_slot" instead. I was purposely going the opposite way; being ambiguous about what's inside the file, but indicating that it is used for fastboot. Renaming the file would be clear about the contents of the file, but ambiguous about usage.

I think for the sake of any forward compatibility/breaking changes, I'll go with the "full_snapshot_archive_slot" idea. This is unlikely to change. If we need more/different information later, we'll create another file.

Does that sound ok?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, thanks @brooksprumo!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Renamed in 796f5d1.

let Some(bank_snapshot) = latest_bank_snapshot else {
// If we do *not* have a local snapshot to fastboot, then it must be because there was
// *not* a loadable local snapshot AND we shall *not* startup from a snapshot archive.
assert_eq!(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Jeff here, it seems odd to me. We are basically assigning a bool, then having an assert we're assigning it correctly?

Line 250:

    let will_startup_from_snapshot_archives = match process_options.use_snapshot_archives_at_startup
    {
        UseSnapshotArchivesAtStartup::Always => true,
        UseSnapshotArchivesAtStartup::Never => false,
        UseSnapshotArchivesAtStartup::WhenNewest => latest_bank_snapshot
            .as_ref()
            .map(|bank_snapshot| latest_snapshot_archive_slot > bank_snapshot.slot)
            .unwrap_or(true),
    };
    let bank = if will_startup_from_snapshot_archives {
        // ...stuff
    } else {
            // fastboot from local state
            let Some(bank_snapshot) = latest_bank_snapshot else {
            // If we do *not* have a local snapshot to fastboot, then it must be because there was
            // *not* a loadable local snapshot AND we shall *not* startup from a snapshot archive.
            assert_eq!(
                process_options.use_snapshot_archives_at_startup,
                UseSnapshotArchivesAtStartup::Never,
            );
            return Err(BankForksUtilsError::NoBankSnapshotDirectory {
                flag: use_snapshot_archives_at_startup::cli::LONG_ARG.to_string(),
                value: UseSnapshotArchivesAtStartup::Never.to_string(),
            });
        };
    }

runtime/src/snapshot_utils.rs Outdated Show resolved Hide resolved
ledger/src/bank_forks_utils.rs Outdated Show resolved Hide resolved
@brooksprumo
Copy link
Author

I think we might do a small refactor to make some of this more clear. [..]
Can we do something like this? [..]

Done! First ff1867c swaps the if-else block (to make the next diff easier). And second, 8b80597 does the refactor.

Copy link

@apfitzge apfitzge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think the refactor made it a lot clearer, thanks! A couple more comments/questions

core/src/accounts_hash_verifier.rs Show resolved Hide resolved
@@ -303,6 +311,20 @@ impl AccountsHashVerifier {
&accounts_hash_for_reserialize,
bank_incremental_snapshot_persistence.as_ref(),
);

// now write the fastboot file after reserializing so this bank snapshot is loadable
let fastboot_slot = match accounts_package.package_kind {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for not being more clear. I was hoping to rename the file and whole "concept" of this file.
We're using it for fastboot, sure, but the file itself contains the slot of the full snapshot this snapshot is based on. Seems like that's potentially useful info even without fastboot, so giving it a more clear name everywhere (file name, function names, variable names) would make it more obvious what information is available there.

/// To be loadable, the bank snapshot must be a BankSnapshotKind::Post.
/// And if we're generating snapshots (e.g. running a normal validator), then
/// the full snapshot file's slot must match the highest full snapshot archive's.
pub fn get_highest_loadable_bank_snapshot(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had a separate discussion with @brooksprumo about this function.
There is an edge-case that is not handled as efficiently as we'd like, but we decided to leave it as a follow-up since this PR covers the majority of cases and already has a lot going on.

The edge case is when we have a higher full snapshot archive than our highest snapshot. This can happen if ledger-tool is used to create a full snapshot archive - this does not create a snapshot.

This function could likely be improved so that we only check the existence of the full snapshot archive matching the full_snapshot_file_slot, even if it not the latest full snapshot archive. However, it requires changes elsewhere in ABS (others?) to gate cleaning properly.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good find! I'll handle this in a follow-up PR.

Copy link

@apfitzge apfitzge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy with changes as they are, as mentioned in my comment, it covers most of the cases and we are leaving an edge-case as follow-up work.

@brooksprumo brooksprumo merged commit 182d27f into anza-xyz:master Mar 28, 2024
37 checks passed
@brooksprumo brooksprumo deleted the fastboot/missing-full-snapshot-hash branch March 28, 2024 15:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fastboot fails if the node crashes while archiving a full snapshot
4 participants