Purges all bank snapshots after fastboot #35350

Merged
merged 2 commits into from
Feb 29, 2024

Conversation

brooksprumo
Contributor

@brooksprumo brooksprumo commented Feb 28, 2024

Problem

Given the following scenario:

  • a (bank) snapshot interval of 100
  • a node starts up and loads from a bank snapshot at slot 200 (i.e. fastboot)
  • the node crashes shortly after load; before the next bank snapshot is taken (before slot 300)
  • and shrink has run (e.g. shrink proceeds up to slot 275)
  • the node restarts, still with fastboot enabled

Then the result is:

  • due to shrink, the account storage files on disk will (may) have changed, and reflect the account state as of slot 275
  • the bank snapshot at 200 has hard links to the account storage files, and those account storage files may have been shrunk
  • the bank snapshot contains information about the accounts as of slot 200, but the actual storages reflect the accounts as of slot 275
  • at startup, the bank snapshot at slot 200 will be selected for fastboot again, and the node will crash, saying that the/an account storage file does not hold the correct number of accounts

If the node has a script to auto-restart, it will enter a boot-crash loop indefinitely.
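The failure mode above can be modeled in a few lines. This is a minimal sketch, not the validator's actual code: `StorageInfo` and `verify_storages` are hypothetical names, and the real check involves more than an account count. It only illustrates why a bank snapshot taken at slot 200 stops matching storage files that shrink has since rewritten.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified record of what a bank snapshot remembers
/// about one account storage file at the time the snapshot was taken.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct StorageInfo {
    slot: u64,
    account_count: usize,
}

/// Hypothetical startup check: every storage recorded by the bank snapshot
/// must still exist on disk and hold the recorded number of accounts.
/// `on_disk` maps a storage's slot to the number of accounts it holds now.
fn verify_storages(
    recorded: &[StorageInfo],
    on_disk: &HashMap<u64, usize>,
) -> Result<(), String> {
    for info in recorded {
        match on_disk.get(&info.slot) {
            Some(&count) if count == info.account_count => {}
            Some(&count) => {
                return Err(format!(
                    "storage for slot {} holds {} accounts, expected {}",
                    info.slot, count, info.account_count
                ));
            }
            None => return Err(format!("storage for slot {} is missing", info.slot)),
        }
    }
    Ok(())
}
```

In this model, shrinking a storage after the snapshot is taken changes the on-disk count, so the same snapshot fails the check on every subsequent restart.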

Summary of Changes

To break out of the boot-crash loop, purge all bank snapshots after loading from one. In the above scenario, this would cause the node to load from a snapshot archive instead, which is safe. If the node crashes after creating another bank snapshot, then fastboot will work properly again in that situation.
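As a rough illustration of the idea (not the code in this PR), the sketch below removes every per-slot directory under a bank snapshots directory using only the standard library. The function name `purge_snapshot_dirs_sketch` and the one-directory-per-slot layout are assumptions made for the example.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: removes every subdirectory of `bank_snapshots_dir`
/// and returns how many were removed. Assumes each bank snapshot lives in
/// its own subdirectory, which is an assumption of this sketch.
fn purge_snapshot_dirs_sketch(bank_snapshots_dir: &Path) -> io::Result<usize> {
    let mut purged = 0;
    for entry in fs::read_dir(bank_snapshots_dir)? {
        let path = entry?.path();
        if path.is_dir() {
            // Remove the snapshot directory and everything inside it.
            fs::remove_dir_all(&path)?;
            purged += 1;
        }
    }
    Ok(purged)
}
```

Calling a helper like this right after a successful fastboot load means a crash before the next bank snapshot leaves nothing stale to select, so the next startup falls back to a snapshot archive.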

Fixes #35190

More Info

This problem has been hit on v1.17, so I want a solution that can be backported to v1.17. The implemented solution is the least invasive—and safest—one I am aware of. A different solution to allow successfully reusing the bank snapshot from slot 200 will land in master and not be backported.

@brooksprumo brooksprumo self-assigned this Feb 28, 2024
@brooksprumo brooksprumo added the v1.17 (PRs that should be backported to v1.17) and v1.18 (PRs that should be backported to v1.18) labels Feb 28, 2024
Contributor

mergify bot commented Feb 28, 2024

Backports to the stable branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule.

Contributor

mergify bot commented Feb 28, 2024

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

@brooksprumo brooksprumo marked this pull request as ready for review February 28, 2024 16:40

codecov bot commented Feb 28, 2024

Codecov Report

Attention: Patch coverage is 0%, with 1 line in your changes missing coverage. Please review.

Project coverage is 81.7%. Comparing base (8ad125d) to head (35546ca).
Report is 21 commits behind head on master.

❗ Current head 35546ca differs from pull request most recent head 3e42432. Consider uploading reports for the commit 3e42432 to get more accurate results

Additional details and impacted files
@@            Coverage Diff            @@
##           master   #35350     +/-   ##
=========================================
- Coverage    81.7%    81.7%   -0.1%     
=========================================
  Files         834      834             
  Lines      224235   224236      +1     
=========================================
- Hits       183390   183367     -23     
- Misses      40845    40869     +24     

@brooksprumo
Contributor Author

@apfitzge Requesting your review since the PR is fastboot-related, and you've reviewed all the fastboot code
@jeffwashington Requesting your review since it touches anything storage related
@steviez Requesting your review since you have a good eye for the system as a whole and if there may be unintended consequences with the impl

Also note I intend to backport this PR, so also consider if there's anything that should be done differently now that'll make the backporting better.

@jeffwashington
Contributor

@brooksprumo can you please link the pr that adds purge_all_bank_snapshots that I think also must be backported?

@brooksprumo
Contributor Author

@brooksprumo can you please link the pr that adds purge_all_bank_snapshots that I think also must be backported?

I wasn't originally intending to backport #35291. Instead I was planning on using the previous way to purge all bank snapshots:

snapshot_utils::purge_old_bank_snapshots(&snapshot_config.bank_snapshots_dir, 0, None);

Would backporting #35291 be preferred?

My thought process was to backport the least amount of code possible.
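For readers unfamiliar with the call above: the sketch below models why passing 0 purges everything, assuming the second argument is the number of bank snapshots to retain (an inference from the call site, not confirmed in this thread). `slots_to_purge` is a hypothetical function written for illustration only.

```rust
/// Hypothetical model of retention-based purging: given the slots of the
/// bank snapshots on disk, return the slots to purge when only the newest
/// `num_to_retain` snapshots are kept.
fn slots_to_purge(mut slots: Vec<u64>, num_to_retain: usize) -> Vec<u64> {
    // Sort newest (highest slot) first, then drop everything past the
    // retained prefix.
    slots.sort_unstable_by(|a, b| b.cmp(a));
    slots.into_iter().skip(num_to_retain).collect()
}
```

With `num_to_retain` set to 0, nothing is skipped, so every snapshot is selected for purging.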

@jeffwashington
Contributor

@brooksprumo can you please link the pr that adds purge_all_bank_snapshots that I think also must be backported?

I wasn't originally intending to backport #35291. Instead I was planning on using the previous way to purge all bank snapshots:

snapshot_utils::purge_old_bank_snapshots(&snapshot_config.bank_snapshots_dir, 0, None);

Would backporting #35291 be preferred?

My thought process was to backport the least amount of code possible.

so the backport pr from this pr into 1.18 would be manually adjusted so it compiles in 1.18?

@brooksprumo
Contributor Author

so the backport pr from this pr into 1.18 would be manually adjusted so it compiles in 1.18?

Yes, that's what I was thinking.

@jeffwashington
Contributor

@brooksprumo can you please link the pr that adds purge_all_bank_snapshots that I think also must be backported?

I wasn't originally intending to backport #35291. Instead I was planning on using the previous way to purge all bank snapshots:

snapshot_utils::purge_old_bank_snapshots(&snapshot_config.bank_snapshots_dir, 0, None);

Would backporting #35291 be preferred?
My thought process was to backport the least amount of code possible.

so the backport pr from this pr into 1.18 would be manually adjusted so it compiles in 1.18?

ugh. another approach is to submit what will compile in 1.18 in THIS pr and then backport that, then in a follow-on master pr, change this code to use purge_old_bank_snapshots. That leaves the commit in 1.18 matching a commit in master exactly.

Another alternative is don't backport anything, at least not to 1.17.

any other opinions? Am I being too pedantic?

@brooksprumo
Contributor Author

ugh. another approach is to submit what will compile in 1.18 in THIS pr and then backport that, then in a follow-on master pr, change this code to use purge_old_bank_snapshots. That leaves the commit in 1.18 matching a commit in master exactly.

Yeah, I considered this too. My thought was that the actual code change is a single line, so it will be very simple to inspect the backports for correctness; that they indeed are purging all bank snapshots.

I can adopt this approach if it's preferred though.

Another alternative is don't backport anything, at least not to 1.17.

Yes, that's true. However, it feels dangerous to leave a known issue like this, one that requires manual intervention from a node operator in order for the node to start up.

@brooksprumo
Contributor Author

ugh. another approach is to submit what will compile in 1.18 in THIS pr and then backport that, then in a follow-on master pr, change this code to use purge_old_bank_snapshots. That leaves the commit in 1.18 matching a commit in master exactly.

Done in 3e42432. I've changed this PR to use purge_old_bank_snapshots() to allow for a clean backport.

Another alternative is don't backport anything, at least not to 1.17.

Per offline discussions, we will not backport to v1.17; only v1.18.

@brooksprumo brooksprumo removed the v1.17 PRs that should be backported to v1.17 label Feb 29, 2024
Contributor

@jeffwashington jeffwashington left a comment

lgtm. ty.

Contributor

@apfitzge apfitzge left a comment

lgtm - easy backport.

@brooksprumo brooksprumo merged commit bdc5cce into solana-labs:master Feb 29, 2024
36 checks passed
@brooksprumo brooksprumo deleted the fastboot/purge-after-load branch February 29, 2024 19:31
mergify bot pushed a commit that referenced this pull request Feb 29, 2024
@steviez
Contributor

steviez commented Feb 29, 2024

Oops, thought I gave this one a ship-it; glad you didn't wait but no concerns from my end either + LGTM

@brooksprumo
Contributor Author

Oops, thought I gave this one a ship-it; glad you didn't wait but no concerns from my end either + LGTM

Thanks! For the merge to master I felt comfortable with two approvals. For the backport I'll wait for everyone's 😸

brooksprumo added a commit that referenced this pull request Mar 1, 2024
…35379)

Purges all bank snapshots after fastboot (#35350)

(cherry picked from commit bdc5cce)

Co-authored-by: Brooks <brooks@solana.com>
grooviegermanikus pushed a commit to blockworks-foundation/solana that referenced this pull request Apr 9, 2024
godmodegalactus pushed a commit to blockworks-foundation/solana that referenced this pull request Apr 16, 2024
Labels
v1.18 PRs that should be backported to v1.18
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

sudo reboot mb validator recovery fails
4 participants