sudo reboot mb validator recovery fails #35190
Comments
I'll take a look. Thanks for filing this issue. |
Had a similar issue when upgrading from 1.17.20 to 1.17.22. When including
Removing the arg fixed the issue |
@segfaultdoc - Out of curiosity, how was your node stopped / restarted? Similar to what the author of this issue is doing (maybe |
So far I have been unable to reproduce a failure with fastboot. Here's the experiments I've performed so far. For all of them I have specified Note that the terminology may not make sense, as this is copy-pasted from my own internal notes.
I'm not sure what else to try at the moment. Are there other permutations I've missed? |
Similar to author, restarted the systemd service. I'm starting to think maybe we're not exiting cleanly in |
See attached screenshot. This shows Script B, Test Methodology 2. This Script B recovered successfully, in approx 19 minutes. However, when |
This is what I was trying to reproduce by randomly killing the validator process in 6 & 7. It's possible I just didn't hit the issue too. Two runs is not a lot.
The If you happen to have the contents of what's in your |
That was a screenshot I took at the time (12 Feb, 1.17.20). Afraid I no longer have the server. |
After just hitting a space issue on my snapshot dir I crashed with this:
Which led to a boot crash loop with this being the only error there:
Further context on Discord: https://discord.com/channels/428295358100013066/689412830075551748/1210995187950686288 |
Ok, I've found the (a?) problem. The PR to fix it is here: #35350 |
Found the other problem. Here's a GH Issue for it: #35367. |
That’s great news Brooks - thanks |
#35350 has been merged, so the recovery aspect of this failure is now fixed. Other PRs will fix the underlying issues. |
Problem
1.17.20, 10-12 Feb 2024.
I was curious about a mb validator’s recovery ability, so I used a spare non-voting mb validator to see if it could recover from an abrupt sudo reboot.
I ran default validator settings, so incremental snapshots happen every minute and full-snapshots every 3 hours.
I tried 2 different validator startup scripts:
Script A: included --use-snapshot-archives-at-startup when-newest
Script B: the argument was removed
Test Methodology 1:
sudo reboot when incremental snapshots are available and less than 1 minute old:
Script A: 3 reboots, 3 successful recoveries each in approx 13 mins
Script B: 3 reboots, 3 successful recoveries each in approx 15 mins
So far so good!
Test Methodology 2:
However, approximately 10 mins before the 3-hour full snapshot is due, the validator stops creating minute-by-minute incrementals and starts only creating the next full snapshot. This means the last incremental gets up to ~15 mins old. [At least, this is my interpretation of what it looks like it's doing!].
sudo reboot at various times with an old/aging incremental during full snapshot creation (for clarity, this is approx a 15 minute window every 3 hours):
Script A:
Incremental 8 mins old: Failed
Incremental 6 mins old: Failed
Incremental 5 mins old: Failed
Script B:
Incremental 7 mins old: Success, took 19 mins
Incremental 9 mins old: Success, took 20 mins
Incremental 13 mins old: Success, took 23 mins
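For anyone trying to reproduce this, the incremental ages quoted above could be measured just before issuing the reboot with something like the sketch below. It is only an illustration, not what I actually ran: the function name is made up, and it assumes incremental archives in the snapshot directory have file names starting with "incremental-snapshot-" and that file modification time is a good enough proxy for when the snapshot was taken.

```python
import time
from pathlib import Path
from typing import Optional


def newest_incremental_age_seconds(
    snapshot_dir: str, now: Optional[float] = None
) -> Optional[float]:
    """Return the age in seconds of the newest incremental snapshot archive.

    Returns None if no matching archive is found in snapshot_dir.
    """
    # Assumption: incremental archives are named "incremental-snapshot-*".
    archives = list(Path(snapshot_dir).glob("incremental-snapshot-*"))
    if not archives:
        return None

    # Use the most recent modification time as the snapshot's creation time.
    newest_mtime = max(p.stat().st_mtime for p in archives)

    if now is None:
        now = time.time()
    return now - newest_mtime
```

Dividing the result by 60 gives the "Incremental N mins old" figure used in the lists above.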
For Test Methodology 2 using Script A, every reboot failed with this error:
ERROR solana_ledger::bank_forks_utils] Failed to load bank: AccountsFile error: AppendVecError: incorrect layout/length/data in the appendvec at path /mnt/solana-accounts/run/247471897.30603
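If you want to check your own validator log for the same failure, a small sketch like the following pulls the offending file path out of a line shaped like the one above. The regular expression is based only on that single log line, so treat it as an assumption about the message format rather than anything guaranteed by the validator; the function name is hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumption: the error text matches the single example line in this issue.
APPENDVEC_ERROR = re.compile(
    r"AppendVecError: .* in the appendvec at path (?P<path>\S+)"
)


def parse_appendvec_error(line: str) -> Optional[Tuple[str, str]]:
    """Return (full_path, file_name) for an AppendVecError log line, else None."""
    match = APPENDVEC_ERROR.search(line)
    if match is None:
        return None
    path = match.group("path")
    # The file name is the last path component.
    file_name = path.rsplit("/", 1)[-1]
    return path, file_name
```

Running it over each log line and collecting the non-None results shows which accounts files the startup choked on.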
Proposed Solution
On the advice of Brooks in the Discord, this issue is opened to address the Script A failures under Test Methodology 2.