sudo reboot mb validator recovery fails #35190

Closed
john-smith-solana opened this issue Feb 13, 2024 · 13 comments · Fixed by #35350
Assignees: brooksprumo
Labels: community (Community contribution)

Comments

@john-smith-solana

john-smith-solana commented Feb 13, 2024

Problem

Version 1.17.20, tested 10-12 Feb 2024.

I was curious about an mb validator's recovery ability, so I used a spare non-voting mb validator to see if it could recover from an abrupt sudo reboot.

I ran with default validator settings, so incremental snapshots are taken every minute and full snapshots every 3 hours.

I tried 2 different validator startup scripts (sketched below):
Script A: included --use-snapshot-archives-at-startup when-newest
Script B: the same script with that flag removed
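
A minimal sketch of the two scripts, assuming an otherwise typical non-voting setup (keypair path and the remaining flags are placeholders, not my exact command line):

```bash
#!/usr/bin/env bash
# Script A; Script B is identical minus the --use-snapshot-archives-at-startup line
exec solana-validator \
    --identity /home/sol/validator-keypair.json \
    --no-voting \
    --ledger /mnt/solana-ledger \
    --accounts /mnt/solana-accounts \
    --log - \
    --use-snapshot-archives-at-startup when-newest
```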


Test Methodology 1:

sudo reboot when incremental snapshots are available and less than 1 minute old:

Script A: 3 reboots, 3 successful recoveries each in approx 13 mins
Script B: 3 reboots, 3 successful recoveries each in approx 15 mins

So far so good!


Test Methodology 2:

However, approximately 10 mins before the 3-hour full snapshot is due, the validator stops creating minute-by-minute incrementals and only works on the next full snapshot. This means the last incremental can get up to ~15 mins old. [At least, this is my interpretation of what it looks like it's doing!]
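
A rough way to see this window is to watch the snapshot archive timestamps, assuming the default archive location under the ledger directory and the standard archive naming:

```bash
# newest full and incremental snapshot archives, most recent first
ls -lt /mnt/solana-ledger/snapshot-*.tar.zst | head -n 1
ls -lt /mnt/solana-ledger/incremental-snapshot-*.tar.zst | head -n 1
```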

sudo reboot at various times with an old/aging incremental during full snapshot creation (for clarity this is approx a 15 minute window every 3 hours):

Script A:
Incremental 8 mins old: Failed
Incremental 6 mins old: Failed
Incremental 5 mins old: Failed

Script B:
Incremental 7 mins old: Success, took 19 mins
Incremental 9 mins old: Success, took 20 mins
Incremental 13 mins old: Success, took 23 mins

For Test Methodology 2 using Script A, each attempt failed with:

ERROR solana_ledger::bank_forks_utils] Failed to load bank: AccountsFile error: AppendVecError: incorrect layout/length/data in the appendvec at path /mnt/solana-accounts/run/247471897.30603

Proposed Solution

On the advice of Brooks on Discord, this issue is opened to address the Script A / Test Methodology 2 failures.

@john-smith-solana john-smith-solana added the community Community contribution label Feb 13, 2024
@brooksprumo brooksprumo self-assigned this Feb 14, 2024
@brooksprumo
Contributor

I'll take a look. Thanks for filing this issue.

@segfaultdoc
Contributor

Had a similar issue when upgrading from 1.17.20 to 1.17.22. When including --use-snapshot-archives-at-startup when-newest the node would keep crashing with:

Failed to load bank: AccountsFile error: AppendVecError: incorrect layout/length/data in the appendvec at path...

Removing the arg fixed the issue

@steviez
Contributor

steviez commented Feb 17, 2024

> Had a similar issue when upgrading from 1.17.20 to 1.17.22. When including --use-snapshot-archives-at-startup when-newest the node would keep crashing with:
>
> Failed to load bank: AccountsFile error: AppendVecError: incorrect layout/length/data in the appendvec at path...
>
> Removing the arg fixed the issue

@segfaultdoc - Out of curiosity, how was your node stopped / restarted? Similar to what the author of this issue is doing (maybe systemctl stop sol), or are you using something like solana-validator exit?
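
For clarity, the variants I have in mind look roughly like this (ledger path is only an example):

```bash
# abrupt, as in this issue
sudo reboot

# stopping the systemd service
sudo systemctl stop sol

# asking the validator to exit via its admin interface
solana-validator --ledger /mnt/solana-ledger exit
```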

@brooksprumo
Contributor

So far I have been unable to reproduce a failure with fastboot. Here are the experiments I've performed; for all of them I specified --use-snapshot-archives-at-startup when-newest on the CLI.

Note that the terminology may not make sense, as this is copy-pasted from my own internal notes.

| # | initial version | restart version | restart method | result |
| --- | --- | --- | --- | --- |
| 1 | v1.17.23 | v1.17.23 | ./restart with bank snapshot POST | OK |
| 2 | v1.17.23 | v1.17.23 | ./restart with bank snapshot PRE | OK, uses next POST correctly |
| 3 | v1.17.23 | v1.17.23 | ./stop then ./restart withOUT a bank snapshot, just account hard links | OK, uses next POST correctly |
| 4 | v1.17.23 | v1.18.3 | ./stop then ./restart withOUT a bank snapshot, just account hard links | OK, uses next POST correctly |
| 5 | v1.18.3 | v1.17.22 | ./restart with bank snapshot POST | OK |
| 6 | v1.17.22 | v1.17.22 | kill -9 then ./restart | OK |
| 7 | v1.17.22 | v1.18.2 | kill -9 then ./restart | OK |

• 1 through 5 are all graceful shutdowns, whereas 6 and 7 are not
• 1-3 and 6 are all the same minor version, whereas 4 & 7 are upgrades, and 5 is a downgrade

I'm not sure what else to try at the moment. Are there other permutations I've missed?

@segfaultdoc
Contributor

> > Had a similar issue when upgrading from 1.17.20 to 1.17.22. When including --use-snapshot-archives-at-startup when-newest the node would keep crashing with:
> >
> > Failed to load bank: AccountsFile error: AppendVecError: incorrect layout/length/data in the appendvec at path...
> >
> > Removing the arg fixed the issue
>
> @segfaultdoc - Out of curiosity, how was your node stopped / restarted? Similar to what the author of this issue is doing (maybe systemctl stop sol), or are you using something like solana-validator exit?

Similar to the author, restarted the systemd service. I'm starting to think maybe we're not exiting cleanly in jito-solana. Do you know if a panic in some thread on exit would cause this?

@john-smith-solana
Author

sudo reboot

See attached screenshot.

This shows Script B, Test Methodology 2. ls -lh was run seconds before the sudo reboot, so it's an accurate depiction of the ledger directory as the reboot was executed. The full snapshot had reached 28G out of ~58G and the last incremental was 7 minutes old.

This Script B recovered successfully, in approx 19 minutes.

However, when --use-snapshot-archives-at-startup when-newest was added (Script A) it would fail to recover.

@brooksprumo
Contributor

> Do you know if a panic in some thread on exit would cause this?

This is what I was trying to reproduce by randomly killing the validator process in 6 & 7. It's also possible I just didn't hit the issue; two runs is not a lot.

> See attached screenshot.

The --use-snapshot-archives-at-startup when-newest cli arg does not use the snapshot archives, so in theory this should not impact anything. I'll try it out though.

If you happen to have the contents of what's in your /mnt/solana-ledger/snapshot directory, that would be interesting. I would expect it to have a directory with a number higher than 24570877.
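
Something like the following would show it, assuming the default bank-snapshots location under the ledger directory:

```bash
# each entry is a bank snapshot directory named by slot (e.g. <slot>/<slot>)
ls -l /mnt/solana-ledger/snapshot/
```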

@john-smith-solana
Author

That was a screenshot I took at the time (12 Feb, 1.17.20). Afraid I no longer have the server.

@michaelh-laine
Contributor

michaelh-laine commented Feb 24, 2024

After just hitting a disk-space issue on my snapshot dir, my validator crashed with this:

thread 'solSnapshotPkgr' panicked at core/src/snapshot_packager_service.rs:81:26:
failed to archive snapshot package: Io(Custom { kind: StorageFull, error: Error { kind: Write, source: Os { code: 28, kind: StorageFull, message: "No space left on device" }, path: "/mnt/snapshots/tmp-snapshot-archive-250193679.tar.zst" } })

Which led to a boot crash loop with this being the only error there:

[2024-02-24T17:02:06.317297546Z ERROR solana_ledger::bank_forks_utils] Failed to load bank: AccountsFile error: AppendVecError: incorrect layout/length/data in the appendvec at path /mnt/validator/accounts/run/250194513.13821324
    snapshot: /mnt/snapshots/snapshot/250196251/250196251

Further context on Discord: https://discord.com/channels/428295358100013066/689412830075551748/1210995187950686288

@brooksprumo
Contributor

Ok, I've found the (an?) problem. The PR to fix it is here: #35350

@brooksprumo
Contributor

Found the other problem. Here's a GH Issue for it: #35367.

@john-smith-solana
Author

That’s great news Brooks - thanks

@brooksprumo
Contributor

#35350 has been merged, so the recovery aspect of this failure is now fixed. Other PRs will fix the underlying issues.
