sudo reboot mb validator recovery fails #35190

Closed
john-smith-solana opened this issue Feb 13, 2024 · 13 comments · Fixed by #35350
Assignees: brooksprumo
Labels: community (Community contribution)

Comments

@john-smith-solana

john-smith-solana commented Feb 13, 2024

Problem

Version 1.17.20, tested 10-12 Feb 2024.

I was curious about an mb validator's recovery ability, so I used a spare non-voting mb validator to see if it could recover from an abrupt sudo reboot.

I ran with default validator settings, so incremental snapshots are taken every minute and full snapshots every 3 hours.

I tried 2 different validator startup scripts (sketched below):
Script A: included --use-snapshot-archives-at-startup when-newest
Script B: the same script with that flag removed
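
A minimal sketch of the two scripts, assuming an otherwise typical non-voting setup (keypair path and the remaining flags are placeholders, not my exact command line):

```bash
#!/usr/bin/env bash
# Script A; Script B is identical minus the --use-snapshot-archives-at-startup line
exec solana-validator \
    --identity /home/sol/validator-keypair.json \
    --no-voting \
    --ledger /mnt/solana-ledger \
    --accounts /mnt/solana-accounts \
    --log - \
    --use-snapshot-archives-at-startup when-newest
```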


Test Methodology 1:

sudo reboot when incremental snapshots are available and less than 1 minute old:

Script A: 3 reboots, 3 successful recoveries each in approx 13 mins
Script B: 3 reboots, 3 successful recoveries each in approx 15 mins

So far so good!


Test Methodology 2:

However, approximately 10 mins before the 3-hour full snapshot is due, the validator stops creating minute-by-minute incrementals and only works on the next full snapshot. This means the last incremental can get up to ~15 mins old. [At least, this is my interpretation of what it looks like it's doing!]
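
A rough way to see this window is to watch the snapshot archive timestamps, assuming the default archive location under the ledger directory and the standard archive naming:

```bash
# newest full and incremental snapshot archives, most recent first
ls -lt /mnt/solana-ledger/snapshot-*.tar.zst | head -n 1
ls -lt /mnt/solana-ledger/incremental-snapshot-*.tar.zst | head -n 1
```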

sudo reboot at various times with an old/aging incremental during full snapshot creation (for clarity this is approx a 15 minute window every 3 hours):

Script A:
Incremental 8 mins old: Failed
Incremental 6 mins old: Failed
Incremental 5 mins old: Failed

Script B:
Incremental 7 mins old: Success, took 19 mins
Incremental 9 mins old: Success, took 20 mins
Incremental 13 mins old: Success, took 23 mins

For Test Methodology 2 using Script A, each attempt failed with:

ERROR solana_ledger::bank_forks_utils] Failed to load bank: AccountsFile error: AppendVecError: incorrect layout/length/data in the appendvec at path /mnt/solana-accounts/run/247471897.30603

Proposed Solution

On the advice of Brooks on Discord, this issue is opened to address the Script A / Test Methodology 2 failures.

@john-smith-solana john-smith-solana added the community Community contribution label Feb 13, 2024
@brooksprumo brooksprumo self-assigned this Feb 14, 2024
@brooksprumo
Contributor

I'll take a look. Thanks for filing this issue.

@segfaultdoc
Contributor

Had a similar issue when upgrading from 1.17.20 to 1.17.22. When including --use-snapshot-archives-at-startup when-newest the node would keep crashing with:

Failed to load bank: AccountsFile error: AppendVecError: incorrect layout/length/data in the appendvec at path...

Removing the arg fixed the issue

@steviez
Contributor

steviez commented Feb 17, 2024

> Had a similar issue when upgrading from 1.17.20 to 1.17.22. When including --use-snapshot-archives-at-startup when-newest the node would keep crashing with:
>
> Failed to load bank: AccountsFile error: AppendVecError: incorrect layout/length/data in the appendvec at path...
>
> Removing the arg fixed the issue

@segfaultdoc - Out of curiosity, how was your node stopped / restarted? Similar to what the author of this issue is doing (maybe systemctl stop sol), or are you using something like solana-validator exit?
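
For clarity, the variants I have in mind look roughly like this (ledger path is only an example):

```bash
# abrupt, as in this issue
sudo reboot

# stopping the systemd service
sudo systemctl stop sol

# asking the validator to exit via its admin interface
solana-validator --ledger /mnt/solana-ledger exit
```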

@brooksprumo
Contributor

So far I have been unable to reproduce a failure with fastboot. Here are the experiments I've performed; for all of them I specified --use-snapshot-archives-at-startup when-newest on the CLI.

Note that the terminology may not make sense, as this is copy-pasted from my own internal notes.

| # | initial version | restart version | restart method | result |
| --- | --- | --- | --- | --- |
| 1 | v1.17.23 | v1.17.23 | ./restart with bank snapshot POST | OK |
| 2 | v1.17.23 | v1.17.23 | ./restart with bank snapshot PRE | OK, uses next POST correctly |
| 3 | v1.17.23 | v1.17.23 | ./stop then ./restart withOUT a bank snapshot, just account hard links | OK, uses next POST correctly |
| 4 | v1.17.23 | v1.18.3 | ./stop then ./restart withOUT a bank snapshot, just account hard links | OK, uses next POST correctly |
| 5 | v1.18.3 | v1.17.22 | ./restart with bank snapshot POST | OK |
| 6 | v1.17.22 | v1.17.22 | kill -9 then ./restart | OK |
| 7 | v1.17.22 | v1.18.2 | kill -9 then ./restart | OK |

• 1 through 5 are all graceful shutdowns, whereas 6 and 7 are not
• 1-3 and 6 are all the same minor version, whereas 4 & 7 are upgrades, and 5 is a downgrade

I'm not sure what else to try at the moment. Are there other permutations I've missed?

@segfaultdoc
Contributor

> > Had a similar issue when upgrading from 1.17.20 to 1.17.22. When including --use-snapshot-archives-at-startup when-newest the node would keep crashing with:
> >
> > Failed to load bank: AccountsFile error: AppendVecError: incorrect layout/length/data in the appendvec at path...
> >
> > Removing the arg fixed the issue
>
> @segfaultdoc - Out of curiosity, how was your node stopped / restarted? Similar to what the author of this issue is doing (maybe systemctl stop sol), or are you using something like solana-validator exit?

Similar to the author, restarted the systemd service. I'm starting to think maybe we're not exiting cleanly in jito-solana. Do you know if a panic in some thread on exit would cause this?

@john-smith-solana
Author

sudo reboot

See attached screenshot.

This shows Script B, Test Methodology 2. ls -lh was run seconds before the sudo reboot, so it's an accurate depiction of the ledger directory as the reboot was executed. The full snapshot had reached 28G out of ~58G and the last incremental was 7 minutes old.

This Script B recovered successfully, in approx 19 minutes.

However, when --use-snapshot-archives-at-startup when-newest was added (Script A) it would fail to recover.

@brooksprumo
Contributor

> Do you know if a panic in some thread on exit would cause this?

This is what I was trying to reproduce by randomly killing the validator process in 6 & 7. It's also possible I just didn't hit the issue; two runs is not a lot.

> See attached screenshot.

The --use-snapshot-archives-at-startup when-newest cli arg does not use the snapshot archives, so in theory this should not impact anything. I'll try it out though.

If you happen to have the contents of what's in your /mnt/solana-ledger/snapshot directory, that would be interesting. I would expect it to have a directory with a number higher than 24570877.
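
Something like the following would show it, assuming the default bank-snapshots location under the ledger directory:

```bash
# each entry is a bank snapshot directory named by slot (e.g. <slot>/<slot>)
ls -l /mnt/solana-ledger/snapshot/
```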

@john-smith-solana
Author

That was a screenshot I took at the time (12 Feb, 1.17.20). Afraid I no longer have the server.

@michaelh-laine
Contributor

michaelh-laine commented Feb 24, 2024

After just hitting a disk-space issue on my snapshot dir, my validator crashed with this:

thread 'solSnapshotPkgr' panicked at core/src/snapshot_packager_service.rs:81:26:
failed to archive snapshot package: Io(Custom { kind: StorageFull, error: Error { kind: Write, source: Os { code: 28, kind: StorageFull, message: "No space left on device" }, path: "/mnt/snapshots/tmp-snapshot-archive-250193679.tar.zst" } })

Which led to a boot crash loop with this being the only error there:

[2024-02-24T17:02:06.317297546Z ERROR solana_ledger::bank_forks_utils] Failed to load bank: AccountsFile error: AppendVecError: incorrect layout/length/data in the appendvec at path /mnt/validator/accounts/run/250194513.13821324
    snapshot: /mnt/snapshots/snapshot/250196251/250196251

Further context on Discord: https://discord.com/channels/428295358100013066/689412830075551748/1210995187950686288

@brooksprumo
Contributor

Ok, I've found the (an?) problem. The PR to fix it is here: #35350

@brooksprumo
Contributor

Found the other problem. Here's a GH Issue for it: #35367.

@john-smith-solana
Author

That’s great news Brooks - thanks

@brooksprumo
Contributor

#35350 has been merged, so the recovery aspect of this failure is now fixed. Other PRs will fix the underlying issues.
