Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add checks when constructing a BankSnapshotInfo from a directory #30373

Merged
merged 12 commits into from
Feb 24, 2023

Conversation

xiangzhu70
Copy link
Contributor

@xiangzhu70 xiangzhu70 commented Feb 16, 2023

Problem

Before constructing a bank from a snapshot, we need to read the snapshot directories and find the highest correct one with the necessary meta info files. Sometimes a directory may have partial information, not fully ready to be used for bank construction. Only the ones passing the checks will be selected.

Summary of Changes

Add snapshot version into BankSnapshotInfo, fill that from the version file;
Add checks to ensure only a correct directory is used to generate a BankSnapshotInfo.
A function to find the one with the highest slot, which is a wrapper of get_bank_snapshots and selecting the highest one.
Add the test function test_get_highest_bank_snapshot. It removes the meta files in the directories, and check if it gets the highest correct snapshot.

This is one of the split PRs of #28745. It precedes #30171

Fixes #

@xiangzhu70 xiangzhu70 marked this pull request as ready for review February 17, 2023 13:28
@brooksprumo
Copy link
Contributor

For fastboot, we'll always need the complete accounts state in order to actually boot from that bank snapshot later. I'm not sure I understand the problem that's being solved here; get_highest_bank_snapshot() already gets us this information.

@xiangzhu70
Copy link
Contributor Author

For fastboot, we'll always need the complete accounts state in order to actually boot from that bank snapshot later. I'm not sure I understand the problem that's being solved here; get_highest_bank_snapshot() already gets us this information.

In #30171 I follow the archiving path to get a full snapshot and an incremental snapshot. This is part of that PR. I thought it might be better that I split these parts into a small PR.

@@ -162,31 +164,52 @@ impl BankSnapshotInfo {
pub fn new_from_dir(
bank_snapshots_dir: impl AsRef<Path>,
slot: Slot,
) -> Option<BankSnapshotInfo> {
) -> Result<BankSnapshotInfo> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this being a Result much more than an Option!

snapshot_dir: bank_snapshot_dir,
});
let status_cache_file = bank_snapshot_dir.join(SNAPSHOT_STATUS_CACHE_FILENAME);
if fs::metadata(status_cache_file).is_err() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be more explicit with our check here? The status_cache_file should specifically be a file, not just exist (potentially as a dir), right?

Suggested change
if fs::metadata(status_cache_file).is_err() {
if !status_cache_file.is_file() {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

});
let status_cache_file = bank_snapshot_dir.join(SNAPSHOT_STATUS_CACHE_FILENAME);
if fs::metadata(status_cache_file).is_err() {
return Err(SnapshotError::MissingStatusCacheFile(bank_snapshot_dir));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error should be the status cache file path I think:

Suggested change
return Err(SnapshotError::MissingStatusCacheFile(bank_snapshot_dir));
return Err(SnapshotError::MissingStatusCacheFile(status_cache_file));

Copy link
Contributor Author

@xiangzhu70 xiangzhu70 Feb 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment on lines 182 to 184
let version_str = snapshot_version_from_file(version_path)?;
let snapshot_version = SnapshotVersion::from_str(version_str.as_str())
.or(Err(SnapshotError::InvalidVersion))?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of scope:

Seems like these are reasonable to replace the snapshot_version_from_file. Do we ever actually need the raw string except for (de)serialization?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we could combine the actions to a function SnapshotVesion::from_file()

// There is a time window from the slot directory being created, and the content being completely
// filled. Check the completion to avoid using a highest found slot directory with missing content.
let completion_flag_file = bank_snapshot_dir.join(SNAPSHOT_STATE_COMPLETE_FILENAME);
if fs::metadata(completion_flag_file).is_err() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly here, I think we should just be as explicit as possible that completion_flag_file is a file:

Suggested change
if fs::metadata(completion_flag_file).is_err() {
if !completion_flag_file.is_file() {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

Comment on lines 197 to 205
let snapshot_type: Option<BankSnapshotType> = if bank_snapshot_pre_path.is_file() {
Some(BankSnapshotType::Pre)
} else if bank_snapshot_post_path.is_file() {
Some(BankSnapshotType::Post)
} else {
None
};
let snapshot_type = snapshot_type
.ok_or_else(|| SnapshotError::MissingSnapshotFile(bank_snapshot_dir.clone()))?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can make this a bit more direct:

Suggested change
let snapshot_type: Option<BankSnapshotType> = if bank_snapshot_pre_path.is_file() {
Some(BankSnapshotType::Pre)
} else if bank_snapshot_post_path.is_file() {
Some(BankSnapshotType::Post)
} else {
None
};
let snapshot_type = snapshot_type
.ok_or_else(|| SnapshotError::MissingSnapshotFile(bank_snapshot_dir.clone()))?;
let snapshot_type = if bank_snapshot_pre_path.is_file() {
BankSnapshotType::Pre
} else if bank_snapshot_post_path.is_file() {
BankSnapshotType::Post
} else {
return Err(SnapshotError::MissingSnapshotFile(bank_snapshot_dir));
};

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point. updated.

/// as a valid one. A dir unpacked from an archive lacks these files. Fill them here to
/// allow new_from_dir() checks to pass. These checks are not needed for unpacked dirs,
/// but it is not clean to add another flag to new_from_dir() to skip them.
fn fill_snapshot_meta_files_for_unchived_snapshot(unpack_dir: impl AsRef<Path>) -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Note you have to fix calls to this function, so I wouldn't actually commit this suggestion.)

Suggested change
fn fill_snapshot_meta_files_for_unchived_snapshot(unpack_dir: impl AsRef<Path>) -> Result<()> {
fn fill_snapshot_meta_files_for_unarchived_snapshot(unpack_dir: impl AsRef<Path>) -> Result<()> {

Copy link
Contributor Author

@xiangzhu70 xiangzhu70 Feb 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@xiangzhu70
Copy link
Contributor Author

For fastboot, we'll always need the complete accounts state in order to actually boot from that bank snapshot later. I'm not sure I understand the problem that's being solved here; get_highest_bank_snapshot() already gets us this information.

In #30171 I follow the archiving path to get a full snapshot and an incremental snapshot. This is part of that PR. I thought it might be better that I split these parts into a small PR.

Discussed with @brooksprumo offline. He said " An incremental snapshot is only possible as an archive. As a bank snapshot, I don't see this being a good idea because then this bank snapshot could not be used for fastboot.". I'd like to double check with @jeffwashington and @apfitzge, because I recall some mentioning of an incremental snapshot before.

If we all agreed that a bank snapshot directory is always a full snapshot, then I will remove all the incremental snapshot directory logic from #30171 and this PR.

@brooksprumo
Copy link
Contributor

"Full" and "incremental" are currently only terms in the context of snapshot archives.
Bank snapshots used to be the full state minus the account storage files.
For fastboot, the storages are now hardlinked into the bank snapshot directory.

Without all the storages in the bank snapshot directory, it would not be usable for fastboot by itself.

If a bank snapshot only had a subset of the storages (i.e. an incremental bank snapshot), where would the remaining storages come from? I think this would require pointing back to an older bank snapshot that had all the storages (i.e. a full bank snapshot). To support this model, we'd have to hold onto all those storages too. So worst case is a full duplication of all storages (100+ GB). It's probably not going to be that bad, but around 25,000 slots will likely touch lots of storages, and it would be important to see how much extra disk space would be required to support this. I don't think we should include that work in a v0 for getting fastboot working.

@xiangzhu70 xiangzhu70 changed the title Find the highest snapshot from the directories Add checks when constructing a BankSnapshotInfo from a directory Feb 17, 2023
@brooksprumo
Copy link
Contributor

@xiangzhu70 Can you update the PR description to describe the problem this PR is solving?

@xiangzhu70
Copy link
Contributor Author

@xiangzhu70 Can you update the PR description to describe the problem this PR is solving?

Updated the problem section. Please check if it is OK now. Thanks! @brooksprumo

Copy link
Contributor

@brooksprumo brooksprumo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this PR is doing three distinct things, and I think should be split up:

  1. Adding a new function to get the highest bank snapshot, regardless of Pre/Post

We can keep this new function, as it communicates intent correctly. One interesting note is that with the current implementation, all bank snapshots on the snapshots dir are Pre. The only time there is a Post is within the tempdir created by AccountsPackage. So get_highest_bank_snapshot_pre() will return what is needed in this case.

  1. Dealing with an issue for snapshot archives where some of these new files are missing but need to be there for BankSnapshotInfo::new_from_dir()

After we've unarchived a snapshot archive, we don't have a bank snapshot in the snapshots directory, so I don't understand why this is now required. Did that behavior change?

  1. Parsing a bank snapshot and extracting its BankSnapshotInfo

I like these changes! I think to make this cleaner, the new errors should go under their own new error enum type, and then a single new error goes in SnapshotError. Refer to VerifySlotDeltasError for an example of what I'm describing.

@xiangzhu70
Copy link
Contributor Author

xiangzhu70 commented Feb 17, 2023

After we've unarchived a snapshot archive, we don't have a bank snapshot in the snapshots directory, so I don't understand why this is now required. Did that behavior change?

Because the snapshot from dir case and the snapshot from archive share the get_bank_snapshots() and new_from_dir() functions. To let a BankSnapshotInfo be returned after the directory checks in new_from_dir, I have to make the unarchived snapshot directories contain the meta files, even they are only used in snapshot_from_dir case.

@xiangzhu70
Copy link
Contributor Author

xiangzhu70 commented Feb 17, 2023

I think should be split up

How should it be split up? Among the 3 issues, #2 is really just a result of checks introduced in #3, so they should go together. #1 is just a wrapper function which not much content. I can either remove it (just use get_highest_bank_snapshot_pre instead, although it carries implementation assumption) or move it back to PR #30171. Which way do you prefer? But the test function tests if it can find the highest valid one passing all the checks, so it is also a result of #3. In this regard it looks better to stay instead of being split out.

@xiangzhu70
Copy link
Contributor Author

I think to make this cleaner, the new errors should go under their own new error enum type,

It is now done. git pushed.

runtime/src/snapshot_utils.rs Show resolved Hide resolved
runtime/src/snapshot_utils.rs Show resolved Hide resolved
runtime/src/snapshot_utils.rs Outdated Show resolved Hide resolved
runtime/src/snapshot_utils.rs Outdated Show resolved Hide resolved
runtime/src/snapshot_utils.rs Show resolved Hide resolved
runtime/src/snapshot_utils.rs Outdated Show resolved Hide resolved
runtime/src/snapshot_utils.rs Outdated Show resolved Hide resolved
@brooksprumo
Copy link
Contributor

After we've unarchived a snapshot archive, we don't have a bank snapshot in the snapshots directory, so I don't understand why this is now required. Did that behavior change?

Because the snapshot from dir case and the snapshot from archive share the get_bank_snapshots() and new_from_dir() functions. To let a BankSnapshotInfo be returned after the directory checks in new_from_dir, I have to make the unarchived snapshot directories contain the meta files, even they are only used in snapshot_from_dir case.

Can you link me to the code you're referring to, please? I'm not finding other code that calls get_bank_snapshots() other than in snapshot_utils.rs or tests.

@xiangzhu70
Copy link
Contributor Author

xiangzhu70 commented Feb 21, 2023

Can you link me to the code you're referring to, please? I'm not finding other code that calls get_bank_snapshots() other than in snapshot_utils.rs or tests.

test_bank_fields_from_snapshot
I debugged that failure, added the function fill_snapshot_meta_files_for_unarchived_snapshot so that this test can pass. You can break at get_bank_snapshots() when running this test to see how it gets called. There it returns no snapshot because the checking for meta info failed.

@brooksprumo
Copy link
Contributor

Can you link me to the code you're referring to, please? I'm not finding other code that calls get_bank_snapshots() other than in snapshot_utils.rs or tests.

test_bank_fields_from_snapshot I debugged that failure, added the function fill_snapshot_meta_files_for_unarchived_snapshot so that this test can pass. You can break at get_bank_snapshots() when running this test to see how it gets called. There it returns no snapshot because the checking for meta info failed.

Is it correct to say that calling fill_snapshot_meta_files_for_unarchived_snapshot() in unarchive_snapshot() was added to fix one test? If yes, I think a better solution is to fix the test. Since a snapshot archive is different from a bank snapshot, it seems incorrect to call BankSnapshotInfo::new_from_dir() on it. IOW, after inflating a snapshot archive there is not a bank snapshot in its own directory in the same way that a bank snapshot exists pre archival.

@xiangzhu70
Copy link
Contributor Author

xiangzhu70 commented Feb 22, 2023

since a snapshot archive is different from a bank snapshot, it seems incorrect to call BankSnapshotInfo::new_from_dir() on it.

But it is the existing code on archive test which calls get_bank_snapshots to get back Vec<BankSnapshotInfo>. So, the code is passing BankSnapshotInfo objects in the unarchiving operation. It is just the shared operation to get back snapshot info from a directory, regardless of whether the directory is the operation snapshot directory, or an unpacked directory from an archive. It looks fine to me. Could you clarify how you want it to be fixed?

@xiangzhu70
Copy link
Contributor Author

Is it correct to say that calling fill_snapshot_meta_files_for_unarchived_snapshot() in unarchive_snapshot() was added to fix one test?

Yes. it fixes test_bank_fields_from_snapshot

@brooksprumo
Copy link
Contributor

I pulled down this branch to understand what is happening. Here's one of the call stacks that fails:

bank_from_snapshot_archives
rebuild_bank_from_snapshots
verify_unpacked_snapshots_dir_and_version
get_bank_snapshots_post
get_bank_snapshots
BankSnapshotInfo::new_from_dir

Since bank_from_snapshot_archives (or bank_from_latest_snapshot_archives) is called in multiple tests, it's not just the one test_bank_fields_from_snapshot that's the issue, but many.

Inside rebuild_bank_from_snapshots(), we need the path to the inflated bank snapshot file, which is used to reconstruct the Bank and AccountsDb. The other fields within BankSnapshotInfo are not needed.

So we're kind of repurposing the get_bank_snapshots() functions just to get the path to the snapshot file in the archive, which is different from the original intent of finding the bank snapshots within the snapshots/ directory.

This means we have two use cases for the bank snapshots:

  1. The existing usage to get the path to the bank snapshot file within an inflated archive
  2. The new/upcoming usage to get the bank snapshot directories, which will be used for fastboot

Since we have these two use cases, it makes sense to me to have two different functions for getting the bank snapshot within their respective contexts, returning each's pertinent information. If fill_snapshot_meta_files_for_unarchived_snapshot was only needed in tests, that would be more OK, but since we're using it to patch up using an existing function in a different context, I'd like to find a better solution.

@xiangzhu70
Copy link
Contributor Author

xiangzhu70 commented Feb 22, 2023

it makes sense to me to have two different functions for getting the bank snapshot within their respective contexts, returning each's pertinent information.

I think it depends on the perception. There are two possible perceptions.

  1. There are two kinds of snapshots, one in the operation directory, one from archive, then there are two different contexts asking for two different functions.

  2. An archive contains a snapshot, a snapshot unpacked from an archive is still a snapshot in the same form as one from the operation directory. Then the same function applies to that snapshot regardless of its source.

If we follow the 1st perception, the function in the unarchiving context will need to avoid passing BankSnapshotInfo (since that struct is associated with only a snapshot in operation), but the logic about finding the directory and then the file in it would be duplicated.

I think the 2nd perception is fine in theory, because we can think a snapshot (a full state in directory) goes in and out of an archive. But the caveat here is that we don't want to change the existing archive format for compatibility reason, so we are not really adding the full state meta into the archive, but instead adding the meta files back at the time of unpacking to make it a full state snapshot dir.

I think the 2nd perception is better, despite the caveat.

Anyway, in the long term, I believe the clean way would be to split the bank-from-archive flow into two separate pipeline stages:

  1. unpack an archive to a bank snapshot directory,
  2. construct a bank from the snapshot directory.

At stage 1, just unpack the archive to the files. Information goes between archive and files. There is no any memory data structure (BankSnapshotInfo etc) involved.
At stage 2, deserialize file contents into the memory data structures. Information goes between memory data structures and files. There is no any unarchiving logic involved.

In that way, a bank will always come immediately from a snapshot dir, never from an archive. There will be clean logic separation. The code will be much simpler. If we go down that path, the code of constructing a bank directly from an archive will be obsolete, so the discussion above about how to perceive the logic in it will not matter.

@brooksprumo
Copy link
Contributor

Anyway, in the long term, I believe the clean way would be to split the bank-from-archive flow into two separate pipeline stages:

  1. unpack an archive to a bank snapshot directory,
  2. construct a bank from the snapshot directory.

At stage 1, just unpack the archive to the files. There is no any memory data structure (BankSnapshotInfo etc) involved. At stage 2, deserialize file contents into the memory data structures. There is no any unarchiving logic involved.

In that way, a bank will always come immediately from a snapshot dir, never from an archive. There will be clean logic separation. The code will be much simpler. If we go down that path, the code of constructing a bank directly from an archive will be obsolete, so the discussion above about how to perceive the logic in it will not matter.

I like this plan! And then we no longer will need a distinction between bank snapshot and snapshot archive; it becomes just snapshot and archive.

Ok, I'm on board with option 2 now. I'll re-review with this plan in mind.

Copy link
Contributor

@brooksprumo brooksprumo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good!

runtime/src/snapshot_utils.rs Outdated Show resolved Hide resolved
runtime/src/snapshot_utils.rs Outdated Show resolved Hide resolved
runtime/src/snapshot_utils.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@brooksprumo brooksprumo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@xiangzhu70 xiangzhu70 merged commit 510be7b into solana-labs:master Feb 24, 2023
nickfrosty pushed a commit to nickfrosty/solana that referenced this pull request Mar 12, 2023
…ana-labs#30373)

* Read snapshot directories and find the highest

* format check

* Fix test_get_bank_snapshots

* fix test_bank_fields_from_snapshot

* review changes, is_file etc

* removed incremental snapshot

* SnapshotNewFromDirError

* nit and comments issues

* NewFromDir(#[from] SnapshotNewFromDirError)

* change fill to create

* replace unwrap with map_err and ok_or_else

* Remove BankForks, fix the bank loop
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants