Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Snapshot: Survive evolving of a rolled back VM, and minor refactorings. #3290

Merged
merged 5 commits into from
Jun 20, 2023

Conversation

OhmSpectator
Copy link
Member

This PR includes a series of changes to improve the robustness and modularity of the snapshot rollback process in the pillar service. It addresses issues with pre-existing Volume Status and introduces a more flexible way of restoring snapshot data structures. It also ensures alignment with Go's newer file-writing idioms and deals with excess snapshots during an application restart.

Details of the changes are as follows:

Preventing fatal errors for pre-existing Volume Status
An enhancement to the snapshot rollback process has been made. In scenarios where the controller dispatches a new config with a rolled-back volume, the pre-existence of a volume status no longer triggers a fatal error. Instead, it triggers a warning, resulting in an early return from the function.

Passing SnapshotID directly to deserializeConfigFromSnapshot
The function deserializeConfigFromSnapshot has been modified to directly accept a SnapshotID string as an argument instead of the whole SnapshotInstanceStatus object. This allows the function to be used in scenarios where the full Status object is unavailable, but the SnapshotID is known.

Replacing ioutil.WriteFile with os.WriteFile
The deprecated ioutil.WriteFile function calls have been replaced with os.WriteFile in two functions (serializeVolumeRefStatusToSnapshot and serializeConfigToSnapshot). This aligns the code with Go's newer file writing idioms and removes the unnecessary import of the ioutil package.

Handling excess snapshots during application restart
The system now checks for snapshots marked for deletion before creating a new one and triggers their deletion. This keeps the snapshot list from exceeding the set limit and ensures accurate snapshot data is reported to the cloud.

Enhancing modularity of snapshot data structure restoration
Changes have been made to facilitate more modular handling of snapshots and to prepare for future additions of restore functions. This includes introducing a new function, adding addFixupsIntoSnappedConfig, and refactoring the triggerRollback function.

@OhmSpectator OhmSpectator force-pushed the feature/snap-refactorings branch from 962fb1e to 6341af2 Compare June 19, 2023 11:57
@OhmSpectator
Copy link
Member Author

Commit message typo fixed

@OhmSpectator
Copy link
Member Author

@milan-zededa, @eriknordmark, take a look at the PR, please.
It would be nice to merge it before we provide a proper fix for the reboot-after-rollback. An appropriate solution will require rework on both Cloud and EVE sides and will take some time.
Meanwhile, with this PR and the subsequent PR for Cloud, at least the main scenario (snapshot - rollback - delete) will work and can be demonstrated.

Copy link
Contributor

@eriknordmark eriknordmark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but some comment nits.

…oration.

This commit restructures the handling of snapshots for improved
modularity and extensibility. The key changes include:

1. Introduction of a new function, `addFixupsIntoSnappedConfig`, in
   `handlevolumemgr.go`. This function adds fixups to the snapped app
   instance config. These are elements that should be drawn from the
   current app instance config, rather than the snapshot.

2. Refactoring of the `triggerRollback` function in `updatestatus.go`.
   Restoration of `volumeRefStatuses` from the snapshot config is now
   handled by the new function `restorePresnapStatus`.

These updates have been made with a view to facilitate future additions
of restore functions. By separating the restoration of different data
structures into dedicated functions, the code is more modular, easier to
read, and can be extended more efficiently.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
This commit ensures that excess snapshots are deleted during the same
application restart when a new one is created.

Previously, EVE marked the old snapshot for deletion but kept it available
until the next application restart. This led to a scenario where the list of
available snapshots exceeded the set limit, causing confusion when reported to
the Cloud, which currently only supports a single snapshot.

Now, the system checks if there are snapshots marked for deletion before
creating a new one and triggers their deletion. This prevents the snapshot list
from exceeding the defined limit and ensures that the Cloud receives accurate
snapshot data.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
Replaced deprecated ioutil.WriteFile function calls with os.WriteFile in
serializeVolumeRefStatusToSnapshot and serializeConfigToSnapshot
functions. This change aligns with Go's newer file writing idioms and
removes the unnecessary import of the ioutil package.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
…napshot.

Updated the function deserializeConfigFromSnapshot to directly take the
SnapshotID string as an argument instead of the whole
SnapshotInstanceStatus object. This change enables the function to be
used in scenarios where the full Status object is not available, but the
SnapshotID is known. Function calls and handling of snapshot directory
were updated accordingly to align with this change.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
The Volume Status is locally reestablished during rollback. Following this, if
the controller dispatches a new config featuring the rolled-back volume, it
prompts an unclearly triggered handleDeferredVolumeCreate function. Notably,
the preexistence of volume status will no longer yield a fatal error; it
instead triggers a warning and an early function return.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
@OhmSpectator OhmSpectator force-pushed the feature/snap-refactorings branch from 6341af2 to 96c7b45 Compare June 19, 2023 14:23
@OhmSpectator
Copy link
Member Author

@eriknordmark, @milan-zededa, how should we continue with the PR?

@eriknordmark
Copy link
Contributor

The eden runs fail in log_test for this and other PRs for the last two days, so that isn't related to this PR (but we need to separately fix that or disable log_test).

The Jeninks ztest runs show some failures which I tink are spurious (but one could potentially be related since it has to do with purge with error
ztestcommon.z_utils.TimeoutException: ***ERROR: App instance purgeRefresh increased bootTime from 2023-06-20 11:30:23.529990. AppInst: sc-supermicro-zc1-2023-06-20T10-48-259122 timed out. Timeout=1200, timeElapsed: 1200.2446908950806
FailureInfo: App sc-supermicro-zc1-2023-06-20T10-48-259122 BootTime DID NOT IncreaseoldBootTime: 2023-06-20 11:30:23.529990

But I'll ignore that as well and we'll continue to test on master. Thus merging this PR.

@eriknordmark eriknordmark merged commit e60f31a into lf-edge:master Jun 20, 2023
@OhmSpectator
Copy link
Member Author

@eriknordmark, thanks! Could you please share the link to the results of ztest? Maybe I could evaluate it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants