Snapshot: Survive evolving of a rolled back VM, and minor refactorings. #3290

OhmSpectator · 2023-06-19T11:30:21Z

This PR includes a series of changes to improve the robustness and modularity of the snapshot rollback process in the pillar service. It addresses issues with pre-existing Volume Status and introduces a more flexible way of restoring snapshot data structures. It also ensures alignment with Go's newer file-writing idioms and deals with excess snapshots during an application restart.

Details of the changes are as follows:

Preventing fatal errors for pre-existing Volume Status
An enhancement to the snapshot rollback process has been made. In scenarios where the controller dispatches a new config with a rolled-back volume, the pre-existence of a volume status no longer triggers a fatal error. Instead, it triggers a warning, resulting in an early return from the function.

Passing SnapshotID directly to deserializeConfigFromSnapshot
The function deserializeConfigFromSnapshot has been modified to directly accept a SnapshotID string as an argument instead of the whole SnapshotInstanceStatus object. This allows the function to be used in scenarios where the full Status object is unavailable, but the SnapshotID is known.

Replacing ioutil.WriteFile with os.WriteFile
The deprecated ioutil.WriteFile function calls have been replaced with os.WriteFile in two functions (serializeVolumeRefStatusToSnapshot and serializeConfigToSnapshot). This aligns the code with Go's newer file writing idioms and removes the unnecessary import of the ioutil package.

Handling excess snapshots during application restart
The system now checks for snapshots marked for deletion before creating a new one and triggers their deletion. This keeps the snapshot list from exceeding the set limit and ensures accurate snapshot data is reported to the cloud.

Enhancing modularity of snapshot data structure restoration
Changes have been made to facilitate more modular handling of snapshots and to prepare for future additions of restore functions. This includes introducing a new function, adding addFixupsIntoSnappedConfig, and refactoring the triggerRollback function.

OhmSpectator · 2023-06-19T11:57:19Z

Commit message typo fixed

OhmSpectator · 2023-06-19T12:06:42Z

@milan-zededa, @eriknordmark, take a look at the PR, please.
It would be nice to merge it before we provide a proper fix for the reboot-after-rollback. An appropriate solution will require rework on both Cloud and EVE sides and will take some time.
Meanwhile, with this PR and the subsequent PR for Cloud, at least the main scenario (snapshot - rollback - delete) will work and can be demonstrated.

pkg/pillar/cmd/zedmanager/handlevolumemgr.go

eriknordmark

LGTM but some comment nits.

…oration. This commit restructures the handling of snapshots for improved modularity and extensibility. The key changes include: 1. Introduction of a new function, `addFixupsIntoSnappedConfig`, in `handlevolumemgr.go`. This function adds fixups to the snapped app instance config. These are elements that should be drawn from the current app instance config, rather than the snapshot. 2. Refactoring of the `triggerRollback` function in `updatestatus.go`. Restoration of `volumeRefStatuses` from the snapshot config is now handled by the new function `restorePresnapStatus`. These updates have been made with a view to facilitate future additions of restore functions. By separating the restoration of different data structures into dedicated functions, the code is more modular, easier to read, and can be extended more efficiently. Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>

This commit ensures that excess snapshots are deleted during the same application restart when a new one is created. Previously, EVE marked the old snapshot for deletion but kept it available until the next application restart. This led to a scenario where the list of available snapshots exceeded the set limit, causing confusion when reported to the Cloud, which currently only supports a single snapshot. Now, the system checks if there are snapshots marked for deletion before creating a new one and triggers their deletion. This prevents the snapshot list from exceeding the defined limit and ensures that the Cloud receives accurate snapshot data. Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>

Replaced deprecated ioutil.WriteFile function calls with os.WriteFile in serializeVolumeRefStatusToSnapshot and serializeConfigToSnapshot functions. This change aligns with Go's newer file writing idioms and removes the unnecessary import of the ioutil package. Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>

…napshot. Updated the function deserializeConfigFromSnapshot to directly take the SnapshotID string as an argument instead of the whole SnapshotInstanceStatus object. This change enables the function to be used in scenarios where the full Status object is not available, but the SnapshotID is known. Function calls and handling of snapshot directory were updated accordingly to align with this change. Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>

The Volume Status is locally reestablished during rollback. Following this, if the controller dispatches a new config featuring the rolled-back volume, it prompts an unclearly triggered handleDeferredVolumeCreate function. Notably, the preexistence of volume status will no longer yield a fatal error; it instead triggers a warning and an early function return. Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>

OhmSpectator · 2023-06-20T10:58:17Z

@eriknordmark, @milan-zededa, how should we continue with the PR?

eriknordmark · 2023-06-20T12:23:52Z

The eden runs fail in log_test for this and other PRs for the last two days, so that isn't related to this PR (but we need to separately fix that or disable log_test).

The Jeninks ztest runs show some failures which I tink are spurious (but one could potentially be related since it has to do with purge with error
ztestcommon.z_utils.TimeoutException: ***ERROR: App instance purgeRefresh increased bootTime from 2023-06-20 11:30:23.529990. AppInst: sc-supermicro-zc1-2023-06-20T10-48-259122 timed out. Timeout=1200, timeElapsed: 1200.2446908950806
FailureInfo: App sc-supermicro-zc1-2023-06-20T10-48-259122 BootTime DID NOT IncreaseoldBootTime: 2023-06-20 11:30:23.529990

But I'll ignore that as well and we'll continue to test on master. Thus merging this PR.

OhmSpectator · 2023-06-20T12:58:15Z

@eriknordmark, thanks! Could you please share the link to the results of ztest? Maybe I could evaluate it.

OhmSpectator requested review from eriknordmark and rvs as code owners June 19, 2023 11:30

OhmSpectator force-pushed the feature/snap-refactorings branch from 962fb1e to 6341af2 Compare June 19, 2023 11:57

eriknordmark reviewed Jun 19, 2023

View reviewed changes

pkg/pillar/cmd/zedmanager/handlevolumemgr.go Outdated Show resolved Hide resolved

eriknordmark approved these changes Jun 19, 2023

View reviewed changes

OhmSpectator added 5 commits June 19, 2023 16:22

OhmSpectator force-pushed the feature/snap-refactorings branch from 6341af2 to 96c7b45 Compare June 19, 2023 14:23

eriknordmark merged commit e60f31a into lf-edge:master Jun 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Snapshot: Survive evolving of a rolled back VM, and minor refactorings. #3290

Snapshot: Survive evolving of a rolled back VM, and minor refactorings. #3290

OhmSpectator commented Jun 19, 2023

OhmSpectator commented Jun 19, 2023

OhmSpectator commented Jun 19, 2023

eriknordmark left a comment

OhmSpectator commented Jun 20, 2023

eriknordmark commented Jun 20, 2023

OhmSpectator commented Jun 20, 2023

Snapshot: Survive evolving of a rolled back VM, and minor refactorings. #3290

Snapshot: Survive evolving of a rolled back VM, and minor refactorings. #3290

Conversation

OhmSpectator commented Jun 19, 2023

OhmSpectator commented Jun 19, 2023

OhmSpectator commented Jun 19, 2023

eriknordmark left a comment

Choose a reason for hiding this comment

OhmSpectator commented Jun 20, 2023

eriknordmark commented Jun 20, 2023

OhmSpectator commented Jun 20, 2023