Properly handle overlay syncing failures #116

Vogtinator · 2023-10-23T11:06:23Z

Previously, the code assumed that syncing always succeeds and only preserved the lowest layer of the parent snapshot. As a result, the data of the dropped layers was lost. Detect if syncing did not happen and preserve the layers.
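The decision described above can be illustrated with a minimal sketch. It is a toy model, not tukit's actual code: the function name `choose_lower_layers`, the top-to-bottom ordering of the list, and the example paths are all assumptions made for illustration.

```python
def choose_lower_layers(parent_layers, sync_happened):
    """Return the lower layers a new snapshot's /etc overlay should keep.

    parent_layers is ordered from topmost to lowest (hypothetical paths).
    """
    if not parent_layers:
        return []
    if sync_happened:
        # Assumption: a successful sync carried over the changes of the
        # upper layers, so only the lowest layer is still needed.
        return [parent_layers[-1]]
    # Sync was skipped: dropping layers would lose their data, keep them all.
    return list(parent_layers)


# Hypothetical layer stack of a parent snapshot, topmost first.
parent = ["/var/lib/overlay/7/etc", "/var/lib/overlay/6/etc", "/etc"]
print(choose_lower_layers(parent, sync_happened=True))
print(choose_lower_layers(parent, sync_happened=False))
```

The old behaviour corresponds to always taking the `sync_happened=True` branch, which is where the dropped layers went missing.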

andreygolev · 2024-04-29T18:28:47Z

Seems like we're affected by the same issue.
Our /etc is randomly lost along with OS updates, while the overlay of an older snapshot contains all the changes we made in /etc.

laenion · 2024-04-29T19:38:34Z

I'm currently working on a rework of overlay handling so that it doesn't rely on older snapshots.

However I'm wondering @andreygolev: The problem Vogtinator was fixing only occurs when the parent snapshot of the current one is deleted. This will only happen when you create multiple (by default > 5) new snapshots before a reboot, delete the previous snapshot manually or when snapper is configured to only preserve one snapshot. Is this the case in your setup?

andreygolev · 2024-04-30T06:10:24Z

According to the logs, there was just 1 reboot in 7 days on the last affected node, while transactional-update is running daily.
So it seems this is the first case.

andi0b · 2024-05-05T12:49:28Z

It looks like the users of the project kube-hetzner are strongly affected by this or a similar issue. Some nodes seem to revert back to the stock /etc after a reboot, which is a catastrophic situation, as no services start up and even the network settings are gone (node unreachable).

My working theory of how we run into this issue is roughly:

1. Set up openSUSE MicroOS with k3s and kured.
2. Keep the default daily updates with transactional-update.timer enabled.
3. Have some Kubernetes workloads running that prevent kured from rebooting the node (this might happen unnoticed, or fixing it might take some time).
4. transactional-update keeps running daily and keeps creating snapshots.
5. After a few days (probably 10-40 days), old snapshots get cleaned up.
6. The first reboot into an updated system works fine, because it's still using the /etc overlay.
7. On the second reboot into an updated system, /etc doesn't get merged and all customizations are lost (/etc is reverted back to the "stock" /etc from the MicroOS installation).

This working theory is supported by finding messages like "Parent snapshot 3 does not exist any more - skipping rsync" in the logs.
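For anyone who wants to check their own nodes for this message, here is a small sketch that scans log lines for it. The message text is taken from above; the function name `find_skipped_syncs` and the sample lines (including their prefix) are made up for illustration, and where the log lines come from is left to the reader.

```python
import re

# Message text as quoted in this thread; the snapshot number varies.
SKIP_PATTERN = re.compile(
    r"Parent snapshot (\d+) does not exist any more - skipping rsync"
)


def find_skipped_syncs(lines):
    """Return the parent snapshot numbers for which the sync was skipped."""
    return [
        int(match.group(1))
        for line in lines
        if (match := SKIP_PATTERN.search(line))
    ]


# Made-up sample lines; real log lines will have a different prefix.
sample = [
    "example-prefix: Parent snapshot 3 does not exist any more - skipping rsync",
    "example-prefix: some unrelated line",
]
print(find_skipped_syncs(sample))
```

A non-empty result suggests the node has hit the situation described in the steps above at least once.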

We know that it is recommended to reboot as soon as possible after running transactional-update, but the reality is that the reboot does not always happen in a timely manner.

There is a longer discussion here: kube-hetzner/terraform-hcloud-kube-hetzner#1287

There are a few more issues reported by users, but mostly they are unsolved, because people might just recreate the nodes, switch to another project, or give up, instead of investigating it thoroughly.

mysticaltech · 2024-05-05T13:39:53Z

@Vogtinator @sysrich That will be a life saver for our project https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner; we are losing nodes because of that issue.

laenion · 2024-05-06T09:26:32Z

Thanks a lot to all involved (and especially @Vogtinator for the patch and @andi0b for detailed breakdown here and in kube-hetzner/terraform-hcloud-kube-hetzner#1287 (reply in thread)). It seems this problem affects several people, so I won't wait for the reworked overlay handling, but apply the pull request immediately.

andi0b · 2024-05-06T11:06:55Z

@laenion Thanks! I just want to add that I didn't test this PR at all, or test if this fixes our issue. I stumbled upon it and wanted to highlight the severity of the issue.

laenion · 2024-05-06T11:09:08Z

No worries: I tested it and also verified that it actually solves the problem ;-)

https://build.opensuse.org/request/show/1172470 by user fos + dimstar_suse

- Version 4.6.8
  - tukit: Properly handle overlay syncing failures: If the system would not be rebooted and several snapshots accumulated in the meantime, it was possible that the previous base snapshot - required for /etc syncing - was deleted already. In that case changes in /etc might have been reset. [gh#openSUSE/transactional-update#116] [gh#kube-hetzner/terraform-hcloud-kube-hetzner#1287]
  - soft-reboot: Log requested reboot type
  - soft-reboot: Don't force hard reboot on version change only
- Version 4.6.7
  - Add support for snapper 0.11.0; also significantly decreases cleanup time [boo#1223504]

Properly handle overlay syncing failures

9e6feeb

Previously, the code assumed that syncing always succeeds and only preserved the lowest layer of the parent snapshot. As a result, the data of the dropped layers was lost. Detect if syncing did not happen and preserve the layers.

laenion merged commit daf0098 into openSUSE:master May 6, 2024

Vogtinator deleted the nosyncfailmaster branch May 6, 2024 10:06
