Properly handle overlay syncing failures #116

Merged: 1 commit on May 6, 2024

Vogtinator
Member

Previously, the code assumed that syncing always succeeds and preserved only the lowest layer of the parent snapshot. When syncing did not happen, the data in the dropped layers was lost. Detect if syncing did not happen and preserve the layers in that case.
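
The behavior change described above can be sketched roughly as follows. This is a simplified illustration in Python, not the actual tukit code: the function name, the list representation of the layers, and the layer labels are all assumptions made for this sketch.

```python
def layers_to_preserve(parent_layers, sync_happened):
    """Pick which overlay layers of the parent snapshot to keep.

    parent_layers: hypothetical list of layer labels, topmost first,
                   with the lowest layer as the last element.
    sync_happened: True if the sync into the new snapshot actually ran.
    """
    if not parent_layers:
        return []
    if sync_happened:
        # Old behavior, still valid when the sync ran:
        # keep only the lowest layer of the parent snapshot.
        return [parent_layers[-1]]
    # New behavior: the sync was skipped, so dropping the upper
    # layers would lose their data. Keep all of them.
    return list(parent_layers)


# Hypothetical layer labels, for illustration only.
parent = ["overlay-3", "overlay-2", "overlay-1"]
print(layers_to_preserve(parent, sync_happened=True))
print(layers_to_preserve(parent, sync_happened=False))
```

The key point is the second branch: before this change, the first branch was effectively taken unconditionally.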

@andreygolev

It seems we're affected by the same issue.
Our /etc is randomly lost during OS updates, while the overlay of an older snapshot contains all the changes we made in /etc.

@laenion
Collaborator

laenion commented Apr 29, 2024

I'm currently working on a rework of overlay handling so that it doesn't rely on older snapshots.

However, I'm wondering, @andreygolev: the problem Vogtinator was fixing only occurs when the parent snapshot of the current one is deleted. This only happens when you create multiple (by default > 5) new snapshots before a reboot, delete the previous snapshot manually, or have snapper configured to preserve only one snapshot. Is this the case in your setup?
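
The three conditions listed above can be written down as a small predicate. This is only an illustrative sketch: the function and parameter names are hypothetical, and the default limit of 5 is taken from the "by default > 5" remark above rather than read from any snapper configuration.

```python
def parent_snapshot_at_risk(new_snapshots_since_boot,
                            keep_limit=5,
                            parent_deleted_manually=False):
    """Return True if the parent snapshot may already be gone.

    Mirrors the three cases from the comment above:
    1. more than keep_limit new snapshots were created before a reboot,
    2. the previous snapshot was deleted manually,
    3. only one snapshot is preserved at all.
    """
    if parent_deleted_manually:
        return True
    if keep_limit <= 1:
        return True
    return new_snapshots_since_boot > keep_limit


# Daily updates for a week without a reboot (first case).
print(parent_snapshot_at_risk(new_snapshots_since_boot=7))
# A single update followed by a reboot.
print(parent_snapshot_at_risk(new_snapshots_since_boot=1))
```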

@andreygolev

According to the logs, there was just 1 reboot in 7 days on the last affected node, while transactional-update runs daily.
So it seems this is the first case.

@andi0b

andi0b commented May 5, 2024

It looks like the users of the project kube-hetzner are strongly affected by this or a similar issue. Some nodes seem to revert back to the stock /etc after a reboot, which is a catastrophic situation, as no services start up and even the network settings are gone (node unreachable).

My working theory of how we run into this issue is roughly:

  1. Set up openSUSE MicroOS with k3s and kured
  2. Keep the default daily updates with transactional-update.timer enabled
  3. Have some Kubernetes workloads running that prevent kured from rebooting the node (this might happen unnoticed, or fixing it might take some time)
  4. transactional-update keeps running daily and keeps creating snapshots
  5. After a few days (probably 10-40 days) old snapshots get cleaned up
  6. The first reboot into an updated system works fine, because it's still using the /etc overlay
  7. On the second reboot into an updated system, /etc doesn't get merged and all customizations are lost (/etc is reverted to the "stock" /etc from the MicroOS installation)

This working theory is supported by finding messages like "Parent snapshot 3 does not exist any more - skipping rsync" in the logs.
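
To check a node for this symptom, one could scan log lines for that message. The following is a minimal sketch: the message text is the one quoted above, while the function name and the sample input are made up for illustration, and where the log lines come from is left to the caller.

```python
import re

# Matches the message quoted above and captures the snapshot number.
SKIP_PATTERN = re.compile(
    r"Parent snapshot (\d+) does not exist any more - skipping rsync"
)


def find_skipped_syncs(log_lines):
    """Return the parent snapshot numbers for which rsync was skipped."""
    skipped = []
    for line in log_lines:
        match = SKIP_PATTERN.search(line)
        if match:
            skipped.append(int(match.group(1)))
    return skipped


# Illustrative input: the first line is the message from this thread,
# the second is a placeholder for any unrelated log line.
sample = [
    "Parent snapshot 3 does not exist any more - skipping rsync",
    "unrelated log line",
]
print(find_skipped_syncs(sample))
```

A non-empty result means at least one sync was skipped because the parent snapshot was already gone.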

We know that it is recommended to reboot as soon as possible after running transactional-update, but the reality is that the reboot does not always happen in a timely manner.

There is a longer discussion here: kube-hetzner/terraform-hcloud-kube-hetzner#1287

There are a few more issues reported by users, but mostly they are unsolved, because people might just recreate the nodes, switch to another project, or give up, instead of investigating it thoroughly.

@mysticaltech

@Vogtinator @sysrich That will be a life saver for our project, https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner; we are losing nodes because of this issue.

@laenion
Collaborator

laenion commented May 6, 2024

Thanks a lot to all involved (and especially @Vogtinator for the patch and @andi0b for the detailed breakdown here and in kube-hetzner/terraform-hcloud-kube-hetzner#1287 (reply in thread)). It seems this problem affects several people, so I won't wait for the reworked overlay handling, but will apply the pull request immediately.

@laenion laenion merged commit daf0098 into openSUSE:master May 6, 2024
@Vogtinator Vogtinator deleted the nosyncfailmaster branch May 6, 2024 10:06
@andi0b

andi0b commented May 6, 2024

@laenion Thanks! I just want to add that I didn't test this PR at all, or test if this fixes our issue. I stumbled upon it and wanted to highlight the severity of the issue.

@laenion
Collaborator

laenion commented May 6, 2024

No worries: I tested it and also verified that it actually solves the problem ;-)

bmwiedemann pushed a commit to bmwiedemann/openSUSE that referenced this pull request May 9, 2024
https://build.opensuse.org/request/show/1172470
by user fos + dimstar_suse
- Version 4.6.8
  - tukit: Properly handle overlay syncing failures: If the system would not be rebooted and several snapshots accumulated in the meantime, it was possible that the previous base snapshot - required for /etc syncing - was deleted already. In that case changes in /etc might have been reset. [gh#openSUSE/transactional-update#116] [gh#kube-hetzner/terraform-hcloud-kube-hetzner#1287]
  - soft-reboot: Log requested reboot type
  - soft-reboot: Don't force hard reboot on version change only
- Version 4.6.7
  - Add support for snapper 0.11.0; also significantly decreases cleanup time [boo#1223504]