Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lib/sysroot-deploy: Add experimental support for automatic early prune #2847

Merged
merged 4 commits into from
May 1, 2023

Conversation

jlebon
Copy link
Member

@jlebon jlebon commented Apr 13, 2023

During the early design of FCOS and RHCOS, we chose a value of 384M
for the boot partition. This turned out to be too small: some arches
other than x86_64 have larger initrds, kernel binaries, or additional
artifacts (like device tree blobs). We'll likely bump the boot partition
size in the future, but we don't want to abandon all the nodes deployed
with the current size.[1]

Because stale entries in /boot are cleaned up after new entries are
written, there is a window in the update process during which the bootfs
temporarily must host all the (kernel, initrd) pairs for the union of
current and new deployments.

This patch determines if the bootfs is capable of holding all the
pairs. If it can't but it could hold all the pairs from just the new
deployments, the outgoing deployments (e.g. rollbacks) are deleted
before new deployments are written. This is done by updating the
bootloader in two steps to maintain atomicity.

Since this is a lot of new logic in an important section of the
code, this feature is gated for now behind an environment variable
(OSTREE_SYSROOT_OPTS=early-prune). Once we gain more experience with it,
we can consider turning it on by default.

This strategy increases the fallibility of the update system since one
would no longer be able to rollback to the previous deployment if a bug
is present in the bootloader update logic after auto-pruning. This is
however mitigated by the fact that the heuristic is opportunistic: the
rollback is pruned only if it's the only way for the system to update.

Closes: #2670

@openshift-ci
Copy link

openshift-ci bot commented Apr 13, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@jlebon
Copy link
Member Author

jlebon commented Apr 13, 2023

I still need to add tests for this. Prep patches split out in #2848.

@jlebon jlebon force-pushed the pr/calculate-and-cleanup branch 2 times, most recently from 5f71d07 to 23bedeb Compare April 14, 2023 16:24
@jlebon jlebon marked this pull request as ready for review April 14, 2023 16:24
@jlebon
Copy link
Member Author

jlebon commented Apr 14, 2023

Now with a test!

Copy link
Member

@cgwalters cgwalters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gave this a skim, seems sane.

src/libostree/ostree-sysroot-deploy.c Show resolved Hide resolved
src/libostree/ostree-sysroot-deploy.c Show resolved Hide resolved
…ization

In the unusual case where one is manually finalizing staged deployments,
as can happen in testing, we expect a successful finalization to remove
the failure stamp file.
AFAICT, I don't see how `runkola.sh` or the Makefile in `tests/kolainst`
can create files in `tests/kola` since it's geared towards installing
under `/usr`.
When hacking and testing locally with `cosa build-fast` and `kola run`,
I prefer to leave testing framework stuff within the work directory
rather than installed in my pet container. Add a `localinstall` target
for this which puts the tests in `tests/kola`. Then a simple `kola run`
will pick it up.
@dustymabe
Copy link
Contributor

During the early design of FCOS and RHCOS, we chose a value of 384M for the boot partition. This turned out to be too small: some arches other than x86_64 have larger initrds, kernel binaries, or additional artifacts (like device tree blobs). We'll likely bump the boot partition size in the future, but we don't want to abandon all the nodes deployed with the current size.[1]

Because stale entries in /boot are cleaned up after new entries are written, there is a window in the update process during which the bootfs temporarily must host all the (kernel, initrd) pairs for the union of current and new deployments.

This patch determines if the bootfs is capable of holding all the pairs. If it can't but it could hold all the pairs from just the new deployments, the outgoing deployments (e.g. rollbacks) are deleted before new deployments are written. This is done by updating the bootloader in two steps to maintain atomicity.

Since this is a lot of new logic in an important section of the code, this feature is gated for now behind an environment variable (OSTREE_EXP_AUTO_EARLY_PRUNE). Once we gain more experience with it, we can consider turning it on by default.

If we lift it out of experimental in the future, but leave it off by
default should we consider removing EXP from the name of the var
here?

Also, should we consider allowing the var to be used to turn off the
behavior (i.e. if we flip it to the the default in the future and
someone finds they don't want the behavior).

This strategy increases the fallibility of the update system since one would no longer be able to rollback to the previous deployment if a bug is present in the bootloader update logic after auto-pruning. This is however mitigated by the fact that the heuristic is opportunistic: the rollback is pruned only if it's the only way for the system to update.

So basically if you have X,Y -> Z and there is a logic problem
in Y you can't get back to X in order to update to Z? Your words
are a bit hard to decipher but this simple example makes it clear to me
(assuming the simple example is correct).

@jlebon
Copy link
Member Author

jlebon commented Apr 14, 2023

If we lift it out of experimental in the future, but leave it off by default should we consider removing EXP from the name of the var here?

The idea with the EXP variable is that it's purposely temporary. If it's stabilized but left off by default, I think we'd want this controlled by e.g. a repo config knob instead. That said, I've removed the EXP wording since it could make sense even long-term to have a variable override for it to get out of a pickle (given that e.g. repo configs are unmanaged, but systemd dropins aren't).

Also, should we consider allowing the var to be used to turn off the behavior (i.e. if we flip it to the the default in the future and someone finds they don't want the behavior).

Since it's currently not the default, this would be decided at the time we do decide to flip it to be the default. We could e.g. keep the variable around and slightly tweak the existing checks. Related to this, I've updated the check so that we not only skip it if the variable is defined but empty, but also if it's set to 0.

So basically if you have X,Y -> Z and there is a logic problem in Y you can't get back to X in order to update to Z? Your words are a bit hard to decipher but this simple example makes it clear to me (assuming the simple example is correct).

I didn't want to dig too much into this in the commit message because indeed it's a bit subtle and would make it even longer. I linked to #2670 (comment) instead which with the following comments should help I think.

src/libostree/ostree-sysroot-deploy.c Show resolved Hide resolved
src/libostree/ostree-sysroot-deploy.c Outdated Show resolved Hide resolved
src/libostree/ostree-sysroot-deploy.c Outdated Show resolved Hide resolved
src/libostree/ostree-sysroot-deploy.c Outdated Show resolved Hide resolved
Copy link
Member

@cgwalters cgwalters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor changes

During the early design of FCOS and RHCOS, we chose a value of 384M
for the boot partition. This turned out to be too small: some arches
other than x86_64 have larger initrds, kernel binaries, or additional
artifacts (like device tree blobs). We'll likely bump the boot partition
size in the future, but we don't want to abandon all the nodes deployed
with the current size.[[1]]

Because stale entries in `/boot` are cleaned up after new entries are
written, there is a window in the update process during which the bootfs
temporarily must host all the `(kernel, initrd)` pairs for the union of
current and new deployments.

This patch determines if the bootfs is capable of holding all the
pairs. If it can't but it could hold all the pairs from just the new
deployments, the outgoing deployments (e.g. rollbacks) are deleted
*before* new deployments are written. This is done by updating the
bootloader in two steps to maintain atomicity.

Since this is a lot of new logic in an important section of the
code, this feature is gated for now behind an environment variable
(`OSTREE_ENABLE_AUTO_EARLY_PRUNE`). Once we gain more experience with
it, we can consider turning it on by default.

This strategy increases the fallibility of the update system since one
would no longer be able to rollback to the previous deployment if a bug
is present in the bootloader update logic after auto-pruning (see [[2]]
and following). This is however mitigated by the fact that the heuristic
is opportunistic: the rollback is pruned *only if* it's the only way for
the system to update.

[1]: coreos/fedora-coreos-tracker#1247
[2]: ostreedev#2670 (comment)

Closes: ostreedev#2670
@jlebon jlebon force-pushed the pr/calculate-and-cleanup branch from b159791 to c561e61 Compare May 1, 2023 16:12
@jlebon
Copy link
Member Author

jlebon commented May 1, 2023

Updated for comments!

@jlebon
Copy link
Member Author

jlebon commented May 26, 2023

I forgot to update the commit message on this to reference the new environment variable knob (OSTREE_SYSROOT_OPTS=early-prune) rather than the initial OSTREE_ENABLE_AUTO_EARLY_PRUNE one. I've updated the description, but it'll forever be in git history, lying to readers... 😢

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a repo option to auto-prune other deployments (e.g. rollback) when starting upgrade
3 participants