Add concept of state overlays #3120

Merged on Jan 11, 2024 (2 commits)

@jlebon (Member) commented Dec 14, 2023

In the OSTree model, executables go in /usr, state in /var and configuration in /etc. Software that lives in /opt however messes this up because it often mixes code and state, making it harder to manage.

More generally, it's sometimes useful to have the OSTree commit contain code under a certain path, but still allow that path to be writable by software and the sysadmin at runtime (/usr/local is another instance).

Add the concept of state overlays. A state overlay is an overlayfs mount whose upper directory, which contains unmanaged state, is carried forward on top of a lower directory, containing OSTree-managed files.

In the example of /usr/local, OSTree commits can ship content there, all while allowing users to e.g. add scripts in /usr/local/bin when booted into that commit.
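As a rough illustration of the mount described above, here is a minimal sketch that assembles the overlayfs option string for one state overlay. The helper name `build_overlay_options`, the `state_dir` parameter and the `upper`/`work` subdirectory layout are assumptions made for this example, not the code in this PR. The only hard requirement taken from overlayfs itself is that `workdir` lives on the same filesystem as `upperdir`.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the overlayfs option string for a state overlay.
 * `lower` is the OSTree-managed directory (e.g. /usr/local in the deployment).
 * `state_dir` is a hypothetical per-overlay directory that holds the
 * unmanaged "upper" directory and the overlayfs "work" directory.
 * Returns 0 on success, -1 if `out` is too small. */
static int
build_overlay_options (const char *lower, const char *state_dir, char *out, size_t outlen)
{
  assert (lower != NULL && state_dir != NULL && out != NULL);
  int n = snprintf (out, outlen, "lowerdir=%s,upperdir=%s/upper,workdir=%s/work",
                    lower, state_dir, state_dir);
  return (n < 0 || (size_t) n >= outlen) ? -1 : 0;
}
```

The resulting string would be passed as the `data` argument of `mount(2)` with filesystem type `overlay`, with the mount target being the same path as the lower directory. Paths containing `,` or `:` would need extra escaping, which this sketch does not attempt.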

Some reconciliation logic is executed whenever the base is updated so that newer files in the base are never shadowed by a copied up version in the upper directory. This matches RPM semantics when upgrading packages whose files may have been modified.

For ease of integration, this is exposed as a systemd template unit which any downstream distro/user can enable. The instance name is the mountpath in escaped systemd path notation (e.g. ostree-state-overlay@usr-local.service).
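To make the instance naming concrete, here is a small sketch of systemd's path escaping (what `systemd-escape --path` does) and of building the unit name from a mount path. It is a simplified reimplementation written for this example: leading, trailing and duplicate slashes are dropped, `/` becomes `-`, and anything outside ASCII alphanumerics, `:`, `_` and `.` is written as `\xNN`. It does not normalize `.` or `..` path components, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Characters systemd leaves unescaped in unit names. */
static int
is_plain (unsigned char c)
{
  return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')
         || c == ':' || c == '_' || c == '.';
}

/* Escape an absolute path the way `systemd-escape --path` would.
 * Returns 0 on success, -1 on a relative path or a too-small buffer. */
static int
escape_mount_path (const char *path, char *out, size_t outlen)
{
  assert (path != NULL && out != NULL);
  if (path[0] != '/' || outlen < 2)
    return -1;
  size_t n = 0;
  int pending_sep = 0;
  for (const char *p = path; *p != '\0'; p++)
    {
      unsigned char c = (unsigned char) *p;
      if (c == '/')
        {
          /* Remember the separator; leading and trailing ones are never emitted. */
          pending_sep = (n > 0);
          continue;
        }
      if (pending_sep)
        {
          if (n + 1 >= outlen)
            return -1;
          out[n++] = '-';
          pending_sep = 0;
        }
      if (is_plain (c) && !(c == '.' && n == 0))
        {
          if (n + 1 >= outlen)
            return -1;
          out[n++] = (char) c;
        }
      else
        {
          if (n + 4 >= outlen)
            return -1;
          snprintf (out + n, outlen - n, "\\x%02x", (unsigned int) c);
          n += 4;
        }
    }
  if (n == 0)
    out[n++] = '-'; /* the root directory escapes to "-" */
  out[n] = '\0';
  return 0;
}

/* Build the template instance name for a given mount path. */
static int
state_overlay_unit_name (const char *path, char *out, size_t outlen)
{
  char inst[256];
  if (escape_mount_path (path, inst, sizeof inst) != 0)
    return -1;
  int n = snprintf (out, outlen, "ostree-state-overlay@%s.service", inst);
  return (n < 0 || (size_t) n >= outlen) ? -1 : 0;
}
```

With this, `/usr/local` maps to the unit name quoted above, while a path such as `/opt/some-app` needs the literal dash escaped as `\x2d`.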

See discussions in #3113 for more details.

@cgwalters (Member)

There's a lot of appeal to this for sure. I mean, it opens up the question a bit if we were to do this, what would it look like to do it for everything, having it also take over handling of /etc and /var for example too.

We had a realtime chat on this, but for the record: The thing that I fear a bit is basically that while I agree it will largely Just Work for things like RPMs/debs that install in /opt and want to write extra data there...the flip side is anywhere in these directories you can "leak state". Of course, state can be leaked in /etc and /var too today! But a big part of the idea is that admins only have those two places to save/restore. With this, there's potentially new important persistent state in these mount points.

However there are some other things:

> Some reconciliation logic is executed whenever the base is updated so that newer files in the base are never shadowed by a copied up version in the upper directory. This matches RPM semantics when upgrading packages whose files may have been modified.

There's a lot of potentially subtle corner cases in this. For example, what happens when the lower replaces a directory with a symlink? (A classic RPM problem). It looks like your logic unconditionally deletes away all content in the upper, which could include extra "local" state files too. That isn't an incorrect behavior, but it's one case which could result in some data loss, depending on the scenario. To be clear there isn't a magic solution here; the crude hammer of root.transient just forces the "data loss" to always happen on every OS update, instead of just sometimes. (Really it forces admins to manually choose what persists by symlinking/mounting those files to /var).

Note that the ostree /etc semantic handles it differently; the "upper" always wins, meaning you can instead silently not get updates in this scenario. Now, changing the "type" of an entry in /etc between dir and !dir is IME exceedingly rare, i.e. I don't know of a time it has happened. But while it is rare, it definitely happens to change between dir and !dir in /usr.


Now, if I'm understanding things right, state overlays are orthogonal to root.transient, which would be great. I could imagine for example wanting to enable root.transient in an OS build, but then also add a state overlay for some specific /opt/someapp.


Visualization/debuggability: overlayfs is a pretty raw kernel feature; but with e.g. container tooling there's tools to introspect the "layers" (that usually map to overlayfs dirs). I am sure we'd need to have something like ostree state-overlay diff like we have ostree admin config-diff too in the future.

@cgwalters (Member)

I think my biggest concern in a nutshell is the "unknown unknowns" of this at least to me. The implementation of root.transient is extremely simple, and its semantics very well known and easy to understand and explain due to its prevalence since the creation of docker.

This is a more complex approach, and its semantics required me to think carefully and read the code. I wonder if anyone else is using this approach for anything? I can't imagine it's actually novel, but it'd really help me to see something else doing this for some part of a system and how they document it.

I think https://kubic.opensuse.org/documentation/man-pages/transactional-update.8.html might use something like this for /etc?

May see if I can reach out on some social media.

@ericcurtin

This comment was marked as off-topic.

@jlebon (Member Author) commented Dec 15, 2023

> There's a lot of appeal to this for sure. I mean, it opens up the question a bit if we were to do this, what would it look like to do it for everything, having it also take over handling of /etc and /var for example too.

Yeah, definitely could be interesting. Obviously, it's got a much larger impact (and e.g. the fact that overlayfs isn't fully POSIX compliant has a higher chance to rear its head if all OS state is on it).

> We had a realtime chat on this, but for the record: The thing that I fear a bit is basically that while I agree it will largely Just Work for things like RPMs/debs that install in /opt and want to write extra data there...the flip side is anywhere in these directories you can "leak state". Of course, state can be leaked in /etc and /var too today! But a big part of the idea is that admins only have those two places to save/restore. With this, there's potentially new important persistent state in these mount points.

The nice thing though is that the upperdir is on /var and because of the beauty of overlayfs, the dir is quite usable from a backup perspective. There are the whiteout character devices which are weird, but a backup service could pretty much ignore those. Backing up the mountpoint is OK too, but would pick up base content too.

> There's a lot of potentially subtle corner cases in this. For example, what happens when the lower replaces a directory with a symlink? (A classic RPM problem). It looks like your logic unconditionally deletes away all content in the upper, which could include extra "local" state files too. That isn't an incorrect behavior, but it's one case which could result in some data loss, depending on the scenario. To be clear there isn't a magic solution here; the crude hammer of root.transient just forces the "data loss" to always happen on every OS update, instead of just sometimes. (Really it forces admins to manually choose what persists by symlinking/mounting those files to /var).
>
> Note that the ostree /etc semantic handles it differently; the "upper" always wins, meaning you can instead silently not get updates in this scenario. Now, changing the "type" of an entry in /etc between dir and !dir is IME exceedingly rare, i.e. I don't know of a time it has happened. But while it is rare, it definitely happens to change between dir and !dir in /usr.

Indeed. I'll note though that the condition here would have to be that it's a file in e.g. /opt or /usr/local which changes from dir to !dir and that it's in the upper dir (i.e. users/apps are expected to write to that dir). Given that RPM itself errors out, I wonder how likely this is. Of course, users could do whatever they want and write to places they shouldn't, so definitely something to be aware of. We could also instead of deleting, rename the directory out of the way and warn.

> Now, if I'm understanding things right, state overlays are orthogonal to root.transient, which would be great. I could imagine for example wanting to enable root.transient in an OS build, but then also add a state overlay for some specific /opt/someapp.

Correct!

> Visualization/debuggability: overlayfs is a pretty raw kernel feature; but with e.g. container tooling there's tools to introspect the "layers" (that usually map to overlayfs dirs). I am sure we'd need to have something like ostree state-overlay diff like we have ostree admin config-diff too in the future.

Could be interesting down the line. In the vanilla additive case, tree /var/ostree/state-overlays/$overlay/upper gives you that, but definitely whiteouts and opaque dirs could benefit from better visualization.

> I think my biggest concern in a nutshell is the "unknown unknowns" of this at least to me. The implementation of root.transient is extremely simple, and its semantics very well known and easy to understand and explain due to its prevalence since the creation of docker.
>
> This is a more complex approach, and its semantics required me to think carefully and read the code.

I think this is worth teasing apart a bit more. Semantics-wise, I would describe it as "package manager-like", which is of course familiar to a lot of people in this space. If you have to describe it to an end-user, that would suffice. Where it's more complex is in its implementation, which yes, if you hit some corner case you might have to peek under the covers and understand it. We certainly should have docs for that, but the goal is certainly that most people will not need to care.

Agreed re. "unknown unknowns". I'm optimistic about this, but we really need people testing it in realistic scenarios (this is the motivation behind the last commit in coreos/rpm-ostree#4728). Then we can see how well this works and whether we should move forward trying to polish and stabilize it.

> I wonder if anyone else is using this approach for anything? I can't imagine it's actually novel, but it'd really help me to see something else doing this for some part of a system and how they document it.

Yeah, hard to find prior art on this specifically. Definitely also interested to hear from overlayfs SMEs. Maybe @rhvgoyal?

@cgwalters (Member)

@ericcurtin I was just referencing that project as potential prior art to look at for a small implementation detail, not proposing depending on it.

@@ -42,6 +42,8 @@ static OstreeCommand admin_subcommands[] = {
"Change the finalization locking state of the staged deployment" },
{ "boot-complete", OSTREE_BUILTIN_FLAG_NO_REPO | OSTREE_BUILTIN_FLAG_HIDDEN,
ot_admin_builtin_boot_complete, "Internal command to run at boot after an update was applied" },
{ "state-overlay", OSTREE_BUILTIN_FLAG_NO_REPO | OSTREE_BUILTIN_FLAG_HIDDEN,

The CLI is hidden, but the systemd unit is not; so this would become insta-stable; is that the intention? I'm OK with that, just checking.

@cgwalters (Member) left a comment

I'm OK to merge this to just make it easier to experiment with. We'll definitely want a man page for this though.

I'm not totally sure if this should be marked as experimental or not.

@jlebon (Member Author) commented Jan 9, 2024

> I'm OK to merge this to just make it easier to experiment with. We'll definitely want a man page for this though.

Sure, can add one.

> The CLI is hidden, but the systemd unit is not; so this would become insta-stable; is that the intention? I'm OK with that, just checking.
> ...
> I'm not totally sure if this should be marked as experimental or not.

My intent was to not declare it as stabilized yet, because I really feel it needs validation in the real world first. I didn't think about how to gate the systemd unit itself. I guess we could name it e.g. ostree-experimental-state-overlay@.service and then rename it once it's stabilized? A bit awkward. We'd also have to carry the old unit name for a while.

In practice, I think people aren't going to know to turn this unit on without looking at the docs, where we can explicitly say this is still experimental. (And even more realistically, I think the primary user of this before stabilizing will be the environment client-side knob demo'ed in coreos/rpm-ostree#233 (comment), where it's more clear that it's experimental.)

@jlebon (Member Author) commented Jan 9, 2024

Updated!

Though this now requires https://gitlab.gnome.org/GNOME/libglnx/-/merge_requests/52.

Edit: Oh right, this still also needs some docs. Done!

Bumps libglnx from `aff1eea` to `b415d046`.

For https://gitlab.gnome.org/GNOME/libglnx/-/merge_requests/52.

Update submodule: libglnx