This repository has been archived by the owner on Jun 11, 2024. It is now read-only.

Enable composefs + transient root #27

Merged
merged 1 commit into from
Feb 16, 2024

Conversation

cgwalters
Member

Let's enable this in our -dev images right now to try to put together our new recommended default experience.

@cgwalters
Member Author

This requires ostreedev/ostree#3170

@cgwalters
Member Author

And this is blocked by osbuild/bootc-image-builder#18 because that will (hopefully) fix the SELinux labeling for the underlying backing dirs.

(That said I may just try tomorrow to do a quick tactical patch for bib)

@cgwalters
Member Author

Just to emphasize the difference: after this lands, the deployed host will also run more like podman run does. dnf install on the client (deployment) side will "work", except that changes made will be lost across a subsequent bootc upgrade, in the same way that pulling a new container image in podman or Kube also loses data written to the overlay.

We're flipping around a big tradeoff here in that respect. I think it will make the system "feel" quite different from e.g. CoreOS/AtomicDesktops/etc. And we will also want to document how to turn off the transient overlay for those users who don't want it.

Related to this, really, most container images run in production deployments would be much better done with a readonly root. Some deployments go to some effort to try to use quotas to limit writes to the container overlayfs, when it would again just be better done by making it strictly readonly.

But, I think this alignment with the container defaults will help at a practical level.
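
To make the comparison with podman run concrete, here is a toy Python model of a transient overlay: reads fall through to the read-only image layer, writes land in a separate upper layer, and switching to a new image discards the upper layer. The class and method names (`TransientRoot`, `upgrade`) are invented for illustration and do not correspond to any bootc or ostree API.

```python
class TransientRoot:
    """Toy model: a read-only image layer plus a throwaway writable upper layer."""

    def __init__(self, image_files):
        self.lower = dict(image_files)  # content shipped in the image
        self.upper = {}                 # local writes, e.g. from `dnf install`

    def write(self, path, data):
        # Writes never touch the image layer.
        self.upper[path] = data

    def read(self, path):
        # The upper layer shadows the image layer.
        if path in self.upper:
            return self.upper[path]
        return self.lower.get(path)

    def upgrade(self, new_image_files):
        # A new image replaces the lower layer and drops all local writes.
        self.lower = dict(new_image_files)
        self.upper = {}


root = TransientRoot({"/usr/bin/bash": "v1"})
root.write("/usr/bin/htop", "installed locally")
before = root.read("/usr/bin/htop")
root.upgrade({"/usr/bin/bash": "v2"})
after = root.read("/usr/bin/htop")
print(before, after)
```

The locally installed file is readable until the upgrade and gone afterwards, which is the tradeoff described above.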

Member

@vrothberg vrothberg left a comment

LGTM

Can you add what you wrote to the docs? I find the description very helpful and believe it's worth preserving/sharing.

@cgwalters
Member Author

I've pushed the combined result of this PR plus dependencies to quay.io/cgwalters/ostest for ease of testing.

@cgwalters cgwalters marked this pull request as ready for review February 13, 2024 12:40
@cgwalters cgwalters marked this pull request as draft February 13, 2024 12:41
@jlebon

jlebon commented Feb 13, 2024

Hmm, why not turn on state overlays instead for /opt and /usr/local (which I assume is the primary driver for this)? Since this is all still in flux and heavily being tested, it'll give us a good opportunity to proof it out. Would require actually implementing ostreedev/ostree-rs-ext#573 though.

@cgwalters
Member Author

I like the idea of state overlays but again, it breaks the model "this works how docker works".

@cgwalters
Member Author

We should definitely document though how to turn off the transient root in a derived build (requires dropping an override config and regenerating the initramfs), and at the same time doing state overlays.
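
As a rough sketch of the "override config" half of that, the snippet below renders an INI-style fragment that turns the option off. It assumes the option lives in a `[root]` section under the key `transient` (mirroring the `root.transient` name used in this thread) and that the file is ostree's `prepare-root.conf`; the path constant and the helper name `render_override` are assumptions for illustration. Regenerating the initramfs is not shown.

```python
import configparser
import io

# Assumption: destination of the override inside the derived image.
OVERRIDE_PATH = "/usr/lib/ostree/prepare-root.conf"


def render_override(transient):
    """Render an INI fragment setting root.transient (hypothetical helper)."""
    cfg = configparser.ConfigParser()
    cfg["root"] = {"transient": "true" if transient else "false"}
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


text = render_override(False)
print(f"# would be written to {OVERRIDE_PATH}")
print(text)
```

This only renders the `[root]` section; a real override would need to keep whatever other sections the base image's config already carries.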

Would require actually implementing ostreedev/ostree-rs-ext#573 though.

While doing it via RUN ostree container commit would be nicer, one can just RUN mv /opt/something /usr/lib/opt/something no?

@jlebon

jlebon commented Feb 13, 2024

IMO, it's quite a safe assumption to make that users will expect packages that install in /opt to work the same as they currently do on traditional CentOS/RHEL.

Would require actually implementing ostreedev/ostree-rs-ext#573 though.

While doing it via RUN ostree container commit would be nicer, one can just RUN mv /opt/something /usr/lib/opt/something no?

Yeah, I think that'd work (but obviously also breaks the "works just like in traditional flows" abstraction).

@cgwalters
Member Author

IMO, it's quite a safe assumption to make that users will expect packages that install in /opt to work the same as they currently do on traditional CentOS/RHEL.

Can you elaborate on "work the same"? The main salient difference here is persistence of extra files across updates, is that what you're referring to?

@jlebon

jlebon commented Feb 13, 2024

IMO, it's quite a safe assumption to make that users will expect packages that install in /opt to work the same as they currently do on traditional CentOS/RHEL.

Can you elaborate on "work the same"? The main salient difference here is persistence of extra files across updates, is that what you're referring to?

Right, application state across upgrades. The fact it applies only to applications that install to /opt and not globally also makes it harder to map to a model. Unless you're versed in OSTree/packaging internals, it's not obvious why /opt should be different. Do we want to expose users to this difference or try to make things Just Work like they do in any other context (application Dockerfile or traditional systems)?

In a way, state overlays matches better the "implicit VOLUME /var" we've pivoted to recently. E.g. nuking /var and rebooting would correctly nuke all application state. That's not the case for transient root.

@cgwalters
Member Author

Right, application state across upgrades. The fact it applies only to applications that install to /opt and not globally also makes it harder to map to a model.

If in this statement "it" = "transient root", that's not true. It applies equally to /usr/local or for that matter arbitrary new toplevels like /some-custom-app.

Or stated simply:

  • /: transient overlay
  • /etc: 3 way merge
  • /var: Like VOLUME /var

The big difference between stateoverlays and transient root is that /opt is not a special case, it's just on /.

(In both, of course, we still have /home -> /var/home, even though many production cases would actually be happier with a fully transient /home.)
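
The three-way split above can be written down as a small lookup. This is only a sketch of the model as stated in this comment; the function name `persistence_model` and the returned labels are made up for illustration.

```python
from pathlib import PurePosixPath


def _under(path, top):
    """True if `path` is `top` itself or lives below it."""
    top = PurePosixPath(top)
    return path == top or top in path.parents


def persistence_model(path):
    """Return which of the three models described above governs `path`."""
    p = PurePosixPath(path)
    # /home is a symlink to /var/home, so it follows /var.
    if _under(p, "/home"):
        p = PurePosixPath("/var/home") / p.relative_to("/home")
    if _under(p, "/etc"):
        return "3-way merge"
    if _under(p, "/var"):
        return "volume"
    # Everything else: /usr, /opt, /usr/local and arbitrary new toplevels.
    return "transient overlay"


for example in ("/opt/someapp/bin", "/some-custom-app", "/etc/fstab", "/home/user/notes"):
    print(example, "->", persistence_model(example))
```

Note that /opt needs no entry of its own: it simply falls through to the default, which is the point being made about transient root.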

Hmm the more I think about it the more perhaps having root.transient persist across reboots by default is wrong...it conflicts more with a secure boot scenario.

Do we want to expose users to this difference or try to make things Just Work like they do in any other context (application Dockerfile or traditional systems)?

I am confused because it sounds like you're arguing for transient root when you mention Dockerfile - in both transient root (host) and podman run we get an overlayfs that goes away when the container exits. Now of course a container build process i.e. RUN foo captures state - but there's no merge process with any "hidden" state.

state overlays don't exist in the default container ecosystem.

In a way, state overlays matches better the "implicit VOLUME /var" we've pivoted to recently. E.g. nuking /var and rebooting would correctly nuke all application state.

It's a bit more nuanced than that, as docker behavior doesn't quite match the ostree one for /var; the content is only copied the very first time the volume is created. We don't quite have an equivalent of this today.
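
A simplified sketch of that "copy only on first creation" behavior, using temporary directories as stand-ins for an image's /var and a named volume. The helper `mount_volume` is hypothetical and only approximates what a container runtime does.

```python
import shutil
import tempfile
from pathlib import Path


def mount_volume(image_dir, volume_dir):
    """Seed the volume from the image only if the volume does not exist yet.

    Returns True when the initial copy happened.
    """
    if volume_dir.exists():
        return False
    shutil.copytree(image_dir, volume_dir)
    return True


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    image_v1 = tmp / "image-v1-var"
    image_v1.mkdir()
    (image_v1 / "default.conf").write_text("from v1")

    image_v2 = tmp / "image-v2-var"
    image_v2.mkdir()
    (image_v2 / "default.conf").write_text("from v2")
    (image_v2 / "new-file").write_text("added in v2")

    volume = tmp / "volume"
    first = mount_volume(image_v1, volume)   # volume is created and seeded
    second = mount_volume(image_v2, volume)  # volume exists: nothing is copied
    content = (volume / "default.conf").read_text()
    has_new = (volume / "new-file").exists()

print(first, second, content, has_new)
```

Content shipped under /var in a later image never reaches the existing volume in this model, which is the mismatch being pointed out.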

But yes, I am coming around to the idea that having the overlayfs persist by default across reboots was a bad idea.

We could probably in theory just reuse systemd.volatile=overlay but at the minor cost of a double overlayfs.

@jlebon

jlebon commented Feb 13, 2024

Right, application state across upgrades. The fact it applies only to applications that install to /opt and not globally also makes it harder to map to a model.

If in this statement "it" = "transient root", that's not true. It applies equally to /usr/local or for that matter arbitrary new toplevels like /some-custom-app.

Right, but /opt is the big ticket item here. /usr/local doesn't really have the issue with "colocated state". I'm sure there are apps out there that create /some-custom-app against the FHS that we might have to provide guidance for (briefly: if it's pure code, that should work as is, if it's pure state, it'd work to e.g. make it a /var symlink/bind-mount, if it's both code and state, it'd need a state overlay). We could also transparently enable state overlays for any toplevel dir we detect like this if we wanted... Just another systemd unit template instantiation.
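
The brief guidance in the parenthetical above amounts to a small decision table. A sketch, with a made-up function name and with the "neither" case added only so the function is total (the thread does not cover it):

```python
def toplevel_guidance(has_code, has_state):
    """Map what a non-FHS toplevel directory contains to the suggested handling."""
    if has_code and has_state:
        return "state overlay"
    if has_state:
        return "symlink or bind-mount into /var"
    if has_code:
        return "works as is"
    return "nothing to do"  # not discussed in the thread


print(toplevel_guidance(has_code=True, has_state=True))
```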

But again, everything else on the system doesn't behave this way (given that everything else is in /usr), so I think it's fair to still characterize it as special-casing. :)

Do we want to expose users to this difference or try to make things Just Work like they do in any other context (application Dockerfile or traditional systems)?

I am confused because it sounds like you're arguing for transient root when you mention Dockerfile - in both transient root (host) and podman run we get an overlayfs that goes away when the container exits. Now of course a container build process i.e. RUN foo captures state - but there's no merge process with any "hidden" state.

Let me rephrase this a different way: users will be installing various applications in their Dockerfile as part of creating a derived container image. Some of those applications will be in /usr and some in /opt. To require users to know there is a difference between those two classes of applications invites friction. They don't need to know this today whether using a traditional system or using an application container (in both cases, /usr and /opt behave like each other).

state overlays don't exist in the default container ecosystem.

State overlays isn't trying to model the container ecosystem. It's trying to make what works in traditional systems also work in image-based systems.

cgwalters added a commit to cgwalters/ostree that referenced this pull request Feb 13, 2024
We're debating this over in CentOS/centos-bootc-dev#27
and I have come to the conclusion that having changes to `/`
persist across reboot by default was a bad idea.

- It conflicts with any kind of secure boot scenario
- Having things only go away on upgrades is in some ways even *more* surprising
- The term `transient` implies this

There may be a use case in the future for having something like `root.transient = persistent`,
but this is just a better default.

Signed-off-by: Colin Walters <walters@verbum.org>
@cgwalters
Member Author

Right, but /opt is the big ticket item here. /usr/local doesn't really have the issue with "colocated state". I'm sure there are apps out there that create /some-custom-app against the FHS that we might have to provide guidance for

Yes, I recently discovered that the beaker project packages tests as RPMs that install to /mnt/test - and also xref internal link.

But also remember...it's not just about RPMs. Because the OCI world created a model where applications own their own filesystem, it's really common to just put binaries in e.g. /app or whatever and not care about namespacing. Some people will likely try to carry that type of thing forward even into bootable host container images.

Should we have a linting tool against this? Probably...

But again, everything else on the system doesn't behave this way (given that everything else is in /usr),

This is definitely a point of disagreement per above. It'd be nice if we lived in that world...

State overlays isn't trying to model the container ecosystem. It's trying to make what works in traditional systems also work in image-based systems.

Right, I agree. The ostree semantic for /etc is very much in that vein too.
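
For readers unfamiliar with the /etc semantic referred to here, the following is a deliberately simplified, file-level sketch of a 3-way merge: start from the new image's defaults, then re-apply whatever was changed locally relative to the old defaults. The function name and the dict-based representation are assumptions for illustration; the real implementation works on a filesystem tree.

```python
def merge_etc(old_default, current, new_default):
    """File-level 3-way merge of configuration (simplified sketch)."""
    merged = dict(new_default)
    for path, data in current.items():
        if old_default.get(path) != data:
            # Locally added or modified: the local version wins.
            merged[path] = data
    for path in old_default:
        if path not in current:
            # Locally deleted: stays deleted.
            merged.pop(path, None)
    return merged


old = {"a.conf": "1", "b.conf": "1", "c.conf": "1"}
cur = {"a.conf": "1", "b.conf": "local", "local.conf": "x"}  # b edited, c removed
new = {"a.conf": "2", "b.conf": "2", "c.conf": "2", "d.conf": "2"}
merged = merge_etc(old, cur, new)
print(merged)
```

Untouched files follow the new image, while local edits, additions and deletions survive, which is how traditional systems behave across package updates.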

We could also transparently enable state overlays for any toplevel dir we detect like this if we wanted... Just another systemd unit template instantiation.

I am just scared by that because it has a much higher level of magic. And by "magic" I mean "what precisely is the state of my files". The core tradeoff that we already covered is automatically having extra state/logs persist (good) versus having unintentional drift persist across changes.

The idea behind image-based updates is having the sha256 digest of the image you get from bootc status be meaningful. And any non-obvious persistent state undermines that. (I don't know what I was thinking with having the root.transient persist across reboots by default)

@cgwalters
Member Author

Another angle I'd take this to is that stateoverlays conflicts with a secureboot direction, where we really must strictly separate signed code from local persistent (unverified/unsigned) state like logfiles etc.

Training people to do hacks like RUN ln -s /var/log/someapp.log /opt/someapp/log to redirect from /opt to /var to explicitly persist state (as is needed with either composefs root or transient root) is better aligned with that than stateoverlays.
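
A small demonstration of that redirect, run inside a temporary directory standing in for / so it needs no privileges. The paths mirror the ones in the comment above; everything else is illustrative.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)  # stand-in for /
    (root / "var/log").mkdir(parents=True)
    (root / "opt/someapp").mkdir(parents=True)

    target = root / "var/log/someapp.log"
    link = root / "opt/someapp/log"
    # Equivalent of: ln -s /var/log/someapp.log /opt/someapp/log
    os.symlink(target, link)

    # The application keeps writing to its usual path under /opt ...
    with open(link, "a") as f:
        f.write("started\n")

    # ... but the bytes land under /var, outside the image-provided tree.
    landed_in_var = target.read_text()
    link_is_symlink = link.is_symlink()

print(repr(landed_in_var), link_is_symlink)
```

The code tree under /opt stays untouched, and the only mutable file lives under /var, keeping state separate from code.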

(And to complete this picture we should be mounting /var with noexec, etc.)

@jlebon

jlebon commented Feb 13, 2024

But again, everything else on the system doesn't behave this way (given that everything else is in /usr),

This is definitely a point of disagreement per above. It'd be nice if we lived in that world...

I'm confused. Do you not agree that the great majority of software we ship in our images keep their code in /usr and their state in /var? State overlays is a way to make exceptions like /opt more like the rest of the system whereas transient root is introducing a new conceptual model that applies only to them. I'm arguing that the former is less surprising than the latter.

We could also transparently enable state overlays for any toplevel dir we detect like this if we wanted... Just another systemd unit template instantiation.

I am just scared by that because it has a much higher level of magic.

Right, not necessarily saying we should do that. Mostly food for thought.

The idea behind image-based updates is having the sha256 digest of the image you get from bootc status be meaningful. And any non-obvious persistent state undermines that. (I don't know what I was thinking with having the root.transient persist across reboots by default)

I'll just repeat that state overlays is about making /opt behave like the rest of the system. The digest doesn't cover any content in /var today either. But yeah, we probably should have an easy way to tell apart code from state in /opt (which actually is something that's not straightforward today even on traditional systems).

@cgwalters
Member Author

I'm confused. Do you not agree that the great majority of software we ship in our images keep their code in /usr and their state in /var?

For "our images" today - absolutely. But we're blasting open the doors for arbitrary 3rd party code and build processes generating derived images here...

transient root is introducing a new conceptual model that applies only to them.

We may need to do a realtime chat on this. I don't agree with "new conceptual model" because I am arguing it's how docker run worked since it was introduced. I am also unsure what precisely you mean here by "them".

The digest doesn't cover any content in /var today either.

Right. But that's how it works with podman/docker run and volumes too (OK I'm going to add VOLUME /var into our default base images now). And more broadly really, the overall rough consensus accelerated by systemd and ostree but definitely predating them back to the FHS is to have /var be this, so that isn't surprising.

@cgwalters
Member Author

cgwalters commented Feb 13, 2024

OK I'm going to add VOLUME /var into our default base images now).

➡️ CentOS/centos-bootc#306

cgwalters added a commit to cgwalters/bootc-image-builder that referenced this pull request Feb 14, 2024
See containers/bootc#294
This is particularly motivated by CentOS/centos-bootc-dev#27
because with that suddenly `dnf` will appear to start working
but trying to do anything involving the kernel (i.e. mutating `/boot`)
will end in sadness, and this puts a stop to that.

(This also relates of course to ye olde osbuild#18
 where we want the partitioning setup in the default case
 to come from the container)

Signed-off-by: Colin Walters <walters@verbum.org>
@cgwalters
Member Author

Jonathan and I had a realtime chat on this. One thing he wasn't aware of is that in the composefs case there is no special read-only bind mount over /usr - it behaves the same as the rest of /. This is one of the big reasons that e.g. dnf install etc. just work when on a transient root.

(And again to restate more generally, I think this aligns with the container model)

We agreed that there is a big user experience difference between transient root and stateoverlays. And the default really matters.

If we go with transient by default, we should definitely create another base image that disables transient root, and enables stateoverlay for /opt for example - just to make it even easier to play with.

Another option I guess is to just do composefs by default, and document (and test) both transient root and stateoverlays. (And hmm, I guess we should actually change bootc usroverlay to bootc rootfs enable-transient or so...)

@jlebon

jlebon commented Feb 14, 2024

Jonathan and I had a realtime chat on this. One thing he wasn't aware of is that in the composefs case there is no special read-only bind mount over /usr - it behaves the same as the rest of /. This is one of the big reasons that e.g. dnf install etc. just work when on a transient root.

(And again to restate more generally, I think this aligns with the container model)

We agreed that there is a big user experience difference between transient root and stateoverlays. And the default really matters.

Indeed. Specifically, the trade off is essentially:

If we go with transient by default, we should definitely create another base image that disables transient root, and enables stateoverlay for /opt for example - just to make it even easier to play with.

Another option I guess is to just do composefs by default, and document (and test) both transient root and stateoverlays. (And hmm, I guess we should actually change bootc usroverlay to bootc rootfs enable-transient or so...)

I'm OK with either approach. I'm genuinely interested in what testers/users think of the two UXs. I mostly want the state overlay approach to be given fair consideration and get feedback. I'm less sure about supporting both longer term since it makes the story messier; I'm happy to drop the state overlay bits if it's judged to not add sufficient value.

cgwalters added a commit to cgwalters/centos-bootc that referenced this pull request Feb 14, 2024
This came out of discussion in CentOS/centos-bootc-dev#27

Basically...I think what we should emphasize in the future
is the combination of `bootc` and `dnf`.

There's no really strong reason to use `rpm-ostree` at container
build time versus `dnf`.  Now on the *client* side...well,
here's the interesting thing; with transient root enabled,
`dnf install` etc generally just works.

Of course, *persistent* changes don't.  However, anyone who
wants that can just `dnf install rpm-ostree` in their container
builds.

There is one gap that's somewhat important which is kernel arguments.
Because we haven't taught `grubby` to deal with ostree, and
we don't have containers/bootc#255
to change kargs per machine outside of install time one will
need to just hand-edit the configs in `/boot/loader`.

Another fallout from this is that `ostree container` goes away
inside the booted host...and today actually this totally
breaks bib until osbuild/bootc-image-builder#18
is fixed.

Probably bootc should grow the interception for that too optionally.
github-merge-queue bot pushed a commit to osbuild/bootc-image-builder that referenced this pull request Feb 15, 2024
See containers/bootc#294
This is particularly motivated by CentOS/centos-bootc-dev#27
because with that suddenly `dnf` will appear to start working
but trying to do anything involving the kernel (i.e. mutating `/boot`)
will end in sadness, and this puts a stop to that.

(This also relates of course to ye olde #18
 where we want the partitioning setup in the default case
 to come from the container)

Signed-off-by: Colin Walters <walters@verbum.org>
@cgwalters cgwalters marked this pull request as ready for review February 16, 2024 16:34
@cgwalters
Member Author

OK actually since we did ostreedev/ostree@0cff65d this means we actually aren't blocked by osbuild/bootc-image-builder#149 anymore!

@cgwalters cgwalters merged commit be1753b into CentOS:main Feb 16, 2024
3 checks passed
@cgwalters
Member Author

Now proposing this in the main image in CentOS/centos-bootc#356
