Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sysroot: Rework /var handling to act like Docker VOLUME /var #3166

Merged
merged 1 commit into from
Feb 10, 2024

Conversation

cgwalters
Copy link
Member

We've long struggled with semantics for /var. Our stance of "/var should start out empty and be managed by the OS" is a strict one, that pushes things closer to the original systemd upstream ideal of the "OS state is in /usr".

However...well, a few things. First, we had some legacy bits here which were always populating the deployment /var. I don't think we need that if systemd is in use, so detect if the tree has usr/lib/tmpfiles.d, and don't create that stuff at ostree admin stateroot-init time if so.

Building on that then, we have the stateroot var starting out actually empty.

When we do a deployment, if the stateroot var is empty, make a copy (reflink if possible of course) of the commit's /var into it.

This matches the semantics that Docker created with volumes, and this is sufficiently simple and easy to explain that I think it's closer to the right thing to do.

Crucially...it's just really handy to have some pre-existing directories in /var in container images, because Docker (and podman/kube/etc) don't run systemd and hence don't run tmpfiles.d on startup.

I really hit on the fact that we need /var/tmp in our container images by default for example.

So there's still some overlap here with e.g. /usr/lib/tmpfiles.d/var.conf as shipped by systemd, but that's fine - they don't actually conflict per se.

Copy link

openshift-ci bot commented Feb 9, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@cgwalters
Copy link
Member Author

/test all

cgwalters added a commit to cgwalters/ostree-rs-ext that referenced this pull request Feb 9, 2024
This is intended to pair with ostreedev/ostree#3166

If we detect a new enough ostree version, then by default don't remap
content in `/var`, assuming that ostree itself will handle it.

In order to unit test this (without depending on the ostree version
that happens to be on the host) add an API to the importer
which allows overriding the version.
@cgwalters
Copy link
Member Author

xref ostreedev/ostree-rs-ext#602

We've long struggled with semantics for `/var`.  Our stance of
"/var should start out empty and be managed by the OS" is a strict
one, that pushes things closer to the original systemd upstream
ideal of the "OS state is in /usr".

However...well, a few things.  First, we had some legacy bits
here which were always populating the deployment `/var`.  I don't
think we need that if systemd is in use, so detect if the tree
has `usr/lib/tmpfiles.d`, and don't create that stuff at
`ostree admin stateroot-init` time if so.

Building on that then, we have the stateroot `var` starting out
actually empty.

When we do a deployment, if the stateroot `var` is empty,
make a copy (reflink if possible of course) of the commit's `/var`
into it.

This matches the semantics that Docker created with volumes,
and this is sufficiently simple and easy to explain that I think
it's closer to the right thing to do.

Crucially...it's just really handy to have some pre-existing
directories in `/var` in container images, because Docker (and podman/kube/etc)
don't run systemd and hence don't run `tmpfiles.d` on startup.

I really hit on the fact that we need `/var/tmp` in our container
images by default for example.

So there's still some overlap here with e.g. `/usr/lib/tmpfiles.d/var.conf`
as shipped by systemd, but that's fine - they don't actually conflict
per se.
@cgwalters
Copy link
Member Author

OK yep, I've tested this more "end-to-end" in combination with ostreedev/ostree-rs-ext#602 and things work well.

Copy link
Member

@jmarrero jmarrero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@cgwalters cgwalters merged commit cb3c42e into ostreedev:main Feb 10, 2024
25 checks passed
cgwalters added a commit to cgwalters/ostree-rs-ext that referenced this pull request Feb 11, 2024
This is intended to pair with ostreedev/ostree#3166

If we detect a new enough ostree version, then by default don't remap
content in `/var`, assuming that ostree itself will handle it.

In order to unit test this (without depending on the ostree version
that happens to be on the host) add an API to the importer
which allows overriding the version.
@cgwalters
Copy link
Member Author

cgwalters commented Feb 11, 2024

One thing I was thinking about here is how this interacts with people who want to do a separate /var mount. It still makes sense to support combining them.

Looking at the current anaconda logic (assuming tasks aren't being run in parallel) then it looks like what happens is that the deploy task combines the stateroot-init with a deployment - and that's when we copy the container /var to the stateroot var.

And at this point, there's no mount point there. It's only after that that the mount point task runs - this is basically reimplementing the logic from ostree-prepare-root.

So in order to properly get the commit/container-image /var content on a separate mountpoint, it looks like we need to do one of two things:

Move data out of stateroot var

Possibly easiest: When we init the mountpoint in the second step, move data that ostree admin deploy wrote into the stateroot /var into the final mount

Split up phases more

Probably better: Split up the stateroot-init phase from the deploy, so the flow goes like this:

  • stateroot-init
  • mount /var from mount point if provided
  • deploy (which will write to the mount point)
  • bind mount stateroot var to target root

@travier
Copy link
Member

travier commented Feb 12, 2024

So if I understand this correctly, this behaves like the factory option but with the content kept in /var?

At installation time, it will copy /var from the container to the stateroot.

What happens on updates?

If we're moving away from the factory option, shouldn't we completely remove all traces of it to remove confusion?

@cgwalters
Copy link
Member Author

What happens on updates?

The default is that ostree will see that /var is non-empty, and not do anything with any content from the new commit/container /var content.

However, note this aligns with "factory reset" semantics; one can do a flow of rm -rf /var/* and reboot (in practice this would likely need to be done from the initramfs or at least a special systemd target that e.g. kills journald and other processes that might be writing to /var) and get the updated content.

@cgwalters
Copy link
Member Author

If we're moving away from the factory option, shouldn't we completely remove all traces of it to remove confusion?

This is tricky. There's two problems being fixed here. The first is that I discovered belatedly that C+ isn't shipped in the systemd in c9s and so the /var handling didn't work at all there today.

The second is that C+ has more complex semantics, and I think it's ultimately much simpler to just be able to talk about "this works like VOLUME /var".

In theory something else could have started to rely on ostree including the tmpfiles.d rule for /usr/share/factory/var here too.

But yes...perhaps we should just remove it. Yeah, I will do a PR.

@jlebon
Copy link
Member

jlebon commented Mar 25, 2024

BTW, one thing I realized this doesn't handle is that anyone doing /var on a separate filesystem will not get the seed content behaviour. ISTM like instead of deployment time, this should be done at first boot time After=var.mount, Before=local-fs.target or so?

@jlebon
Copy link
Member

jlebon commented Mar 25, 2024

BTW, one thing I realized this doesn't handle is that anyone doing /var on a separate filesystem will not get the seed content behaviour. ISTM like instead of deployment time, this should be done at first boot time After=var.mount, Before=local-fs.target or so?

OK, split that into #3222.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants