Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

switchroot: Ensure /sysroot is set to "private" propagation #1438

Closed
wants to merge 1 commit into from

Conversation

cgwalters
Copy link
Member

Downstream BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1498281

This came up as a problem with oci-umount which was trying to ensure some host
mounts like /var/lib/containers don't leak into privileged containers. But
since our /sysroot mount wasn't private we also got a copy there.

We should have done this from the very start - it makes findmnt way, way less
ugly and is just the obviously right thing to do, will possibly create world
peace etc.

Downstream BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1498281

This came up as a problem with `oci-umount` which was trying to ensure some host
mounts like `/var/lib/containers` don't leak into privileged containers.  But
since our `/sysroot` mount wasn't private we also got a copy there.

We should have done this from the very start - it makes `findmnt` way, way less
ugly and is just the obviously right thing to do, will possibly create world
peace etc.
@cgwalters
Copy link
Member Author

See also #960

@rhvgoyal
Copy link

Making this private will only prevent future mount propagation. But all the mounts will be visible when container is created?

@cgwalters
Copy link
Member Author

We're making the sysroot private very early in bootup, long before container systems are started. That seemed to solve the problem of having two mounts in my testing; make sense?

@rhvgoyal
Copy link

Still can't understand it. Even if you make it private very early, if a process later does unshare or clone, it will still see this mount point. Just that if it is private, then any further mounts under this mount will not propagate to other mount namespaces. So not sure how making it private made sure that process which are starting later will not see it or will not get copy of this mount upon clone or unshare.

@cgwalters
Copy link
Member Author

We could make it unbindable too - this basically seems to be the use case for MS_UNBINDABLE.

But still, you said in:
https://bugzilla.redhat.com/show_bug.cgi?id=1498281#c12

I think it is a problem where source mount is visible on host in multiple paths

And now the mount isn't visible in multiple paths, right?

@rhvgoyal
Copy link

MS_UNBINDABLE will not work always. Reason being that libcontainer can change that mount property after doing clone(). IIRC, by default it makes the whole tree "MS_PRIVATE | MS_REC". And that means you will lose the MS_UNBINDABLE property on the mount you don't want to be visible.

@rhvgoyal
Copy link

I guess we probably need something MS_UNCLONABLE, so that upon clone()/unshare(), certain mount and its children are not cloned in newly created copy of mount namespace. I am not aware of any such thing being available.

@cgwalters
Copy link
Member Author

I think what I'm getting at here is, I tested the reproducer from
https://bugzilla.redhat.com/show_bug.cgi?id=1498281#c0
and this patch fixes it. Is there another concrete case you're talking about, and if so, can you work up a docker/bwrap/etc. cmdline for it? Or is this theoretical?

@rhvgoyal
Copy link

If you just want to go by your test results, feel free to commit it. I am getting more at logical explanation of why you are seeing the results.

We don't understand yet why your changes fixed the issue. And I think it is important to understand that.

Can you please help me understand how making a mount MS_PRIVATE will make sure it does not leak into container upon clone().

@cgwalters
Copy link
Member Author

We don't understand yet why your changes fixed the issue. And I think it is important to understand that. Can you please help me understand how making a mount MS_PRIVATE will make sure it does not leak into container upon clone().

It's not about not leaking on clone() - the fix is to not have duplicate mounts in the first place. Right?

@rhvgoyal
Copy link

Yes, if you don't have duplicate mount on host itself that will solve the issue. Ok, so by making it MS_PRIVATE, it will not propagate to other place later I guess and that's how problem has been solved. If yes, that makes sense.

@rhvgoyal
Copy link

I am assuming that when docker starts, it creates /var/lib/docker/overlay2 mount and it might be propagating under /sysroot/...... and then it leaks into container. So if some mount under /sysroot/.... is
made rprivate, I guess that would mean /var/lib/docker/overlay2 mount will not propagate to /sysroot/...
and that should solve the issue.

@cgwalters
Copy link
Member Author

@rh-atomic-bot delegate=rhvgoyal

@rh-atomic-bot
Copy link

✌️ @rhvgoyal can now approve this pull request

@rhvgoyal
Copy link

LGTM

@rhvgoyal
Copy link

@rh-atomic-bot r+ 4bc681d

@rh-atomic-bot
Copy link

⌛ Testing commit 4bc681d with merge 2b8d586...

@rh-atomic-bot
Copy link

☀️ Test successful - status-atomicjenkins
Approved by: rhvgoyal
Pushing 2b8d586 to master...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants