Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix EBUSY errors under overlayfs and v4.13+ kernels #34948

Merged
merged 2 commits into from
Nov 29, 2017

Conversation

euank
Copy link
Contributor

@euank euank commented Sep 22, 2017

This removes and recreates the merged dir with each umount/mount
respectively.
This is done to make the impact of leaking mountpoints have less
user-visible impact.

It's fairly easy to accidentally leak mountpoints (even if moby doesn't,
other tools on linux like 'unshare' are quite able to incidentally do
so).

As of recently, overlayfs reacts to these mounts being leaked (see

One trick to force an unmount is to remove the mounted directory and
recreate it. Devicemapper now does this, overlay can follow suit.

- Description for the changelog

Fix upperdir in use warnings under overlayfs and v4.13+ kernels

- A picture of a cute animal (not mandatory but encouraged)

red panda

@euank
Copy link
Contributor Author

euank commented Sep 22, 2017

cc @rhvgoyal as the author of #22069, which this reverts, and @tonistiigi as the author of #27609, which I think makes the change in #22069 unnecessary.

@euank
Copy link
Contributor Author

euank commented Sep 25, 2017

The powerpc failure is a flake. I don't know how to kick the jenkins to re-run

@GordonTheTurtle GordonTheTurtle added the dco/no Automatically set by a bot when one of the commits lacks proper signature label Sep 25, 2017
@thaJeztah
Copy link
Member

@euank Looks like you forgot to sign-off the second commit 😅

also ping @dmcgowan @rhvgoyal @tonistiigi PTAL

@GordonTheTurtle GordonTheTurtle removed the dco/no Automatically set by a bot when one of the commits lacks proper signature label Sep 25, 2017
@rhvgoyal
Copy link
Contributor

cc @amir73il

Ok, so kernel commit 2cac0c00a6cdcc9 started giving errors in case of mount point leaks. Trying to avoid mount point leaks completely will make sense only if mount propagation was synchronous. That is by the time unmount finished, all the propagations and unmounting must have finished before it. If not, then we still have races where a leaked mount has not been unmounted yet (because propagation has not happened yet) and we are trying to create a mount point using same upper directory. I don't know if unmount is synchronous or not.

Secondly, even if you make slave, we will still have the race because runc makes all mount points rprivate by default and leaves a small race window where your unmount will not propagate.

Making it slave/shared in a way makes it more likely that we will leak mount points. Think of an application just doing unshare -m and now all future mounts under /var/lib/docker/overlay2/ will propagate to this application.

I feel we can try to minimize mount point leaks but being able to take care of all corner cases will be very hard. So I am wondering if a solution should come from kernel side instead.

I am wondering will it make sense to convert kernel error message to a warning message? Or is there a way where kernel can figure out that we are essentially using same set of lower/upper/work directories and instead of instantiating a new super block, it re-uses existing super block (which is around due to busy mount). Amir will know much more, I am ccing him.

@amir73il
Copy link
Contributor

amir73il commented Sep 26, 2017

@rhvgoyal overlayfa is right to complain, so kernel warning is not good enough.
re-using the same super block sounds interesting, but I have no idea how hard it would be
or if that idea will be acceptable.

@tonistiigi The root cause of the problem IMO is the chroot() does not Unmount
all the tree under pivotDir.
You see making all mounts private in the new ns is just the first step.
The cloned overlay mounts, even though private are still in use by the new ns,
so they must be unmounted after pivot_root.
The Mount syscall below is not needed because all mounts in new ns are already private.
The Umount syscall below is wrong because it uses MNT_DETACH instead of
unmounting all tree below pivotDir synchronically.

// Make the pivotDir (where the old root lives) private so it can be unmounted without propagating to the host
if err := syscall.Mount("", pivotDir, "", syscall.MS_PRIVATE|syscall.MS_REC, ""); err != nil {
	return fmt.Errorf("Error making old root private after pivot: %v", err)
}

// Now unmount the old root so it's no longer visible from the new root
if err := syscall.Unmount(pivotDir, syscall.MNT_DETACH); err != nil {
	return fmt.Errorf("Error while unmounting old root after pivot: %v", err)
}

@moby moby deleted a comment from GordonTheTurtle Sep 26, 2017
@euank
Copy link
Contributor Author

euank commented Sep 26, 2017

@rhvgoyal

Trying to avoid mount point leaks completely will make sense only if mount propagation was synchronous.

That's not true. I can reliably reproduce a mount point leak to runc (leaked for 100ms to 1s) without this patch. With this patch I can't. It therefore still makes sense to try and avoid it.
Just because we can't be perfect isn't a reason to do nothing.

I don't know if unmount is synchronous or not.

Because the code here uses MNT_DETACH as of #33638 we're opting in to async behavior either way. I think that PR is technically wrong, but that's something to address separate from this.
I do agree this code is still wrong in theory. I'm simply fixing the instance that is hit reliably in practice.

Secondly, even if you make slave, we will still have the race because runc makes all mount points rprivate by default.

False. runc makes the mountpoints rslave. See either my original issue I say this fixes and my discussion there or my pr against runc and the referenced defaulting code.

Making it slave/shared in a way makes it more likely that we will leak mount points. Think of an application just doing unshare -m and now all future mounts under /var/lib/docker/overlay2/ will propagate to this application.

I disagree. unshare -m by default also sets private on /, so what you say will happen won't. An unshare -m someone runs won't get future mounts either before or after this change. It will break things either way, but I'm trying to fix runc stomping on docker, not arbitrary external users stomping on docker. Before or after this change, external users that don't use rslave will break docker roughly equally badly I think.

If they do unshare -m --propagation slave then they will get new mounts, but the umounts from the host will also propagate, so it's not a big deal. That mountpoint isn't leaked unless the program inside of unshare -m then remounts it private.

I feel we can try to minimize mount point leaks but being able to take care of all corner cases will be very hard. So I am wondering if a solution should come from kernel side instead.

A better solution would be something like @cyphar suggests imo.
I think this current change is an incremental improvement that is necessary to fix real issues though.

Or is there a way where kernel can figure out that we are essentially using same set of lower/upper/work directories and instead of instantiating a new super block, it re-uses existing super block (which is around due to busy mount).

A good first step there would be to not do "mount umount mount umount mount" to start a container, but instead just mount it once and leave it mounted.
That's also a significantly more painful refactor to do when there's already bleeding.


@amir73il

The root cause of the problem IMO is the chroot() does not Unmount
all the tree under pivotDir.

Read my original issue (#34672) where I talk about the actual issue I'm trying to solve. There may be more issues, but the issue I'm trying to fix does not actually use any of that code so it's not closely related to what I see as the root problem.

I agree the things you pointed out are problems (though I think the solution is making it rslave not rprivate), but I don't think that's related to getting this merged or the specific issue I'm trying to fix.

@amir73il
Copy link
Contributor

Yeh I am confused of all the different issues and fixes. Anyway, the solution of rslave and lazy umount in cloned ns looks good.

@euank
Copy link
Contributor Author

euank commented Sep 26, 2017

@amir73il

Yeh I am confused of all the different issues and fixes. Anyway, the solution of rslave and lazy umount in cloned ns looks good.

This PR is for fixing something else though. This is to fix the case where mounting /var/lib/overlay2 as private explicitly means that an unshare -m --propagation rslave, such as the one done by runc, will have a private reference to /var/lib/overlay2/$id/merged, meaning that the overlay2 driver leaks the mount until the program with a cloned mountns exits.

Are you willing to review this change, or do you have specific questions about this issue that isn't answered already by my comments in #34672?

@amir73il
Copy link
Contributor

I can only ACK the change that doesn't make home dir private. I don't know enough on the bug picture to say if var/lib/docker/overlay is always shared to begin with.

I suppose if makes little sense to make home dir private for other graph drivers? And the rslave change should be made to chroot() as well. Otherwise this fix is correct but partial. You shouldn't fix just the problem you see in front of you when you know there is a bigger issue to solve. But fine with me of this is the first step.

@euank
Copy link
Contributor Author

euank commented Sep 27, 2017

@amir73il I apologize if I came off a little roughly. The internet does make it hard to convey tone properly.

I can only ACK the change that doesn't make home dir private

Thanks! What I'm looking for here is making sure this change is in the right direction

... var/lib/docker/overlay is always shared to begin with.

I think it generally will be, especially in a systemd world. It's possible this should be swapping MakePrivate over to MakeShared, though I'm a little more wary about explicit shared than defaulting shared

I suppose if makes little sense to make home dir private for other graph drivers?

It might, but I'm not an expert. I do know there are some related issues floating around for devicemapper at the least.

And the rslave change should be made to chroot() as well

Yup, happy to look at that and changing the MNT_DETACH bit later. One step at a time.

@euank
Copy link
Contributor Author

euank commented Sep 28, 2017

Review bump for this change

@rhvgoyal
Copy link
Contributor

@amir73il Thinking more about it, it feels like regression to me (from userspace point of view). Sure, we can try modifying user space to not leak mount points, but kernel upgrade will still break existing container runtime.

And making sure setup is completely right and none of the mounts point are leaking is hard.

So if I take a device /dev/foo and mount it on /mnt/test, leak the mount point, unmount it and mount again, it still works. It most likely gets same superblock.

IMHO, either we need to implement same behavior for overlay or reduce the level from error to warning.

And this is irrespective of docker changes. Sure try to reduce the amount of leaked mount points, that can only help. But it still feels like a regression.

@rhvgoyal
Copy link
Contributor

rhvgoyal commented Sep 28, 2017

@amir73il In some ways it feels like DOS to me. If an overlay mount point has leaked somewhere, then root can not create a new overlay mount point and get to its data.

Terminal 1

mount -t overlay -o workdir=work,lowerdir=lower,upperdir=upper none merged/

Terminal 2

unshare -m

Terminal 1

umount merged
mount -t overlay -o workdir=work,lowerdir=lower,upperdir=upper none merged/
mount: /root/overlay-testing/merged: none already mounted or mount point busy

Now root can't access its own data due to leaked mount point in some process mount namespace. And that sounds not desirable to me (given how easy it is to leak mount points).

@amir73il
Copy link
Contributor

@rhvgoyal wrt regression you are definitely right. This should be fixed in stable kernels.
I will post a patch to make this new enforcement depend on new config option INDEX=y.

@tonistiigi
Copy link
Member

cc @kolyshkin

@kolyshkin
Copy link
Contributor

In my limited testing, this patch indeed helps with EBUSY on /merged on removal, which I can reproduce easily on a RHEL 7.4 system with sysfs fs.may_detach_mounts = 0 and 100-200 short-lived containers starting/ending in parallel (docker run --rm busybox true).

This is not ultimate fix though as I still see occasional EBUSY on /shm removal (which is somewhat
harder to reproduce).

@euank
Copy link
Contributor Author

euank commented Oct 12, 2017

@kolyshkin For the reasons described in #34672, I also had to apply this runc patch to fix my issue: opencontainers/runc#1598

Is there a separate issue or bugzilla filed for the /shm ebusy you get with further details? It's possible that runc patch could help further, or that other people could help investigate.

Also, review ping! It seems like this PR has stalled out, but I don't think I've gotten any actionable requests / reasons why it's stalled.

@dmcgowan and @rhvgoyal, want to review this change?

Even if overlayfs lets us get away with leaking mounts, I don't think we should be and so IMO this should still be merged.

@euank
Copy link
Contributor Author

euank commented Nov 21, 2017

Intentionally leaving the fixup! unsigned-off to be sure I go through and autosquash it before this merges.

I think I've addressed or replied to the review comments.

Copy link
Contributor

@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for a tiny nitpick

}
if rmErr := os.RemoveAll(mergedDir); rmErr != nil {
logrus.Warnf("Failed to remove %s: %v: %v", id, rmErr, err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/craeted/created/

@@ -546,8 +543,9 @@ func (d *Driver) Get(id, mountLabel string) (_ containerfs.ContainerFS, retErr e
if mntErr := unix.Unmount(mergedDir, 0); mntErr != nil {
logrus.Errorf("error unmounting %v: %v", mergedDir, mntErr)
}
if rmErr := os.RemoveAll(mergedDir); rmErr != nil {
logrus.Warnf("Failed to remove %s: %v: %v", id, rmErr, err)
// Cleanup the craeted merged directory; see the comment in Put's rmdir
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@kolyshkin
Copy link
Contributor

Can you please rebase it?

Signed-off-by: Euan Kemp <euan.kemp@coreos.com>
This removes and recreates the merged dir with each umount/mount
respectively.
This is done to make the impact of leaking mountpoints have less
user-visible impact.

It's fairly easy to accidentally leak mountpoints (even if moby doesn't,
other tools on linux like 'unshare' are quite able to incidentally do
so).

As of recently, overlayfs reacts to these mounts being leaked (see

One trick to force an unmount is to remove the mounted directory and
recreate it. Devicemapper now does this, overlay can follow suit.

Signed-off-by: Euan Kemp <euan.kemp@coreos.com>
@GordonTheTurtle GordonTheTurtle removed the dco/no Automatically set by a bot when one of the commits lacks proper signature label Nov 22, 2017
@euank
Copy link
Contributor Author

euank commented Nov 22, 2017

rebased, addressed nits, autosquashed

@cpuguy83
Copy link
Member

From the Z run:

22:52:17 error loading frozen images: ApplyLayer exit status 1 stdout: stderr: unexpected EOF

Rebuilding.

@cyphar
Copy link
Contributor

cyphar commented Nov 24, 2017

LGTM.

@thaJeztah
Copy link
Member

ping @kolyshkin PTAL - nits were addressed

Copy link
Member

@cpuguy83 cpuguy83 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kolyshkin
Copy link
Contributor

LGTM (disclaimer: I'm not a maintainer)

Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks all for reviewing!

@doits
Copy link

doits commented Dec 10, 2017

Looks like this did not land in 17.11? I had to apply the patch manually, here's a slightly modified version which applies correctly on 17.11 in case anyone else needs this.

doits added a commit to doits/docker-ce that referenced this pull request Dec 10, 2017
@thaJeztah
Copy link
Member

No it's not part of 17.11, will be in 17.12

@euank euank deleted the public-mounts branch January 25, 2018 07:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.