Fix EBUSY errors under overlayfs and v4.13+ kernels #34948

euank · 2017-09-22T21:04:43Z

This removes and recreates the merged dir with each umount/mount
respectively.
This is done to make the impact of leaking mountpoints have less
user-visible impact.

It's fairly easy to accidentally leak mountpoints (even if moby doesn't,
other tools on linux like 'unshare' are quite able to incidentally do
so).

As of recently, overlayfs reacts to these mounts being leaked (see

One trick to force an unmount is to remove the mounted directory and
recreate it. Devicemapper now does this, overlay can follow suit.

- Description for the changelog

Fix upperdir in use warnings under overlayfs and v4.13+ kernels

- A picture of a cute animal (not mandatory but encouraged)

euank · 2017-09-22T21:06:33Z

cc @rhvgoyal as the author of #22069, which this reverts, and @tonistiigi as the author of #27609, which I think makes the change in #22069 unnecessary.

euank · 2017-09-25T21:22:55Z

The powerpc failure is a flake. I don't know how to kick the jenkins to re-run

thaJeztah · 2017-09-25T21:47:48Z

@euank Looks like you forgot to sign-off the second commit 😅

also ping @dmcgowan @rhvgoyal @tonistiigi PTAL

rhvgoyal · 2017-09-26T13:01:21Z

cc @amir73il

Ok, so kernel commit 2cac0c00a6cdcc9 started giving errors in case of mount point leaks. Trying to avoid mount point leaks completely will make sense only if mount propagation was synchronous. That is by the time unmount finished, all the propagations and unmounting must have finished before it. If not, then we still have races where a leaked mount has not been unmounted yet (because propagation has not happened yet) and we are trying to create a mount point using same upper directory. I don't know if unmount is synchronous or not.

Secondly, even if you make slave, we will still have the race because runc makes all mount points rprivate by default and leaves a small race window where your unmount will not propagate.

Making it slave/shared in a way makes it more likely that we will leak mount points. Think of an application just doing unshare -m and now all future mounts under /var/lib/docker/overlay2/ will propagate to this application.

I feel we can try to minimize mount point leaks but being able to take care of all corner cases will be very hard. So I am wondering if a solution should come from kernel side instead.

I am wondering will it make sense to convert kernel error message to a warning message? Or is there a way where kernel can figure out that we are essentially using same set of lower/upper/work directories and instead of instantiating a new super block, it re-uses existing super block (which is around due to busy mount). Amir will know much more, I am ccing him.

amir73il · 2017-09-26T14:15:49Z

@rhvgoyal overlayfa is right to complain, so kernel warning is not good enough.
re-using the same super block sounds interesting, but I have no idea how hard it would be
or if that idea will be acceptable.

@tonistiigi The root cause of the problem IMO is the chroot() does not Unmount
all the tree under pivotDir.
You see making all mounts private in the new ns is just the first step.
The cloned overlay mounts, even though private are still in use by the new ns,
so they must be unmounted after pivot_root.
The Mount syscall below is not needed because all mounts in new ns are already private.
The Umount syscall below is wrong because it uses MNT_DETACH instead of
unmounting all tree below pivotDir synchronically.

// Make the pivotDir (where the old root lives) private so it can be unmounted without propagating to the host
if err := syscall.Mount("", pivotDir, "", syscall.MS_PRIVATE|syscall.MS_REC, ""); err != nil {
	return fmt.Errorf("Error making old root private after pivot: %v", err)
}

// Now unmount the old root so it's no longer visible from the new root
if err := syscall.Unmount(pivotDir, syscall.MNT_DETACH); err != nil {
	return fmt.Errorf("Error while unmounting old root after pivot: %v", err)
}

euank · 2017-09-26T17:35:01Z

@rhvgoyal

Trying to avoid mount point leaks completely will make sense only if mount propagation was synchronous.

That's not true. I can reliably reproduce a mount point leak to runc (leaked for 100ms to 1s) without this patch. With this patch I can't. It therefore still makes sense to try and avoid it.
Just because we can't be perfect isn't a reason to do nothing.

I don't know if unmount is synchronous or not.

Because the code here uses MNT_DETACH as of #33638 we're opting in to async behavior either way. I think that PR is technically wrong, but that's something to address separate from this.
I do agree this code is still wrong in theory. I'm simply fixing the instance that is hit reliably in practice.

Secondly, even if you make slave, we will still have the race because runc makes all mount points rprivate by default.

False. runc makes the mountpoints rslave. See either my original issue I say this fixes and my discussion there or my pr against runc and the referenced defaulting code.

Making it slave/shared in a way makes it more likely that we will leak mount points. Think of an application just doing unshare -m and now all future mounts under /var/lib/docker/overlay2/ will propagate to this application.

I disagree. unshare -m by default also sets private on /, so what you say will happen won't. An unshare -m someone runs won't get future mounts either before or after this change. It will break things either way, but I'm trying to fix runc stomping on docker, not arbitrary external users stomping on docker. Before or after this change, external users that don't use rslave will break docker roughly equally badly I think.

If they do unshare -m --propagation slave then they will get new mounts, but the umounts from the host will also propagate, so it's not a big deal. That mountpoint isn't leaked unless the program inside of unshare -m then remounts it private.

I feel we can try to minimize mount point leaks but being able to take care of all corner cases will be very hard. So I am wondering if a solution should come from kernel side instead.

A better solution would be something like @cyphar suggests imo.
I think this current change is an incremental improvement that is necessary to fix real issues though.

Or is there a way where kernel can figure out that we are essentially using same set of lower/upper/work directories and instead of instantiating a new super block, it re-uses existing super block (which is around due to busy mount).

A good first step there would be to not do "mount umount mount umount mount" to start a container, but instead just mount it once and leave it mounted.
That's also a significantly more painful refactor to do when there's already bleeding.

@amir73il

The root cause of the problem IMO is the chroot() does not Unmount
all the tree under pivotDir.

Read my original issue (#34672) where I talk about the actual issue I'm trying to solve. There may be more issues, but the issue I'm trying to fix does not actually use any of that code so it's not closely related to what I see as the root problem.

I agree the things you pointed out are problems (though I think the solution is making it rslave not rprivate), but I don't think that's related to getting this merged or the specific issue I'm trying to fix.

amir73il · 2017-09-26T19:51:54Z

Yeh I am confused of all the different issues and fixes. Anyway, the solution of rslave and lazy umount in cloned ns looks good.

euank · 2017-09-26T20:23:41Z

@amir73il

Yeh I am confused of all the different issues and fixes. Anyway, the solution of rslave and lazy umount in cloned ns looks good.

This PR is for fixing something else though. This is to fix the case where mounting /var/lib/overlay2 as private explicitly means that an unshare -m --propagation rslave, such as the one done by runc, will have a private reference to /var/lib/overlay2/$id/merged, meaning that the overlay2 driver leaks the mount until the program with a cloned mountns exits.

Are you willing to review this change, or do you have specific questions about this issue that isn't answered already by my comments in #34672?

amir73il · 2017-09-27T05:41:32Z

I can only ACK the change that doesn't make home dir private. I don't know enough on the bug picture to say if var/lib/docker/overlay is always shared to begin with.

I suppose if makes little sense to make home dir private for other graph drivers? And the rslave change should be made to chroot() as well. Otherwise this fix is correct but partial. You shouldn't fix just the problem you see in front of you when you know there is a bigger issue to solve. But fine with me of this is the first step.

euank · 2017-09-27T06:00:14Z

@amir73il I apologize if I came off a little roughly. The internet does make it hard to convey tone properly.

I can only ACK the change that doesn't make home dir private

Thanks! What I'm looking for here is making sure this change is in the right direction

... var/lib/docker/overlay is always shared to begin with.

I think it generally will be, especially in a systemd world. It's possible this should be swapping MakePrivate over to MakeShared, though I'm a little more wary about explicit shared than defaulting shared

I suppose if makes little sense to make home dir private for other graph drivers?

It might, but I'm not an expert. I do know there are some related issues floating around for devicemapper at the least.

And the rslave change should be made to chroot() as well

Yup, happy to look at that and changing the MNT_DETACH bit later. One step at a time.

euank · 2017-09-28T20:02:33Z

Review bump for this change

rhvgoyal · 2017-09-28T20:30:44Z

@amir73il Thinking more about it, it feels like regression to me (from userspace point of view). Sure, we can try modifying user space to not leak mount points, but kernel upgrade will still break existing container runtime.

And making sure setup is completely right and none of the mounts point are leaking is hard.

So if I take a device /dev/foo and mount it on /mnt/test, leak the mount point, unmount it and mount again, it still works. It most likely gets same superblock.

IMHO, either we need to implement same behavior for overlay or reduce the level from error to warning.

And this is irrespective of docker changes. Sure try to reduce the amount of leaked mount points, that can only help. But it still feels like a regression.

rhvgoyal · 2017-09-28T20:41:53Z

@amir73il In some ways it feels like DOS to me. If an overlay mount point has leaked somewhere, then root can not create a new overlay mount point and get to its data.

Terminal 1

mount -t overlay -o workdir=work,lowerdir=lower,upperdir=upper none merged/

Terminal 2

unshare -m

Terminal 1

umount merged
mount -t overlay -o workdir=work,lowerdir=lower,upperdir=upper none merged/
mount: /root/overlay-testing/merged: none already mounted or mount point busy

Now root can't access its own data due to leaked mount point in some process mount namespace. And that sounds not desirable to me (given how easy it is to leak mount points).

amir73il · 2017-09-29T05:41:51Z

@rhvgoyal wrt regression you are definitely right. This should be fixed in stable kernels.
I will post a patch to make this new enforcement depend on new config option INDEX=y.

tonistiigi · 2017-10-02T17:37:20Z

cc @kolyshkin

kolyshkin · 2017-10-06T02:15:51Z

In my limited testing, this patch indeed helps with EBUSY on /merged on removal, which I can reproduce easily on a RHEL 7.4 system with sysfs fs.may_detach_mounts = 0 and 100-200 short-lived containers starting/ending in parallel (docker run --rm busybox true).

This is not ultimate fix though as I still see occasional EBUSY on /shm removal (which is somewhat
harder to reproduce).

euank · 2017-10-12T01:41:10Z

@kolyshkin For the reasons described in #34672, I also had to apply this runc patch to fix my issue: opencontainers/runc#1598

Is there a separate issue or bugzilla filed for the /shm ebusy you get with further details? It's possible that runc patch could help further, or that other people could help investigate.

Also, review ping! It seems like this PR has stalled out, but I don't think I've gotten any actionable requests / reasons why it's stalled.

@dmcgowan and @rhvgoyal, want to review this change?

Even if overlayfs lets us get away with leaking mounts, I don't think we should be and so IMO this should still be merged.

euank · 2017-11-21T22:39:14Z

Intentionally leaving the fixup! unsigned-off to be sure I go through and autosquash it before this merges.

I think I've addressed or replied to the review comments.

kolyshkin

LGTM except for a tiny nitpick

kolyshkin · 2017-11-22T20:11:53Z

daemon/graphdriver/overlay/overlay.go

+				}
+				if rmErr := os.RemoveAll(mergedDir); rmErr != nil {
+					logrus.Warnf("Failed to remove %s: %v: %v", id, rmErr, err)
+				}


nit: s/craeted/created/

kolyshkin · 2017-11-22T20:15:33Z

daemon/graphdriver/overlay2/overlay.go

@@ -546,8 +543,9 @@ func (d *Driver) Get(id, mountLabel string) (_ containerfs.ContainerFS, retErr e
 				if mntErr := unix.Unmount(mergedDir, 0); mntErr != nil {
 					logrus.Errorf("error unmounting %v: %v", mergedDir, mntErr)
 				}
-				if rmErr := os.RemoveAll(mergedDir); rmErr != nil {
-					logrus.Warnf("Failed to remove %s: %v: %v", id, rmErr, err)
+				// Cleanup the craeted merged directory; see the comment in Put's rmdir


kolyshkin · 2017-11-22T20:33:25Z

Can you please rebase it?

Signed-off-by: Euan Kemp <euan.kemp@coreos.com>

This removes and recreates the merged dir with each umount/mount respectively. This is done to make the impact of leaking mountpoints have less user-visible impact. It's fairly easy to accidentally leak mountpoints (even if moby doesn't, other tools on linux like 'unshare' are quite able to incidentally do so). As of recently, overlayfs reacts to these mounts being leaked (see One trick to force an unmount is to remove the mounted directory and recreate it. Devicemapper now does this, overlay can follow suit. Signed-off-by: Euan Kemp <euan.kemp@coreos.com>

euank · 2017-11-22T22:42:31Z

rebased, addressed nits, autosquashed

cpuguy83 · 2017-11-23T20:04:14Z

From the Z run:

22:52:17 error loading frozen images: ApplyLayer exit status 1 stdout: stderr: unexpected EOF

Rebuilding.

cyphar · 2017-11-24T01:29:34Z

LGTM.

thaJeztah · 2017-11-24T10:01:28Z

ping @kolyshkin PTAL - nits were addressed

cpuguy83

LGTM

kolyshkin · 2017-11-27T17:59:25Z

LGTM (disclaimer: I'm not a maintainer)

thaJeztah

LGTM, thanks all for reviewing!

doits · 2017-12-10T21:09:09Z

Looks like this did not land in 17.11? I had to apply the patch manually, here's a slightly modified version which applies correctly on 17.11 in case anyone else needs this.

thaJeztah · 2017-12-11T00:45:57Z

No it's not part of 17.11, will be in 17.12

euank requested a review from dmcgowan as a code owner September 22, 2017 21:04

GordonTheTurtle added the status/0-triage label Sep 22, 2017

euank force-pushed the public-mounts branch from 77f0db1 to 5c3333d Compare September 22, 2017 21:09

euank mentioned this pull request Sep 22, 2017

overlay2 + linux v4.13: error creating overlay mount to /var/lib/docker/overlay2/ID/merged: device or resource busy #34672

Closed

euank force-pushed the public-mounts branch from 5c3333d to 699fab4 Compare September 22, 2017 21:15

yongtang added status/2-code-review and removed status/0-triage labels Sep 23, 2017

GordonTheTurtle added the dco/no Automatically set by a bot when one of the commits lacks proper signature label Sep 25, 2017

euank force-pushed the public-mounts branch from 8484f90 to fcd2baa Compare September 25, 2017 23:12

GordonTheTurtle removed the dco/no Automatically set by a bot when one of the commits lacks proper signature label Sep 25, 2017

moby deleted a comment from GordonTheTurtle Sep 26, 2017

amir73il mentioned this pull request Sep 29, 2017

docker-17.06.1-ce: error creating overlay mount to /var/lib/docker/overlay2/ID/merged: device or resource busy" coreos/bugs#2127

Closed

GordonTheTurtle assigned vieux Oct 7, 2017

kolyshkin requested changes Nov 22, 2017

View reviewed changes

euank added 2 commits November 22, 2017 14:17

graphdriver/overlay: minor doc comment cleanup

1e214c0

Signed-off-by: Euan Kemp <euan.kemp@coreos.com>

euank force-pushed the public-mounts branch from fddd60f to af0d589 Compare November 22, 2017 22:42

GordonTheTurtle removed the dco/no Automatically set by a bot when one of the commits lacks proper signature label Nov 22, 2017

thaJeztah added area/storage/overlay impact/changelog labels Nov 22, 2017

cpuguy83 added the rebuild/z label Nov 23, 2017

GordonTheTurtle removed the rebuild/z label Nov 23, 2017

thaJeztah added the rebuild/windowsRS1 label Nov 24, 2017

GordonTheTurtle removed the rebuild/windowsRS1 label Nov 24, 2017

cpuguy83 approved these changes Nov 27, 2017

View reviewed changes

kolyshkin approved these changes Nov 27, 2017

View reviewed changes

thaJeztah approved these changes Nov 27, 2017

View reviewed changes

thaJeztah added status/4-merge and removed status/2-code-review labels Nov 27, 2017

AkihiroSuda merged commit 09eb7bc into moby:master Nov 29, 2017

thaJeztah mentioned this pull request Dec 1, 2017

[17.09] Fix EBUSY errors under overlayfs and v4.13+ kernels docker-archive/docker-ce#334

Closed

doits added a commit to doits/docker-ce that referenced this pull request Dec 10, 2017

Apply moby/moby#34948 to fix overlay mount errors

0349d47

euank mentioned this pull request Jan 18, 2018

Do not make graphdriver homes private mounts. #36047

Merged

euank deleted the public-mounts branch January 25, 2018 07:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix EBUSY errors under overlayfs and v4.13+ kernels #34948

Fix EBUSY errors under overlayfs and v4.13+ kernels #34948

euank commented Sep 22, 2017 •

edited

Loading

euank commented Sep 22, 2017 •

edited

Loading

euank commented Sep 25, 2017

thaJeztah commented Sep 25, 2017

rhvgoyal commented Sep 26, 2017

amir73il commented Sep 26, 2017 •

edited

Loading

euank commented Sep 26, 2017 •

edited

Loading

amir73il commented Sep 26, 2017

euank commented Sep 26, 2017

amir73il commented Sep 27, 2017

euank commented Sep 27, 2017

euank commented Sep 28, 2017

rhvgoyal commented Sep 28, 2017

rhvgoyal commented Sep 28, 2017 •

edited

Loading

amir73il commented Sep 29, 2017

tonistiigi commented Oct 2, 2017

kolyshkin commented Oct 6, 2017

euank commented Oct 12, 2017

euank commented Nov 21, 2017 •

edited

Loading

kolyshkin left a comment

kolyshkin Nov 22, 2017

kolyshkin Nov 22, 2017

kolyshkin commented Nov 22, 2017

euank commented Nov 22, 2017 •

edited

Loading

cpuguy83 commented Nov 23, 2017

cyphar commented Nov 24, 2017

thaJeztah commented Nov 24, 2017

cpuguy83 left a comment

kolyshkin commented Nov 27, 2017

thaJeztah left a comment

doits commented Dec 10, 2017

thaJeztah commented Dec 11, 2017

Fix EBUSY errors under overlayfs and v4.13+ kernels #34948

Fix EBUSY errors under overlayfs and v4.13+ kernels #34948

Conversation

euank commented Sep 22, 2017 • edited Loading

euank commented Sep 22, 2017 • edited Loading

euank commented Sep 25, 2017

thaJeztah commented Sep 25, 2017

rhvgoyal commented Sep 26, 2017

amir73il commented Sep 26, 2017 • edited Loading

euank commented Sep 26, 2017 • edited Loading

amir73il commented Sep 26, 2017

euank commented Sep 26, 2017

amir73il commented Sep 27, 2017

euank commented Sep 27, 2017

euank commented Sep 28, 2017

rhvgoyal commented Sep 28, 2017

rhvgoyal commented Sep 28, 2017 • edited Loading

Terminal 1

Terminal 2

Terminal 1

amir73il commented Sep 29, 2017

tonistiigi commented Oct 2, 2017

kolyshkin commented Oct 6, 2017

euank commented Oct 12, 2017

euank commented Nov 21, 2017 • edited Loading

kolyshkin left a comment

Choose a reason for hiding this comment

kolyshkin Nov 22, 2017

Choose a reason for hiding this comment

kolyshkin Nov 22, 2017

Choose a reason for hiding this comment

kolyshkin commented Nov 22, 2017

euank commented Nov 22, 2017 • edited Loading

cpuguy83 commented Nov 23, 2017

cyphar commented Nov 24, 2017

thaJeztah commented Nov 24, 2017

cpuguy83 left a comment

Choose a reason for hiding this comment

kolyshkin commented Nov 27, 2017

thaJeztah left a comment

Choose a reason for hiding this comment

doits commented Dec 10, 2017

thaJeztah commented Dec 11, 2017

euank commented Sep 22, 2017 •

edited

Loading

euank commented Sep 22, 2017 •

edited

Loading

amir73il commented Sep 26, 2017 •

edited

Loading

euank commented Sep 26, 2017 •

edited

Loading

rhvgoyal commented Sep 28, 2017 •

edited

Loading

euank commented Nov 21, 2017 •

edited

Loading

euank commented Nov 22, 2017 •

edited

Loading