This repository has been archived by the owner on Feb 24, 2020. It is now read-only.

gc: fix cleaning mounts and files #3486

Merged
merged 2 commits into rkt:master from to-upstream/mount-gc on Jan 5, 2017

Conversation

lucab
Member

@lucab lucab commented Dec 16, 2016

This fixes multiple issues in pod GC. In particular, rkt now tries
harder to clean up its environment by:

  • looping over all pods, even if one of them is faulty (e.g. some files are busy or missing)
  • leaving manifest removal as the last step, so the next GC run may succeed
  • chroot-ing into the pod before cleaning, to avoid chasing wrong symlinks (e.g. ones ending outside of the pod)
  • chdir-ing to the (host) root, to avoid keeping directories busy
  • loop-unmounting all mountpoints, as single unmount operations may fail (e.g. on a busy mountpoint)
  • detecting busy mounts and marking them as detached for lazy unmount (i.e. MNT_DETACH); see the sketch after this list
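A minimal, hypothetical sketch of the last two points above, i.e. an unmount loop with a lazy-detach fallback. This is not rkt's actual implementation, and it assumes the mountpoint list is already ordered children-first:

```go
package gcsketch

import (
	"fmt"
	"syscall"
)

// unmountAll attempts a plain unmount for every target (children first).
// A busy mountpoint is instead marked as detached (MNT_DETACH) so the
// kernel can finish the cleanup once it ceases to be busy.
func unmountAll(targets []string) error {
	for _, target := range targets {
		err := syscall.Unmount(target, 0)
		if err == nil {
			continue
		}
		if err == syscall.EBUSY {
			// Busy mountpoint: fall back to a lazy unmount.
			if derr := syscall.Unmount(target, syscall.MNT_DETACH); derr != nil {
				return fmt.Errorf("error lazy-detaching %q: %v", target, derr)
			}
			continue
		}
		return fmt.Errorf("error unmounting %q: %v", target, err)
	}
	return nil
}
```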

@euank
Member

euank commented Dec 16, 2016

Fixes #3181, right, because rkt rm will succeed with the lazy umount change?

@lucab
Member Author

lucab commented Dec 16, 2016

@euank I think so, but I have to go and double-check the many similar bugs.

I was using coreos/bugs#1630 (comment) as a testcase, which is close to a pathological worst case, as it does:

  • full propagation of /
  • in host namespace
  • nesting /run and /usr mounts under root (which are thus back propagated to host root)
  • multiple runs, piling over many mounts

@lucab lucab force-pushed the to-upstream/mount-gc branch from 1fd5d6b to b148180 on December 16, 2016 21:24
if err := os.RemoveAll(p.Path()); err != nil {
stderr.PrintE(fmt.Sprintf("unable to remove pod %q", p.UUID), err)
os.Exit(254)
return
Member


Shouldn't this return an err so the exit code can still get set?

Member Author


Yes, it makes sense. I will check if this can be properly fitted at the caller sites.
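For illustration, the direction suggested here (not the actual patch) could look roughly like the following, where removePodDir, podPath and podUUID are hypothetical names; the function returns an error so the callers behind rkt rm / rkt gc can decide the exit code:

```go
package rmsketch

import (
	"fmt"
	"os"
)

// removePodDir reports a failure as an error instead of calling os.Exit,
// leaving the exit-code decision to the caller.
func removePodDir(podPath, podUUID string) error {
	if err := os.RemoveAll(podPath); err != nil {
		return fmt.Errorf("unable to remove pod %q: %v", podUUID, err)
	}
	return nil
}
```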

return fmt.Errorf("error chroot-ing to %q: %s", targetPath, err)
}
defer syscall.Chroot(".")
defer syscall.Fchdir(rootFd)
Contributor


Does it make sense to have a central chroot facility? I did the same thing here https://github.com/coreos/rkt/pull/3490/files#diff-0eb4fc0b150ea176f98a75960dd51bceR475?

Member Author


It may make sense for DRY; however, I'm a bit worried that introducing wrappers here would only make it harder to follow. Also, your chroot helper there bails out on errors, whereas here I'd prefer to defer-clean unconditionally.
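For readers following the thread, here is a minimal sketch of the chroot-and-restore pattern visible in the diff above, wrapped as a hypothetical helper (withPodChroot and its signature are illustrative, not rkt's actual code):

```go
package gcsketch

import (
	"fmt"
	"syscall"
)

// withPodChroot keeps a file descriptor on the host root, chroots into the
// pod rootfs, and unconditionally defers the restore so the process escapes
// the chroot even when fn fails.
func withPodChroot(targetPath string, fn func() error) error {
	rootFd, err := syscall.Open("/", syscall.O_RDONLY|syscall.O_DIRECTORY, 0)
	if err != nil {
		return fmt.Errorf("error opening host root: %v", err)
	}
	defer syscall.Close(rootFd)

	if err := syscall.Chroot(targetPath); err != nil {
		return fmt.Errorf("error chroot-ing to %q: %v", targetPath, err)
	}
	// Deferred calls run LIFO: first fchdir back to the saved host root fd,
	// then chroot(".") to leave the pod rootfs.
	defer syscall.Chroot(".")
	defer syscall.Fchdir(rootFd)

	return fn()
}
```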

// outside of this rootfs - in descending nest order (parent first)
// 5. unmount all mount targets - in ascending nest order (children first).
// If unmount fails, lazy-detach the mount target so that the kernel can
// still clean it up once it ceases to be busy
Contributor


👍 nice
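As an aside, the nesting order mentioned in that comment can be illustrated with a tiny helper that sorts plain target paths so children come before their parents (real mountinfo handling is more involved; this is just a sketch):

```go
package gcsketch

import (
	"sort"
	"strings"
)

// sortChildrenFirst orders mount targets in ascending nest order, i.e.
// deeper paths (children) before their parents, ready for unmounting.
func sortChildrenFirst(targets []string) {
	sort.Slice(targets, func(i, j int) bool {
		return strings.Count(targets[i], "/") > strings.Count(targets[j], "/")
	})
}
```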

@lucab
Member Author

lucab commented Dec 19, 2016

@euank I'm now reporting a boolean success status, as rm already expected. For gc, it looks like a bit of rework is needed on the WalkPods side (and its consumers) to aggregate and report error status. I'm leaving that out of this PR.
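A hypothetical caller-side sketch of that reporting style: the per-pod cleanup returns a success flag, and rm keeps looping over the remaining pods, turning any failure into a non-zero exit status at the end (removePod and runRm are stand-ins, not rkt's real API):

```go
package rmsketch

import (
	"fmt"
	"os"
)

// removePod stands in for the per-pod cleanup; it reports success as a
// boolean instead of exiting on the first failure.
func removePod(uuid string) bool {
	// ... unmount, chroot-clean, remove files; return false on any failure.
	return true
}

func runRm(uuids []string) int {
	exitCode := 0
	for _, uuid := range uuids {
		if !removePod(uuid) {
			fmt.Fprintf(os.Stderr, "rm: unable to remove pod %q\n", uuid)
			exitCode = 254 // keep going over the remaining pods
		}
	}
	return exitCode
}
```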

@lucab
Member Author

lucab commented Jan 5, 2017

I've re-triggered the CI, but this looks like just the usual flake captured in #3477. Otherwise this should be ready to go, PTAL.

@lucab lucab merged commit cd7192e into rkt:master Jan 5, 2017
@jonboulle
Contributor

fwiw I did a cursory review of this the other week and it lgtm

@lucab
Member Author

lucab commented Jan 5, 2017

@jonboulle thanks for the posthumous feedback, I also got an additional double-check offline before merging.

@lucab lucab mentioned this pull request Jan 5, 2017
whereisaaron added a commit to whereisaaron/kube-aws that referenced this pull request Jan 10, 2017
rkt has a gnarly bug (rkt/rkt#3181) that won't be fixed in a hurry (rkt/rkt#3486). It leads to continuous task failures that eventually totally wreck worker nodes (kubernetes-retired#244). In the meantime we can use docker just as easily for this simple task. This workaround was discussed in kubernetes-retired#199.
@redbaron

Will it be feasible to add a flag to NOT unmount lazily, or even to consider removing lazy unmount altogether? I'd like to have explicit errors in case something is holding resources. The fact that umount/rkt rm succeeds doesn't mean there is no leak :(

@lucab
Member Author

lucab commented Jan 10, 2017

Care to elaborate? If I read this correctly, your use case would benefit from a --mode=lazy|strict on rkt rm to control specific pod eviction, which sounds reasonable. On the other hand, I'm not sure whether your request also touches GC, where I'd be more reluctant.

@redbaron

@lucab I was referring to

detecting busy mounts, and marking them as detached for lazy unmount

This bug, possibly together with other bugs, managed to create 200k entries in /proc/mounts within ten minutes on my test Kubernetes controller node. With that number of mounts/namespaces, creating a new busybox container starts to take tens of seconds. Lazy unmount will report success for the whole rkt gc process, which conceals the broken state but doesn't fix it: those 200k mounts still exist in the kernel and occupy resources, and the fact that they are not visible/accessible from userspace makes it even worse, as the server becomes pretty much unresponsive yet it is much harder to debug why.
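(For anyone debugging a similar pile-up, a quick illustrative check of how many mount entries userspace currently sees; lazily-detached mounts drop out of this view even though the kernel may still hold them, which is exactly the concern above:)

```go
package main

import (
	"bytes"
	"fmt"
	"os"
)

// countMounts returns the number of mount entries visible in the current
// mount namespace; lazily-detached mounts no longer show up here even if
// the kernel still has to keep them around until they stop being busy.
func countMounts() (int, error) {
	data, err := os.ReadFile("/proc/self/mountinfo")
	if err != nil {
		return 0, err
	}
	return bytes.Count(data, []byte("\n")), nil
}

func main() {
	n, err := countMounts()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("mount entries:", n)
}
```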

@lucab
Member Author

lucab commented Jan 11, 2017

You are right, but that huge number of mounts exists in the first place because other rkt pods are nested or not properly cleaned up, with shared propagation in place between them. After this change, as the GC pass progresses, those will be untangled pod after pod. The lazy part here is needed to get out of the deadlock without a node reboot or forced unmounts.

@redbaron

The lazy part in here is needed to get out of the deadlock without a node reboot or forced unmounts.

Once you get out of the deadlock, what state will the system be in? My understanding is that it just sweeps the problem under the carpet and no real unmounts happen. If lazy unmount enables the kernel to resolve cyclic dependencies or something like that, so that it does real unmounts, then I am happy with it.

@lucab
Member Author

lucab commented Jan 11, 2017

If lazy unmount enables kernel to resolve cyclic dependencies or something like that, which makes it do real unmounts, then I am happy with it.

That's the rationale for this change; see #3465 for a better explanation of the case at hand.

kylehodgetts pushed a commit to HotelsDotCom/kube-aws that referenced this pull request Mar 27, 2018
rkt has a gnarly bug (rkt/rkt#3181) that won't be fixed in a hurry (rkt/rkt#3486). It leads to continuous task failures that eventually totally wreck worker nodes (kubernetes-retired#244). In the meantime we can use docker just as easily for this simple task. This workaround was discussed in kubernetes-retired#199.