This repository has been archived by the owner on Feb 24, 2020. It is now read-only.

gc: fix cleaning mounts and files #3486

Merged
merged 2 commits into rkt:master from to-upstream/mount-gc on Jan 5, 2017

Conversation

lucab
Member

@lucab lucab commented Dec 16, 2016

This fixes multiple issues in pod GC. In particular, rkt now tries
harder to clean up its environment by:

  • looping over all pods, even if one of them is faulty (e.g. some files are busy or missing)
  • leaving manifest removal as the last step, so the next GC run may succeed
  • chroot-ing into the pod before cleaning, to avoid chasing wrong symlinks (e.g. ones ending outside of the pod)
  • chdir-ing to the (host) root, to avoid keeping directories busy
  • loop-unmounting all mountpoints, as single unmount operations may fail (e.g. on a busy mountpoint)
  • detecting busy mounts and marking them as detached for lazy unmount (i.e. MNT_DETACH); see the sketch after this list
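A minimal, hypothetical sketch of the last two points above, i.e. an unmount loop with a lazy-detach fallback. This is not rkt's actual implementation, and it assumes the mountpoint list is already ordered children-first:

```go
package gcsketch

import (
	"fmt"
	"syscall"
)

// unmountAll attempts a plain unmount for every target (children first).
// A busy mountpoint is instead marked as detached (MNT_DETACH) so the
// kernel can finish the cleanup once it ceases to be busy.
func unmountAll(targets []string) error {
	for _, target := range targets {
		err := syscall.Unmount(target, 0)
		if err == nil {
			continue
		}
		if err == syscall.EBUSY {
			// Busy mountpoint: fall back to a lazy unmount.
			if derr := syscall.Unmount(target, syscall.MNT_DETACH); derr != nil {
				return fmt.Errorf("error lazy-detaching %q: %v", target, derr)
			}
			continue
		}
		return fmt.Errorf("error unmounting %q: %v", target, err)
	}
	return nil
}
```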

@euank
Member

euank commented Dec 16, 2016

Fixes #3181, right, because rkt rm will succeed with the lazy umount change?

@lucab
Member Author

lucab commented Dec 16, 2016

@euank I think so, but I have to go and double-check the many similar bugs.

I was using coreos/bugs#1630 (comment) as a testcase, which is close to a pathological worst case, as it does:

  • full propagation of /
  • in host namespace
  • nesting /run and /usr mounts under root (which are thus back propagated to host root)
  • multiple runs, piling over many mounts

@lucab lucab force-pushed the to-upstream/mount-gc branch from 1fd5d6b to b148180 on December 16, 2016 21:24
if err := os.RemoveAll(p.Path()); err != nil {
stderr.PrintE(fmt.Sprintf("unable to remove pod %q", p.UUID), err)
os.Exit(254)
return
Member


Shouldn't this return an err so the exit code can still get set?

Member Author


Yes, it makes sense. I will check if this can be properly fitted at the caller sites.
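For illustration, the direction suggested here (not the actual patch) could look roughly like the following, where removePodDir, podPath and podUUID are hypothetical names; the function returns an error so the callers behind rkt rm / rkt gc can decide the exit code:

```go
package rmsketch

import (
	"fmt"
	"os"
)

// removePodDir reports a failure as an error instead of calling os.Exit,
// leaving the exit-code decision to the caller.
func removePodDir(podPath, podUUID string) error {
	if err := os.RemoveAll(podPath); err != nil {
		return fmt.Errorf("unable to remove pod %q: %v", podUUID, err)
	}
	return nil
}
```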

return fmt.Errorf("error chroot-ing to %q: %s", targetPath, err)
}
defer syscall.Chroot(".")
defer syscall.Fchdir(rootFd)
Contributor


Does it make sense to have a central chroot facility? I did the same thing here https://github.com/coreos/rkt/pull/3490/files#diff-0eb4fc0b150ea176f98a75960dd51bceR475?

Member Author


It may make sense for DRY; however, I'm a bit worried that introducing wrappers here would only make it harder to follow. Also, your chroot helper there bails out on errors, whereas here I'd prefer to defer-clean unconditionally.
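For readers following the thread, here is a minimal sketch of the chroot-and-restore pattern visible in the diff above, wrapped as a hypothetical helper (withPodChroot and its signature are illustrative, not rkt's actual code):

```go
package gcsketch

import (
	"fmt"
	"syscall"
)

// withPodChroot keeps a file descriptor on the host root, chroots into the
// pod rootfs, and unconditionally defers the restore so the process escapes
// the chroot even when fn fails.
func withPodChroot(targetPath string, fn func() error) error {
	rootFd, err := syscall.Open("/", syscall.O_RDONLY|syscall.O_DIRECTORY, 0)
	if err != nil {
		return fmt.Errorf("error opening host root: %v", err)
	}
	defer syscall.Close(rootFd)

	if err := syscall.Chroot(targetPath); err != nil {
		return fmt.Errorf("error chroot-ing to %q: %v", targetPath, err)
	}
	// Deferred calls run LIFO: first fchdir back to the saved host root fd,
	// then chroot(".") to leave the pod rootfs.
	defer syscall.Chroot(".")
	defer syscall.Fchdir(rootFd)

	return fn()
}
```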

// outside of this rootfs - in descending nest order (parent first)
// 5. unmount all mount targets - in ascending nest order (children first).
// If unmount fails, lazy-detach the mount target so that the kernel can
// still clean it up once it ceases to be busy
Contributor


👍 nice
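As an aside, the nesting order mentioned in that comment can be illustrated with a tiny helper that sorts plain target paths so children come before their parents (real mountinfo handling is more involved; this is just a sketch):

```go
package gcsketch

import (
	"sort"
	"strings"
)

// sortChildrenFirst orders mount targets in ascending nest order, i.e.
// deeper paths (children) before their parents, ready for unmounting.
func sortChildrenFirst(targets []string) {
	sort.Slice(targets, func(i, j int) bool {
		return strings.Count(targets[i], "/") > strings.Count(targets[j], "/")
	})
}
```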

@lucab
Member Author

lucab commented Dec 19, 2016

@euank I'm now reporting a boolean success status, as rm already expected. For gc, it looks like a bit of rework is needed on the WalkPods side (and its consumers) to aggregate and report error status. I'm leaving that out of this PR.
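A hypothetical caller-side sketch of that reporting style: the per-pod cleanup returns a success flag, and rm keeps looping over the remaining pods, turning any failure into a non-zero exit status at the end (removePod and runRm are stand-ins, not rkt's real API):

```go
package rmsketch

import (
	"fmt"
	"os"
)

// removePod stands in for the per-pod cleanup; it reports success as a
// boolean instead of exiting on the first failure.
func removePod(uuid string) bool {
	// ... unmount, chroot-clean, remove files; return false on any failure.
	return true
}

func runRm(uuids []string) int {
	exitCode := 0
	for _, uuid := range uuids {
		if !removePod(uuid) {
			fmt.Fprintf(os.Stderr, "rm: unable to remove pod %q\n", uuid)
			exitCode = 254 // keep going over the remaining pods
		}
	}
	return exitCode
}
```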

@lucab
Member Author

lucab commented Jan 5, 2017

I've re-triggered the CI, but this looks like just the usual flake captured in #3477. Otherwise this should be ready to go, PTAL.

@lucab lucab merged commit cd7192e into rkt:master Jan 5, 2017
@jonboulle
Contributor

fwiw I did a cursory review of this the other week and it lgtm

@lucab
Member Author

lucab commented Jan 5, 2017

@jonboulle thanks for the posthumous feedback, I also got an additional double-check offline before merging.

@lucab lucab mentioned this pull request Jan 5, 2017
whereisaaron added a commit to whereisaaron/kube-aws that referenced this pull request Jan 10, 2017
rkt has a gnarly bug (rkt/rkt#3181) that won't be fixed in a hurry (rkt/rkt#3486). It leads to continuous task failures that eventually totally wreck worker nodes (kubernetes-retired#244). In the meantime we can use docker just as easily for this simple task. This workaround was discussed in kubernetes-retired#199.
@redbaron

Will it be feasible to add a flag to NOT unmount lazily, or even to consider removing lazy unmount altogether? I'd like to have explicit errors in case something is holding resources. The fact that umount/rkt rm succeeds doesn't mean there is no leak :(

@lucab
Member Author

lucab commented Jan 10, 2017

Care to elaborate? If I read this correctly, your use case would benefit from a --mode=lazy|strict on rkt rm to control specific pod eviction, which sounds reasonable. On the other hand, I'm not sure whether your request also touches GC, where I'd be more reluctant.

@redbaron

@lucab I was referring to

detecting busy mounts, and marking them as detached for lazy unmount

This bug, possibly together with other bugs, managed to create 200k entries in /proc/mounts within ten minutes on my test Kubernetes controller node. With that number of mounts/namespaces, creating a new busybox container starts to take tens of seconds. Lazy unmount will report success for the whole rkt gc process, which conceals the broken state but doesn't fix it: those 200k mounts still exist in the kernel and occupy resources, and the fact that they are not visible/accessible from userspace makes it even worse, as the server becomes pretty much unresponsive yet it is much harder to debug why.
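(For anyone debugging a similar pile-up, a quick illustrative check of how many mount entries userspace currently sees; lazily-detached mounts drop out of this view even though the kernel may still hold them, which is exactly the concern above:)

```go
package main

import (
	"bytes"
	"fmt"
	"os"
)

// countMounts returns the number of mount entries visible in the current
// mount namespace; lazily-detached mounts no longer show up here even if
// the kernel still has to keep them around until they stop being busy.
func countMounts() (int, error) {
	data, err := os.ReadFile("/proc/self/mountinfo")
	if err != nil {
		return 0, err
	}
	return bytes.Count(data, []byte("\n")), nil
}

func main() {
	n, err := countMounts()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("mount entries:", n)
}
```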

@lucab
Member Author

lucab commented Jan 11, 2017

You are right, but that huge number of mounts exists in the first place because other rkt pods are nested or not properly cleaned up, with shared propagation in place between them. After this change, as the GC pass progresses, those will be untangled pod after pod. The lazy part here is needed to get out of the deadlock without a node reboot or forced unmounts.

@redbaron

The lazy part in here is needed to get out of the deadlock without a node reboot or forced unmounts.

Once you get out of the deadlock, what state will the system be in? My understanding is that it just sweeps the problem under the carpet and no real unmounts happen. If lazy unmount enables the kernel to resolve cyclic dependencies or something like that, so that it does real unmounts, then I am happy with it.

@lucab
Member Author

lucab commented Jan 11, 2017

If lazy unmount enables kernel to resolve cyclic dependencies or something like that, which makes it do real unmounts, then I am happy with it.

That's the rationale for this change; see #3465 for a better explanation of the case at hand.

kylehodgetts pushed a commit to HotelsDotCom/kube-aws that referenced this pull request Mar 27, 2018
rkt has a gnarly bug (rkt/rkt#3181) that won't be fixed in a hurry (rkt/rkt#3486). It leads to continuous task failures that eventually totally wreck worker nodes (kubernetes-retired#244). In the meantime we can use docker just as easily for this simple task. This workaround was discussed in kubernetes-retired#199.