Remove MountFlags in systemd unit to allow shared mount propagation #22806
Thanks! This may need some eyes to check if there's possible side-effects, so here goes; ping @vbatts @rhatdan @alexlarsson (hi!) @runcom @cpuguy83 @AkihiroSuda @rhvgoyal ptal ❤️ |
Two things.
|
BTW, in the Fedora systemd unit file we have removed |
cc @rhatdan |
This makes a lot of sense; I'm happy to consider implementing that instead of what I did here. My primary concern is running a containerised kubelet, and my secondary concern is that the propagation flags feature is actually unusable with the systemd unit as is right now. |
Right. We wanted to use the mount propagation flag feature too, so that mounts made from inside containers are visible in the host mount namespace, and that's the primary reason to run the docker daemon in the host mount namespace. |
We have moved docker daemon on RHEL and Fedora to run in the same mount namespace as the host system, in order to allow mount propagation to work. |
Yeah. I would just remove the MountFlags= entirely. It was added because container mounts were showing up in the host's mounts, and then if another pid unshared (e.g. cups), it would hold open the mountinfo. I'm guessing that could still be an issue, but forcing the daemon pid and all children to propagate was causing more issues. The best way to handle this is by invoking directly and not having a daemon in the middle, but that isn't how things are structured. |
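For readers who want to see what propagation mode a mount actually has, here is a minimal sketch (not part of the PR) that classifies one line of `/proc/<pid>/mountinfo` by its optional fields. The function name `propagation_of` and the sample lines are illustrative assumptions; the field layout follows the proc(5) description of mountinfo, where `shared:N` marks a shared mount, `master:N` marks a slave mount, and no tag means private.

```python
def propagation_of(mountinfo_line):
    """Return (mount_point, propagation) for one /proc/<pid>/mountinfo line.

    Hypothetical helper for illustration; it only inspects the optional
    fields that sit between the mount options and the " - " separator.
    """
    # Everything before " - " is: mount ID, parent ID, major:minor, root,
    # mount point, mount options, then zero or more optional fields.
    left = mountinfo_line.partition(" - ")[0]
    fields = left.split()
    mount_point = fields[4]
    tags = {field.partition(":")[0] for field in fields[6:]}

    if "unbindable" in tags:
        return mount_point, "unbindable"
    if "shared" in tags and "master" in tags:
        return mount_point, "shared+slave"
    if "shared" in tags:
        return mount_point, "shared"
    if "master" in tags:
        return mount_point, "slave"
    return mount_point, "private"


# Sample lines (made up for this sketch, in the documented format).
SHARED = "25 1 8:1 / / rw,relatime shared:1 - ext4 /dev/sda1 rw"
SLAVE = "36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue"
PRIVATE = "40 25 0:35 / /data rw,relatime - ext4 /dev/sda2 rw"

for line in (SHARED, SLAVE, PRIVATE):
    print(propagation_of(line))
```

On a live system one could feed it the lines of `/proc/<dockerd pid>/mountinfo`; with `MountFlags=slave` in the unit, the daemon's mounts would be expected to show up as slave rather than shared, which is why shared propagation back to the host does not work in that setup.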
What are the user-facing side effects of removing the |
@thaJeztah One user facing impact is with devicemapper as I already mentioned. You might get device busy errors more often w.r.t container exits and deletion. So for these users we will have to ask them to use deferred removal and deferred deletion capabilities. If it works well, we might want to make it default. (Right now these are disabled by default and one has to opt in). We enable it by default on fedora/rhel with the help of docker-storage-setup. |
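As a rough illustration of the opt-in mentioned above, the sketch below builds a `daemon.json` fragment that enables deferred removal and deferred deletion through the devicemapper storage options `dm.use_deferred_removal` and `dm.use_deferred_deletion`. The helper name `with_deferred_devicemapper` is hypothetical, and the sketch assumes the daemon is configured through the `storage-opts` key of `daemon.json` rather than through `--storage-opt` flags or docker-storage-setup.

```python
import json

# Deferred deletion depends on deferred removal, so both are enabled together.
DEFERRED_OPTS = (
    "dm.use_deferred_removal=true",
    "dm.use_deferred_deletion=true",
)


def with_deferred_devicemapper(config=None):
    """Return a copy of a daemon.json-style dict with both options enabled.

    Hypothetical helper for illustration; existing storage options are kept
    and the deferred options are appended only if they are missing.
    """
    result = dict(config or {})
    opts = list(result.get("storage-opts", []))
    for opt in DEFERRED_OPTS:
        if opt not in opts:
            opts.append(opt)
    result["storage-opts"] = opts
    return result


example = with_deferred_devicemapper({"storage-driver": "devicemapper"})
print(json.dumps(example, indent=2))
```

The generated JSON would still need to be written to the daemon's configuration file and the daemon restarted; that step is left out here because it depends on how the host is managed.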
@thaJeztah LGTM(non-binding) for removing the line |
@errordeveloper do you think it's better just to remove MountFlags? |
I'll test if removing the flag works for my use-case. |
Signed-off-by: Ilya Dmitrichenko <errordeveloper@gmail.com>
I've tested it and it works for me with containerised kubelet, update the commit. |
LGTM |
LGTM |
Just for reference, EDIT: Our OS distro is Debian Jessie with 3.16.x kernel. |
Yes on RHEL this will be a problem until we fix the underlying kernel issues. |
I guess we should have reverted this when we found additional issues in RHEL systems. This article explains what the problems are. |
@rhatdan It might be better to handle the mount ns setup in dockerd for finer-grained control and have it work the same regardless of how docker is started. |
@rhatdan Would love to work on this so we can get it right. |
@cpuguy83 Or maybe leave it to the caller. By running in a private namespace, we also lose some features. For example, shared volume propagation does not propagate all the way to the host. --live-restore is another issue. So it looks like a bunch of features expect the docker daemon to be running in the host mount namespace. At the same time, it can cause problems if those mount points leak somewhere else. Given that I have received so many complaints of the device being busy with the devicemapper driver, a part of me says that docker should always be run in a separate mount namespace to reduce the possibility of accidental mount point leaks, and sacrifice some features. |
PR opened here; #31490 |
* MountFlags=slave
  preventing us from using live restore functionality
  moby/moby#22806 (comment)
  https://access.redhat.com/articles/2938171
* LimitNOFILE=infinity LimitNPROC=infinity
  not-insignificant performance overhead due to limits being propagated to all children (containerd + containers)
  moby/moby@8db6109
* Delegate=yes
  allow docker to manage its cgroup subtree without systemd interference
  moby/moby#20152 moby/moby@d16737f
* TasksMax=infinity
  prevent systemd from setting a default task limit of 512 on the engine cgroup, on linux >=4.3
  systemd/systemd#1239 systemd/systemd#1886
Signed-off-by: Robert Günzler <robertg@balena.io>
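To check which of the directives named in the commit message above a given unit file actually sets, something like the following sketch could be used. It is a simple line-based scan, not a full systemd unit parser (it ignores drop-ins, line continuations, and repeated assignments), and the helper name `service_directives`, the sample unit text, and its `ExecStart` path are all assumptions made for illustration.

```python
# Directives discussed in the commit message above.
DIRECTIVES = ("MountFlags", "LimitNOFILE", "LimitNPROC", "Delegate", "TasksMax")


def service_directives(unit_text, names=DIRECTIVES):
    """Return {directive: value} for selected keys in the [Service] section.

    Hypothetical helper: a plain line scan, so commented-out settings are
    skipped and the last assignment of a key wins.
    """
    found = {}
    section = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # blank line or comment
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        if section != "Service" or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() in names:
            found[key.strip()] = value.strip()
    return found


# Made-up unit text for the sketch; not the unit shipped by any distribution.
SAMPLE_UNIT = """\
[Unit]
Description=Example engine unit (hypothetical)

[Service]
ExecStart=/usr/bin/dockerd
MountFlags=slave
LimitNOFILE=infinity
# TasksMax=infinity
Delegate=yes
"""

print(service_directives(SAMPLE_UNIT))
```

On a real host the text could come from the unit file on disk; a result containing `MountFlags` would indicate the setting this PR removes is still present.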
ref #19625 (comment)