Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Propose rollout for Docker shared PID namespace #207

Merged
merged 4 commits into from
Jan 26, 2017
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -86,7 +86,7 @@ container setup that are not currently trackable as Pod constraints, e.g.,
filesystem setup, container image pulling, etc.*

A container in a PodSandbox maps to an application in the Pod Spec. For Linux
containers, they are expected to share at least network and IPC namespaces,
containers, they are expected to share at least network, PID and IPC namespaces,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think we should change an old design proposal. This should be documented somewhere else.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the change should be reflected somewhere, and SIG Node pointed me here.

I guess the broader question is whether proposals are living documents, but that's out of scope for this PR. If you want me to make the change somewhere else please suggest a place.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do update old proposals to reflect new features or changes to design originally proposed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's impossible to keep all the proposals up-to-date though. Adding a separate spec for CRI anyway has to be done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, since CRI v1alpha1 is not out yet, and shared pid namespace is what we agreed for v1alpha1, updating the proposal might be the least effort we could do?

with sharing more namespaces discussed in [#1615](https://issues.k8s.io/1615).


Expand Down
77 changes: 77 additions & 0 deletions contributors/design-proposals/pod-pid-namespace.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,77 @@
# Shared PID Namespace

Pods share namespaces where possible, but a requirement for sharing the PID
namespace has not been defined due to lack of support in Docker. Docker began
supporting a shared PID namespace in 1.12, and other Kubernetes runtimes (rkt,
cri-o, hyper) have already implemented a shared PID namespace.

This proposal defines a shared PID namespace as a requirement of the Container
Runtime Interface and links its rollout in Docker to that of the CRI.

## Motivation

Sharing a PID namespace between containers in a pod is discussed in
[#1615](https://issues.k8s.io/1615), and enables:

1. signaling between containers, which is useful for side cars (e.g. for
signaling a daemon process after rotating logs).
2. easier troubleshooting of pods.
3. addressing [Docker's zombie problem][1] by reaping orphaned zombies in the
infra container.

## Goals and Non-Goals

Goals include:
- Changing default behavior in the Docker runtime as implemented by the CRI
- Making Docker behavior compatible with the other Kubernetes runtimes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of this goal?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. I had questions about the goals. The first one sounds more like an approach to achieve a goal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My primary goal is to enable the shared PID namespace in the docker runtime. That's what I want to happen.

The secondary goal is to make this consistent across runtimes (where feasible). A shared PID namespace is consistent with the pod abstraction. That's the way I think things should be, but it wasn't my original goal. It was suggested by a reviewer.


Non-goals include:
- Creating an init solution that works for all runtimes
- Supporting isolated PID namespace indefinitely

## Modification to the Docker Runtime

We will modify the Docker implementation of the CRI to use a shared PID
namespace when running with a version of Docker >= 1.12. The legacy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't v1.12 have issues with reparenting? Should we wait for v1.13?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, 1.12 reparents orphans incorrectly and they are reaped outside of the container. That's a no-op in our use case since we don't really care what it is that reaps the zombies. Regular zombies are unaffected.

Choosing whatever version > 1.12 fixes this bug would be a reasonable policy choice, but it's not technically necessary. Maybe we could target the next version of docker SIG Node wants to certify for Kubernetes, which I expect won't be until after the CRI because default. If we know that's 1.13, then I can add that here, otherwise we could use a placeholder until it's decided.

`dockertools` implementation will not be changed.

Linking this change to the CRI means that Kubernetes users who care to test such
changes can test the combined changes at once. Users who do not care to test
such changes will be insulated by Kubernetes not recommending Docker >= 1.12
until after switching to the CRI.

Other changes that must be made to support this change:

1. Add a test to verify all containers restart if the infra container
responsible for the PodSandbox dies. (Note: With Docker 1.12 if the source
of the PID namespace dies all containers sharing that namespace are killed
as well.)
2. Modify the Infra container used by the Docker runtime to reap orphaned
zombies ([#36853](https://pr.k8s.io/36853)).

## Rollout Plan

SIG Node is planning to switch to the CRI as a default in 1.6, at which point
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We agreed to have shared pid namespace by default for CRI v1alpha1 interface. In this case, if the user use docker 1.12 or above, enabling shared pid namespace means early exposure to the end users is good thing for us to catch any potential issues.

On another side, I think there is a real concern that this is a behavior change for Kubernetes container runtime in production. Even other runtime implementations are having shared pid namespace by default today, but in production, the majority of Kubernetes users are not using such features, and we are not sure if this might cause any user issues. If any problem involving with shared pid namespace resulted in a CRI disabling might be too dramatic.

Can we introduce a flag to disable the shared pid namespace instead of disabling entire CRI?

cc/ @yujuhong @verb

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM, I'll update the proposal.

users with Docker >= 1.12 will receive a shared PID namespace by default.
Cluster administrators will be able to disable this behavior by providing a flag
to the kubelet which will cause the dockershim to revert to previous behavior.

The ability to disable shared PID namespaces is intended as a way to roll back
to prior behavior in the event of unforeseen problems. It won't be possible to
configure the behavior per-pod. We believe this is acceptable because:

* We have not identified a concrete use case requiring isolated PID namespaces.
* Making PID namespace configurable requires changing the CRI, which we would
like to avoid since there are no use cases.

In a future release, SIG Node will recommend docker >= 1.12. Unless a compelling
use case for isolated PID namespaces is discovered, we will remove the ability
to disable the shared PID namespace in the subsequent release.


[1]: https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/


<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/pod-pid-namespace.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->