Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Isolate kubelet from etcd #860

Merged

Conversation

smarterclayton
Copy link
Contributor

Discusses the current security risks posed by the kubelet->etcd pattern
and discusses some options.

Triggered by #846 and referenced in #859

Does not have to be merged, for discussion and review


## Preferred solution:

Implement the first parts of option 3 - an efficient watch API for the pod, service, and endpoints data for the Kubelet and Kube Proxy. Authorization and authentication are planned in the future - when a solution is available, implement a custom authorization scope that allows API access to be restricted to only the data about a single minion or the service endpoint data. Replace the event publishing mechanism in the kubelet with a polling mechanism or a simple API endpoint and guard it similarly to the other minion specific requests, and ensure the data is correctly attributed to the source. Make the apiserver stateless - this is already a desirable outcome.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in favor of this approach well. @jbeda and I were just suggesting migrating to watch on top of the API server today in a different context.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in favor of the result. I can't help but feel that etcd will need to solve this ANYWAY and the idea that we're both going to arrive at very similar results is somewhat annoying. I mean, etcd is an API, just a bit lower level. Can that be fixed with less net work?

@smarterclayton
Copy link
Contributor Author

@erictune @thockin any comments down this road?

3. Implement efficient "watch for changes over HTTP" to offer comparable function with etcd
4. Ensure that the apiserver can scale at or above the capacity of the etcd system.
5. Implement authorization scoping for the minions that limits the data they can view

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Teach etcd how to do identity and access-control directly.

@smarterclayton
Copy link
Contributor Author

I've updated with option 4, teach etcd how to do identity and acl. Of note is that it would be a different identity and acl solution than the apiserver, which means that auth(z|n) would have to be managed differently for those calls. It's also less advantageous for isolating the details of the data store from the implementation of the clients. I suspect option 3 is going to be useful in general for other people anyway since it allows apiserver clients to watch for changes to pods on minions for accounting / monitoring / service discovery purposes.

@smarterclayton
Copy link
Contributor Author

Is there any further feedback on this proposal? @lavalamp and I have sorted through most of the remaining issues for Pods, and before that final push I want agreement on the pattern.

@smarterclayton
Copy link
Contributor Author

One open question to @thockin - should what the kubelet "sees" of a pod be fundamentally different than what the client sets for a pod? We discussed in the past having two different ways of looking at pods, but then in other conversations I keep seeing things that are relevant for the kubelet to care about (labels, resourceVersion, id, etc). Assuming this is accepted, perhaps we should do a quick proposal on what the kubelet -> apiserver interface should be and get agreement on that as well prior to me closing out my issue.

@smarterclayton
Copy link
Contributor Author

Just going to add this rather than wait for comments... :)

Modeling the interface that the Kubelet would see to determine the block of pods scheduled for it:

  • The Kubelet needs the Manifest and the pod ID at a minimum, but probably would be better suited by seeing the pod template
  • There may be data that the Kubelet should not see about the pod
  • The set of pods that are scheduled on a minion is transactional - it must be possible to see them as a single unit, and it should be impossible for a kubelet to see a set of pods that have a constraint violation (meaning changes have to be delivered in order)
  • It must be efficient to retrieve and watch those pods

There are two primary ways to model that:

  1. Mimic the existing etcd model of a ContainerManifestList

    # returns an api.PodList or similar, with an atomic version stamp as ResourceVersion
    GET /minionPods/<host>
    # returns an api.PodList or similar, with an atomic version stamp as ResourceVersion
    GET /watch/minionPods/<host>
    
  2. Filter the pod list by the CurrentState.Host and watch on that same filter

    # returns an api.PodList with an atomic version stamp, but not all fields are set
    GET /pods?field=CurrentState.Host%3D<host>
    # returns events with singular pods
    GET /watch/pods?field=CurrentState.Host%3D<host>
    

The subtle difference between the two is that under the covers we are modeling the scheduled pods using a single atomic key so that we can impose constraints atomically. I think that's an important distinction - we guarantee under the covers that you see an atomic set of pods at all times. The latter proposal therefore has to convert an atomic list into a stream of update notifications, which requires it to keep in memory the previous state and do a delta between them. The latter also assumes that the Pods in the API are the same as the pods that are scheduled, and if there is every any delta between the two, then the Pods API would return different values for different states.

After having tried both implementations, I lean much more strongly towards the former. It allows the pods on the minion to be clearly versioned and watched (watch returns a PodList or similar, rather than. It more directly exposes the fact that scheduled pods are an atomic block, and matches the underlying model more closely. It also allows the API of /minionPods to vary from /pods and to be API versioned differently.

@thockin
Copy link
Member

thockin commented Sep 4, 2014

Reading now, sorry this one got lost.

On Wed, Sep 3, 2014 at 9:14 AM, Clayton Coleman notifications@github.com
wrote:

One open question to @thockin https://github.com/thockin - should what
the kubelet "sees" of a pod be fundamentally different than what the client
sets for a pod? We discussed in the past having two different ways of
looking at pods, but then in other conversations I keep seeing things that
are relevant for the kubelet to care about (labels, resourceVersion, id,
etc). Assuming this is accepted, perhaps we should do a quick proposal on
what the kubelet -> apiserver interface should be and get agreement on
that as well prior to me closing out my issue.

Reply to this email directly or view it on GitHub
#860 (comment)
.

@thockin
Copy link
Member

thockin commented Sep 4, 2014

re: API, just a quick sketch of how my brain wants the API to be factored.

https://github.com/thockin/kubernetes/compare/api_proposal

On Wed, Sep 3, 2014 at 10:39 PM, Tim Hockin thockin@google.com wrote:

On one hand, we have asked CoreOS folks for ACLs on etcd. On the other
hand, we're proposing to abstratct them away here. I don't want to take
this decision too lightly - it really has repercussions. On the other
hand, I am a big fan of controlling my own destiny.

Option 2 seems like the sort of thing etcd folks could help with. They
want to be able to serve this role. If they gave us a null-proxy and let
us on it, that makes option 2 a lot more plausible.
Perhaps that is how we implement (prototype) ACLs.

Option 3 is clearly the most work, but the most thorough abstraction.

The biggest argument against option 4 is that it leaves Kubelets coupled
to etcd. I'm not very worried about that, really. I mean, great if we
don't, but kubelet already supports an array of input modes.

I'm OK with option 3, but I want us to think really hard about how much
work it will be vs option 2 (or a hybrid of 2 & 4 (which I think the etcd
folks would be motivated to help with). This change sets a tone for the
role of etcd in kubernetes for the long term.

To the open question about what Kubelet sees of a pod. It is my belief
that we would do well to separate the structs that we receive from a user
from the structs we store over time from the structs we publish to kubelet.
They might overlap a lot or contain one another for convenience, but I
think they are different concepts that we model poorly today. I think this
can/should be fixed regardless of which approach we take on this proposal.

For example, I find it weird and annoying that creating a pod requires me
to say "desiredState". Of course it is the desired state. I find it weird
and annoying that we have this structure of Pod { PodState { Manifest } },
but we only show Manifest to Kubelet, which does not have proper ID or name.

I'd much rather show Kubelet a BoundPod structure, which looks a whole lot
like a Pod with some fields added and some removed.

I'd much rather a user POST a PodSpec structure, which looks like a
Manifest with some extra fields, and internally convert it to a PodState,
which was a superset of PodSpec. I am not sure what REST rules say about
that.

I can get more concrete if needed, I guess.

On Wed, Sep 3, 2014 at 9:59 PM, Tim Hockin thockin@google.com wrote:

Reading now, sorry this one got lost.

On Wed, Sep 3, 2014 at 9:14 AM, Clayton Coleman <notifications@github.com

wrote:

One open question to @thockin https://github.com/thockin - should
what the kubelet "sees" of a pod be fundamentally different than what the
client sets for a pod? We discussed in the past having two different ways
of looking at pods, but then in other conversations I keep seeing things
that are relevant for the kubelet to care about (labels, resourceVersion,
id, etc). Assuming this is accepted, perhaps we should do a quick proposal
on what the kubelet -> apiserver interface should be and get
agreement on that as well prior to me closing out my issue.

Reply to this email directly or view it on GitHub
#860 (comment)
.

@smarterclayton
Copy link
Contributor Author

Can't comment directly, doing it here

https://github.com/thockin/kubernetes/compare/api_proposal?diff=unified#diff-9ff06ad723720f9428a65da3710cf436R60

This seems like PodTemplate

https://github.com/thockin/kubernetes/compare/api_proposal?diff=unified#diff-9ff06ad723720f9428a65da3710cf436R76

Agree this is a good separation.

https://github.com/thockin/kubernetes/compare/api_proposal?diff=unified#diff-9ff06ad723720f9428a65da3710cf436R129

I think I'd been leaning this way as well, in which case we're talking about a distinct resource for pods anyway and we need to clearly call it out as separate. I don't see pods being bound to multiple servers in the Kube system (disagreement?).

In general I don't disagree with an alignment like that.

@bgrant0607
Copy link
Member

See also #1178 -- API support for diff'ing, which also includes some cleanup/refactoring. It covers all the objects, not just pods, and proposes nesting of JSONBase within a field and moving labels there, among other things.

The pod template issue is #170. I agree that PodSpec looks similar. I don't get why it includes JSONBase. It should be embedded into both Pod and PodTemplate.

Why is the manifest in PodStatus?

Does BoundPod really need to be different from Pod? I suppose eventually we'll want Kubelet to return different CurrentStatus for the pod than the apiserver. If we split desired and current state into completely separate messages/types, we'd only need to fork the current state types, but I suppose that would be less RESTful.

Ideally, Kubelet would follow the same API conventions as our other APIs, however. I'd like it to be possible to target "free-range" Kubelets with the config system.

@thockin
Copy link
Member

thockin commented Sep 5, 2014

It feels like we are converging on a largish API overhaul. Who is going to be responsible for owning the process of collecting changes, weighing options, and writing up a proposal? I'll do it, but I'm afraid I am already stretched thin...

@smarterclayton
Copy link
Contributor Author

I've got some bandwidth to do so. Right now I see the diff issue, name and namespace, pod -> minion, and resource version on all operations. Others?

@smarterclayton
Copy link
Contributor Author

And pod template

@lavalamp
Copy link
Member

lavalamp commented Sep 5, 2014

If we're doing a large API overhaul, we should cut version v1beta2 right now with our current api. (so that these big changes will land in v1beta3.)

I can improve the conversion functions to add any needed functionality, right now it's not super easy to move info between hierarchical levels.

@bgrant0607
Copy link
Member

@lavalamp What changes would go into v1beta2?

@dchen1107 also urgently needs some API changes.

@thockin
Copy link
Member

thockin commented Sep 6, 2014

OK, I annoint @smarterclayton as the cat-herder for API proposals for
v1beta3. I suggest we collect a handful of short design issues and then
start to work out a new version of api/types.go, and plans for how to make
each change.

On my list, so far

  • refactoring pods/manifests
  • distinguishing the thing we give to kubelet from the thing we hold in the
    API
  • distinguishing what we accept from users from the thing we hold in the API
  • names and IDs
  • a common pattern for "one of" plugins
  • clarify JSONBase wrt lists

We should set a relatively short time horizon to get ideas written up, some
window for discussion, and then done. We can not still be talking about
this come October.

On Fri, Sep 5, 2014 at 1:41 PM, bgrant0607 notifications@github.com wrote:

@lavalamp https://github.com/lavalamp What changes would go into
v1beta2?

@dchen1107 https://github.com/dchen1107 also urgently needs some API
changes.

Reply to this email directly or view it on GitHub
#860 (comment)
.

@smarterclayton
Copy link
Contributor Author

Discussion on each issue started next week, with references in Kubernetes-dev. Rough concrete examples of API syntax for folks to debate next week. Identification of anything out of scope next week. Debate and refinement into week after?

@bgrant0607 bgrant0607 added kind/design Categorizes issue or PR as related to design. kind/documentation Categorizes issue or PR as related to documentation. labels Sep 10, 2014
@bgrant0607 bgrant0607 added this to the v0.9 milestone Oct 4, 2014
@bgrant0607 bgrant0607 added area/api Indicates an issue on api area. area/security area/etcd labels Oct 4, 2014
@kubernetes-bot
Copy link

Can one of the admins verify this patch?

@bgrant0607 bgrant0607 added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Dec 3, 2014
@bgrant0607
Copy link
Member

Could we update this doc with the current plan and put it in the design doc directory?

@smarterclayton
Copy link
Contributor Author

Yes, will do.

Discusses the current security risks posed by the kubelet->etcd pattern
and discusses some options.

Triggered by kubernetes#846 and referenced in kubernetes#859
@smarterclayton
Copy link
Contributor Author

Updated (minor edits, can be more drastic if needed).

erictune added a commit that referenced this pull request Jan 6, 2015
@erictune erictune merged commit 58abb40 into kubernetes:master Jan 6, 2015
@smarterclayton smarterclayton deleted the isolate_kubelet_from_etcd branch February 11, 2015 02:21
vishh pushed a commit to vishh/kubernetes that referenced this pull request Apr 6, 2016
Always to a live lookup of version info instead of caching.
k8s-github-robot pushed a commit that referenced this pull request May 31, 2018
Automatic merge from submit-queue (batch tested with PRs 61803, 64305, 64170, 64361, 64339). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Improve the help of kubeadm completion

**What this PR does / why we need it**:
Add note that 'bash-completion' is required on Linux too.

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes [kubernetes/kubeadm/#860](kubernetes/kubeadm#860)

**Special notes for your reviewer**:
cc @neolit123

**Release note**:
```release-note
NONE
```
b3atlesfan pushed a commit to b3atlesfan/kubernetes that referenced this pull request Feb 5, 2021
Makefile: Cleanup, alpine and amd64 only UDP
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/api Indicates an issue on api area. area/etcd area/security kind/design Categorizes issue or PR as related to design. kind/documentation Categorizes issue or PR as related to documentation. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants