KEP: Node Readiness Gates #1003
Conversation
type Node struct {
…
Spec:
SystemExtensions []SystemExtension
I'll probably follow the same naming scheme as pod readiness gates and change this to something like
Spec:
ReadinessGates: []NodeReadinessGate
Thoughts @vishh?
The rationale for coming up with a generic name was to facilitate other use cases like identifying system addons for introspection/monitoring. If the naming scheme is too confusing, I don't mind changing it to be more specific.
I agree with @andrewsykim - Normalization helps. This would avoid confusion for users.
/sig scheduling for visibility
Thank you for your KEP.
/assign
The initial draft looks good. I'd say try implementing a prototype to ensure the design is implementable and then come back to refining this KEP further.
One of the goals for this proposal is to evaluate the viability and usefulness of combining Status and Health for extensions.
### Approach B - Use Readiness of Node extension pods (Preferred)
nit: Place this ahead of the previous approach.
I'm thinking of adding more detail/content to the preferred approach (approach B) and condensing the other ones under an "Alternatives" section. Outlining all the other approaches in as much detail as they have now makes the KEP difficult to read IMO. What do you think?
SGTM
Node selector will take into account the `Readiness` of all the pods that match the selector(s) specified in the Node Spec `SystemExtensions` field to determine Node Readiness.
nit: s/Node Selector/Node Controller/g
TODO

## Design Details
We also need to include a Production Maintenance section that talks about the health metrics and debugging tips needed to safely roll out and manage this feature in production.
Left some comments/suggestions
}

type SystemExtension struct {
Selector LabelSelector
Why?
I take back what I said, this would make the kubelet watch `priorityClass`.
type SystemExtension struct {
Selector LabelSelector
Namespace Namespace
AffectsReadiness bool // Ignore non-critical node extensions from node readiness
Is there a use case behind `AffectsReadiness`? (i.e. why reference a set of pods on the `nodeReadinessGate` if they do not affect node readiness)
The original use case was to identify all system pods on a node, including the ones whose health should not impact node health (like non-critical monitoring services, for example).
Node selector will take into account the `Readiness` of all the pods that match the selector(s) specified in the Node Spec `SystemExtensions` field to determine Node Readiness.

The individual system extensions are required to detect kubelet crash (or restart) quickly and reflect that in their Readiness.
If the Node Controller observes that all the system extension pods are Ready after the kubelet places the special meta readiness taint, it will remove that taint immediately.
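For concreteness, a rough sketch of the taint removal described above, assuming the "kubernetes.io/MetaReadiness" taint key that appears later in the KEP's journey section and the standard core/v1 types; the function and constant names here are illustrative, not part of the KEP:

```go
package sketch

import v1 "k8s.io/api/core/v1"

// Illustrative taint key; the journey section later in this thread spells it
// "kubernetes.io/MetaReadiness", and the name is still under discussion.
const metaReadinessTaintKey = "kubernetes.io/MetaReadiness"

// removeMetaReadinessTaint strips the readiness taint from a node so regular
// workloads can be scheduled. The controller would call this only after it has
// observed that every system extension pod matched by the node's
// SystemExtensions selectors reports Ready.
func removeMetaReadinessTaint(node *v1.Node) {
	kept := make([]v1.Taint, 0, len(node.Spec.Taints))
	for _, t := range node.Spec.Taints {
		if t.Key != metaReadinessTaintKey {
			kept = append(kept, t)
		}
	}
	node.Spec.Taints = kept
}
```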
s/Node Controller/NodeLifecycle Controller/g
How can we tolerate a subset of conditions?
What is the timeline for this? I want to review, but am buried.
@thockin enhancement freeze for v1.15 is tomorrow so I think this will have to wait until v1.16. I'll poke you for a review when it's at a better state and when we all have a bit more bandwidth :)
How to solve the race condition? Some critical pods might not have been created on the node before the kubelet or controller judges the node "Ready".
@DaiHao with the current proposal, existing node readiness will allow system pods to be scheduled, whereas application pods will be scheduled only after "meta readiness" is satisfied.
Handling of kubelet restarts can be tricky though since there is a chance that the extension pods may not reflect their connections with the kubelet soon after the kubelet restarts.
### Approach C - Use conditions
Can we also contrast using taints? As I understand it, Conditions was the plan, then we realized we needed a mechanism to tolerate certain Conditions, and so we introduced taints & tolerations (which are also generally useful).
Yes, taints will be used in the current preferred approach.
### Approach B - Use Readiness of Node extension pods (Preferred)
This approach assumes that all node-level extensions will be deployed as Kubernetes pods and that the health of those extension pods can be exposed via existing Readiness probes.
So it sounds like the model envisaged is pod-per-Node. I do think the selector would enable selection of a shared pod, but it isn't clear that a pod could express that some subset of nodes are bad and some are good.
I think anything that programs infrastructure would end up needing a "proxy pod" per node?
Yes, in an ideal world all critical system daemons would be run as pods (including k'let :) ). Until that happens, there can be placeholder pods that reflect the health of system daemons not run as pods for the purposes of monitoring/introspection.
...
}

type SystemExtension struct {
Presumably you need a merge-key in here (e.g. `name`)?
Thanks for the feedback, will be updating this KEP based on comments shortly.
The `ReadinessGates` field can be used by external components to dictate node readiness since they can apply an arbitrary readiness gate with a reserved label selector
that should not be used by any pods on the cluster. A common use case here is an external cloud provider that wants to ensure no pods are scheduled on a node until
networking for a given node is properly set up by the cloud provider. In this case, the cloud provider would apply any custom readiness gates to a node during registration
and a separate controller would remove the readiness gate when it sees that node's networking has been properly configured. No pods matching the label selector of that
This is kind of totally a gross hack. The cloud provider tells the node "I won't be ready until event X happens", knowing full well that event X will never happen, and then just says "never mind, forget I said that" when it's ready. It's not using the defined mechanism, it's abusing it.
And if it has permission to modify the Node object anyway, couldn't it just apply and then remove a taint directly rather than fiddling with the readiness gates?
> And if it has permission to modify the Node object anyway, couldn't it just apply and then remove a taint directly rather than fiddling with the readiness gates?
There's time between when the cloud provider can apply that initial taint and when the node is initialized. It's true that a cloud provider can apply custom taints to achieve the same behavior but it will always race against the scheduler for pods unless we add the gate during initialization.
> This is kind of totally a gross hack. The cloud provider tells the node "I won't be ready until event X happens", knowing full well that event X will never happen, and then just says "never mind, forget I said that" when it's ready. It's not using the defined mechanism, it's abusing it.
I have mixed feelings about this... yes it's not exactly using the feature the way it is fully intended but it satisfies most of the use-cases around readiness - which is waiting for system critical pods (usually DaemonSets) - while also allowing for the external cloud provider case.
I'm not opposed to a cloud provider specific integration for this where we allow the cloud provider to do initialization in 2 steps - one for node addresses, topology, etc and another for node readiness - but I would rather have a generic solution that fixes this for many other use-cases.
A readiness taint will be used to track the overall readiness of a node. The kubelet, node lifecycle controller, and other external controllers will
manage that taint.

The Node object will be extended to include a list of readiness gates like so:
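The code block that followed "like so:" is not part of this diff excerpt; the sketch below reassembles it from fragments quoted elsewhere in this thread (the `ReadinessGates` slice on the node spec, the per-gate `Name` field, and a per-gate pod label selector). The exact field set and tags are assumptions, not the KEP's verbatim text.

```go
package sketch

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// Reconstructed sketch assembled from fragments quoted in this review; not verbatim KEP text.
type NodeSpec struct {
	// ... existing NodeSpec fields ...

	// Defines the list of readiness gates for a node
	// +optional
	ReadinessGates []NodeReadinessGate `json:"readinessGates,omitempty" protobuf:"bytes,7,opt,name=readinessGates"`
}

// NodeReadinessGate indicates a set of pods that must be ready on a given node
// before the node itself is considered ready for regular workloads.
type NodeReadinessGate struct {
	// The name of the node readiness gate
	Name string `json:"name" protobuf:"bytes,1,opt,name=name"`
	// Selector for the pods whose readiness this gate tracks (assumed shape;
	// the text references a per-gate label selector but the excerpt does not show the field).
	Selector *metav1.LabelSelector `json:"selector,omitempty"`
}
```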
This allows for having completely unique readiness gates for each node, but it seems much more likely that you'd have a small set of distinct node types, where each type had its own readiness gates. And each node type would probably already be distinguishable by its labels. So having something that selects Nodes to apply readiness gates to them seems like it would be easier to manage than something that requires modifying the Nodes (especially since to avoid race conditions, the thing setting the `node.Spec.ReadinessGates` basically has to be in kubelet itself, whereas if you do it the other way, you can create the selectors well before creating the nodes).

And in fact, it seems like really we already have a thing that selects nodes that need special initialization: DaemonSets. You could just add the `BlocksReadiness` field to `DaemonSetSpec`, defaulting to `false`. And then a node would be "meta-ready" when all readiness-blocking DaemonSets that selected that node had successfully scheduled their pods to that node and the pods reported being ready.
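A minimal sketch of the DaemonSet-level alternative floated here; `BlocksReadiness` is a hypothetical field for discussion, not an existing Kubernetes API.

```go
package sketch

// Hypothetical extension of the apps/v1 DaemonSetSpec, per the suggestion above;
// not an actual Kubernetes API field.
type DaemonSetSpec struct {
	// ... existing DaemonSetSpec fields ...

	// BlocksReadiness marks this DaemonSet as node-readiness-blocking: a node
	// selected by this DaemonSet is not "meta-ready" until the DaemonSet's pod
	// on that node has been scheduled and reports Ready.
	// +optional
	BlocksReadiness bool `json:"blocksReadiness,omitempty"`
}
```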
> And in fact, it seems like really we already have a thing that selects nodes that need special initialization: DaemonSets. You could just add the BlocksReadiness field to DaemonSetSpec, defaulting to false. And then a node would be "meta-ready" when all readiness-blocking DaemonSets that selected that node had successfully scheduled their pods to that node and the pods reported being ready.
I like this idea but I do think there should still be a way to set the readiness gate outside of DaemonSets also. Maybe for an initial implementation we can only allow readiness gates from cloud providers and from DaemonSets and if other use-cases come up we can add support later?
Should something like `blocksReadiness` on a DaemonSet be allowed only on namespaces like `kube-system`? With this, any user with access to DaemonSets could prevent pods from scheduling on the entire cluster.
Ah... yeah, there would need to be some sort of protection there, but it can't be as simple as "limited to `kube-system`". I'm not sure what approach would be consistent with other features. ("Limited to namespaces with the annotation `X`"?)
// The name of the node readiness gate
// +optional
Name string `json:"name" protobuf:"bytes,1,opt,name=name"`
// Whether a node is ready based on it's corresponding readiness gates
"its". But also, "gates" plural makes it sound like this is the aggregated ready state, but it's not. "Whether the node is ready based on this readiness gate
".
And neither field should be +optional
right?
Good catch, updated.
One limitation in this design is that all system critical pods that should gate the readiness of a node must be updated to tolerate the readiness taint. Otherwise,
those pods cannot be scheduled on a node since the readiness taint would already be applied by the kubelet on registration. If users forget to apply this toleration,
they may accidentally gate the readiness of their nodes, preventing workloads from being scheduled there.
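For illustration, the toleration such a readiness-gating pod (or its DaemonSet pod template) would need, assuming the "kubernetes.io/MetaReadiness" / NoSchedule taint described in the journey section below; this is a sketch, not text from the KEP.

```go
package sketch

import v1 "k8s.io/api/core/v1"

// Toleration a node-readiness-gating pod would carry so it can still be scheduled
// while the readiness taint is present on the node. The key follows the KEP's
// journey section; the naming is still debated elsewhere in this thread.
var metaReadinessToleration = v1.Toleration{
	Key:      "kubernetes.io/MetaReadiness",
	Operator: v1.TolerationOpExists,
	Effect:   v1.TaintEffectNoSchedule,
}
```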
It seems as though any pod that blocks node readiness would almost necessarily have to tolerate all taints... Otherwise, e.g., running low on disk space would cause all the node-readiness-gating pods to be evicted, which would then cause the node to become meta-unready, which would then cause most of the other pods to be evicted (even if they explicitly indicated that they can tolerate low disk space).
IME system critical pods tend to already tolerate all taints - but agreed this can be unexpected in many cases. If we specify readiness gate at the DaemonSet level like you suggested, we could set the right toleration when the DaemonSet is created? That doesn't sound ideal either because a DaemonSet would end up tolerating taints it shouldn't.
> If we specify readiness gate at the DaemonSet level like you suggested, we could set the right toleration when the DaemonSet is created?
Or just require the user to create it, and reject it in validation if they didn't.
> That doesn't sound ideal either because a DaemonSet would end up tolerating taints it shouldn't.
Maybe there are use cases for a node-readiness-blocking pod to tolerate some but not all taints. Anyway, the validation check could just check that it tolerates at least the meta-ready taint, rather than that it tolerates all taints.
Validation for tolerations sgtm
The following journey attempts to illustrate this solution in a bit more detail:

1. Kubelet starts up on a node and immediately taints the node with a special “kubernetes.io/MetaReadiness” taint with an effect of “NoSchedule”, in addition to flipping its “Ready” condition to true (when appropriate).
Maybe "ApplicationReadiness" rather than "MetaReadiness"?
and not NodeReadiness?
@andrewsykim do you think the KEP could be approved for the next release?
First, I agree with the premise of this. I have many thoughts about the design.
Second, it might be nice to enumerate even more concretely the use cases. Which things that we currently do would be replaced by this? Which things that people WANT to do would this enable? The scope of the list will help justify the complexity tradeoffs.
It's not quite like pod readiness, which uses a list of condition names. Why not? E.g. Every component registers a condition name in node.spec.readinessGates. Kubelet applies a taint until all conditions are present and True. This would force daemonsets to have a cluster-controller that tests whether that DS is present and ready on a given node and update the condition. Kind of bleh. As a refinement, each readiness gate could be a condition name OR a pod selector (or the name of a daemonset, which already has a selector).
I don't think modifying readiness gates on the fly for non-daemonset conditions is going to fly past API review.
As an alternative, why do we need more than taints? Let each readiness gate apply a taint on the node. NetworkNotReady, LoggingNotEnabled, etc. Why is that not sufficient?
Is this just about Node startup or are we talking about a real-time signal? E.g. is it a one-way, latched transition (node starts unready and becomes ready) or bi-directional (continuously evaluated)? If a critical daemonset becomes unready, does the taint come back? What happens to running pods which are, presumably, malfunctioning now?
It's much simpler if we assume, like pod readiness, that it's a "birth time" one-way flip. Can we get away with that?
I have to admit - you lost me with the discussion of special leases.
Is this something we want to press on in the new year? Maybe we should do a realtime discussion on it?
Hi @thockin thank you for having a look at this KEP. We are very interested in this feature (we are currently working on using https://github.com/uswitch/nidhogg, which works fine but requires some work to manage the scale we run at). A few example use-cases for us:
InitContainers failing is not a very big problem because they work after a few retries. The other 2 are harder and several applications implement retry logic to address this (which of course they will have to do anyway in case the pod providing the service fails during the application lifetime). The hardest scenario is missing cloud credentials because most SDKs will consider they don't exist and will continue without looking for them (if such a pod fails later, it's less of a problem because credentials are valid for some time and SDKs will retry). Having a solution for node start-up only would address most of our problems.
ReadinessGates []NodeReadinessGate `json:"readinessGates,omitempty" protobuf:"bytes,7,opt,name=readinessGates"`
}

// NodeReadinessGate indicates a set of pods that must be ready on a given node before the
Since the selector may select more than one pod, what is the expected behavior if not all of those selected pods are ready?
Is the gate ready if at least one of the selected pods is ready?
Or if all of the selected pods are ready?
If it's all, then this requires first calculating the set of system-critical daemonset pods that are expected to be scheduled on the node, before trying to evaluate the node readiness. And kubelet can't do that, only the scheduler can. So the all semantics would require the cooperation of the scheduler to determine node readiness.
In any case, this has to be clarified here.
}

For each `NodeReadinessGate`, kubelet will fetch the state of all pods that match its label selector and take the readiness state of those pods into account.
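As one concrete reading of that sentence, the per-gate evaluation in the kubelet might look like the sketch below. Whether a gate needs all matching pods ready or just one, and what an empty match means, are exactly the questions raised in the comments here, so the sketch flags them rather than settling them; the helper names are illustrative.

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// gateSatisfied evaluates one readiness gate against the pods the kubelet
// currently knows about on this node.
func gateSatisfied(gate *metav1.LabelSelector, podsOnNode []v1.Pod) (bool, error) {
	sel, err := metav1.LabelSelectorAsSelector(gate)
	if err != nil {
		return false, err
	}
	matched := 0
	for i := range podsOnNode {
		p := &podsOnNode[i]
		if !sel.Matches(labels.Set(p.Labels)) {
			continue
		}
		matched++
		if !isPodReady(p) {
			// Assumes "all matching pods must be Ready"; whether "any" would
			// suffice is an open question in this review.
			return false, nil
		}
	}
	// Also open: is a gate with zero matching pods satisfied? Here we say no,
	// since the expected pod may simply not have been scheduled yet.
	return matched > 0, nil
}

func isPodReady(p *v1.Pod) bool {
	for _, c := range p.Status.Conditions {
		if c.Type == v1.PodReady {
			return c.Status == v1.ConditionTrue
		}
	}
	return false
}
```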
I don't think kubelet alone can determine the readiness, because readiness will depend on the set of system-critical daemonset pods that are to be scheduled on the node, and that is determined by the scheduler.
So I think this requires cooperation from the scheduler.
Otherwise, kubelet may declare the node ready even though a critical daemonset has not yet been scheduled to run on the node.
...
// Defines the list of readiness gates for a node
// +optional
ReadinessGates []NodeReadinessGate `json:"readinessGates,omitempty" protobuf:"bytes,7,opt,name=readinessGates"`
If we ignore the implementation details for a moment, this API is not necessary.
The set of pods to wait for is already clearly specified by users: it is the set of pods with a `system-cluster-critical` or `system-node-critical` `priorityClass`. We shouldn't need any additional input.
I feel that this `ReadinessGates` API would just be used to replicate the information already specified in `priorityClass`: if a pod has a critical `priorityClass` it should be matched by a `ReadinessGates`, and vice-versa I can't think of a case where a pod matched by a `ReadinessGates` shouldn't have a critical `priorityClass`.
It seems that introducing `ReadinessGates` is driven by the desire to implement this 100% in `kubelet`. However, if the implementation involved the scheduler to determine the set of pods with a critical `priorityClass` to be scheduled on the node, that new API wouldn't be necessary.
I discussed this with @vishh. I was making the assumption that we'd have to deal with only one scheduler, because I was considering only daemonset pods. But if we want to deal with pods other than daemonsets, which are scheduled by different schedulers, then this becomes complicated because all schedulers would need to support this logic.
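For concreteness, the priorityClass-based alternative raised above might reduce to a filter like the sketch below; `system-node-critical` and `system-cluster-critical` are the built-in critical priority class names, and the function name is illustrative.

```go
package sketch

import v1 "k8s.io/api/core/v1"

// gatesNodeReadiness reports whether a pod would gate node readiness under the
// priorityClass-based alternative: no new API field, just the existing built-in
// critical priority classes.
func gatesNodeReadiness(p *v1.Pod) bool {
	switch p.Spec.PriorityClassName {
	case "system-node-critical", "system-cluster-critical":
		return true
	}
	return false
}
```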
This needs to be fleshed out more, I'll try to get to it in the next week or so but no guarantees. If other folks want to contribute please reach out :)
I've been steeping on this for a bit and I think a "conditions/taints hook" from the cloud controllers is sufficient for the use-cases we need (at least for the cloud providers). The most common use-case I hear is preventing nodes from accepting new pods before the cloud provider registers routes for that node. We can allow cloud providers to add a "no routes" taint (or use the existing NetworkUnavailable node condition) assuming the routes controller will remove it later. Worth noting that the GCP provider does this today but via the kubelet. We should make this functionality available to all the external providers. I will open a separate KEP for this use-case. Cloud providers aside, it seems like there are other valid use-cases (e.g. waiting for system critical pods) worth considering, but all of these can be addressed today by passing a custom taint to kubelet (via
/close
@andrewsykim: Closed this PR.
Signed-off-by: Andrew Sy Kim kiman@vmware.com
Co-authored-by: Vish Kannan vishnuk@google.com
Porting over the KEP for node readiness gates that @vishh worked on here kubernetes/community#2640. I'll be updating it with his guidance in the next few days so I wanted to open a WIP PR in case folks wanted to provide early feedback.
/sig node
cc @vishh @chenk008 @yastij