
KEP: Node Readiness Gates #1003

Closed

Conversation

andrewsykim
Member

@andrewsykim andrewsykim commented Apr 25, 2019

Signed-off-by: Andrew Sy Kim kiman@vmware.com
Co-authored-by: Vish Kannan vishnuk@google.com

Porting over the KEP for node readiness gates that @vishh worked on here kubernetes/community#2640. I'll be updating it with his guidance in the next few days so I wanted to open a WIP PR in case folks wanted to provide early feedback.

/sig node
cc @vishh @chenk008 @yastij

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. sig/node Categorizes an issue or PR as relevant to SIG Node. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 25, 2019
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Apr 25, 2019
@andrewsykim andrewsykim force-pushed the node-readiness-gates branch from 8036265 to 8bc2200 Compare April 25, 2019 22:47
type Node struct {
Spec:
SystemExtensions []SystemExtension
Member Author

@andrewsykim andrewsykim Apr 25, 2019

I'll probably follow the same naming scheme as pod readiness gates and change this to something like

Spec:
   ReadinessGates: []NodeReadinessGate

Thoughts @vishh?

Contributor

The rationale for coming up with a generic name was to facilitate other use cases like identifying system addons for introspection/monitoring. If the naming scheme is too confusing, I don't mind changing it to be more specific.

Member

I agree with @andrewsykim - Normalization helps. This would avoid confusion for users.
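For concreteness, a minimal Go sketch of the rename under discussion, mirroring the pod readiness gates naming. All type and field shapes here are illustrative placeholders, not the final API:

```go
package main

import "fmt"

// NodeReadinessGate is a hypothetical shape for the proposed rename of
// SystemExtension; the selector matches the pods that gate node readiness.
type NodeReadinessGate struct {
	Selector map[string]string
}

// NodeSpec sketches the renamed field, Spec.ReadinessGates, to parallel
// pod readiness gates.
type NodeSpec struct {
	ReadinessGates []NodeReadinessGate
}

type Node struct {
	Spec NodeSpec
}

// newNodeWithGate builds a Node gated on pods matching the given labels.
func newNodeWithGate(selector map[string]string) Node {
	return Node{Spec: NodeSpec{ReadinessGates: []NodeReadinessGate{{Selector: selector}}}}
}

func main() {
	n := newNodeWithGate(map[string]string{"k8s-app": "kube-proxy"})
	fmt.Println(len(n.Spec.ReadinessGates)) // prints 1
}
```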

@yastij
Member

yastij commented Apr 26, 2019

/sig scheduling
/assign

for visibility

@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Apr 26, 2019
@chenk008
Contributor

Thank you for your KEP.
In the scenario described in #76946, the node's readiness depends on the CCM's route_controller.
With Approach B, we would need a pod to determine whether the route was created successfully.
I think Approach D LGTM.

@vishh
Contributor

vishh commented Apr 26, 2019

/assign

Contributor

@vishh vishh left a comment

The initial draft looks good. I'd say try implementing a prototype to ensure the design is implementable and then come back to refining this KEP further.


One of the goals for this proposal is to evaluate the viability and usefulness of combining Status and Health for extensions.

### Approach B - Use Readiness of Node extension pods (Preferred)
Contributor

nit: Place this ahead of the previous approach.

Member Author

I'm thinking of adding more details/content to the preferred approach (approach B) and condensing the other ones under an "Alternatives" section. Outlining all the other approaches in as much detail as we do now makes the KEP difficult to read, IMO. What do you think?

Contributor

SGTM


}
```

Node selector will take into account the `Readiness` of all the pods that match the selector(s) specified in the Node Spec `SystemExtensions` field to determine Node Readiness.
Contributor

nit: s/Node Selector/Node Controller/g


TODO

## Design Details
Contributor

We also need to include a Production Maintenance section that talks about the health metrics and debugging tips to safely rollout and manage this feature in production.

Member

@yastij yastij left a comment

Left some comments/suggestions


}

type SystemExtension struct {
Selector LabelSelector
Member

I'd like us to discuss/prototype relying on higher abstraction to select pods like priorityClass (e.g. Gate node readiness with pods that have node-critical priorityClass) @vishh

cc @bsalamat

Contributor

Why?

Member

I take back what I said, this would make the kubelet watch priorityClass.

type SystemExtension struct {
Selector LabelSelector
Namespace Namespace
AffectsReadiness bool // Ignore non-critical node extensions from node readiness
Member

Is there a use case behind AffectsReadiness? (i.e. why reference a set of pods in the nodeReadinessGate if they do not affect node readiness?)

Contributor

The original use case was to identify all system pods on a node including the ones whose health should not impact node health (like non-critical monitoring services for example)

Node selector will take into account the `Readiness` of all the pods that match the selector(s) specified in the Node Spec `SystemExtensions` field to determine Node Readiness.

The individual system extensions are required to detect kubelet crash (or restart) quickly and reflect that in their Readiness.
If the Node Controller observes that all the system extension pods are Ready after the kubelet places the special meta readiness taint, it will remove that taint immediately.
Member

s/Node Controller/NodeLifecycle Controller/g

Member

How can we tolerate a subset of conditions?
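The taint-removal behavior described in the quoted text can be sketched as a small check in Go. This is a minimal sketch, assuming placeholder type names and the KEP's placeholder taint key, not shipped API:

```go
package main

import "fmt"

// Minimal stand-ins for the API types involved.
type Taint struct{ Key, Effect string }

type Pod struct {
	Name  string
	Ready bool
}

// Placeholder key from the KEP's journey; not a real upstream taint.
const metaReadinessTaint = "kubernetes.io/MetaReadiness"

// shouldRemoveReadinessTaint sketches the NodeLifecycle Controller check:
// the meta readiness taint is removed only once every matched extension pod
// reports Ready.
func shouldRemoveReadinessTaint(taints []Taint, extensionPods []Pod) bool {
	hasTaint := false
	for _, t := range taints {
		if t.Key == metaReadinessTaint {
			hasTaint = true
		}
	}
	if !hasTaint {
		return false // nothing to remove
	}
	for _, p := range extensionPods {
		if !p.Ready {
			return false // at least one extension is still unready
		}
	}
	return true
}

func main() {
	taints := []Taint{{Key: metaReadinessTaint, Effect: "NoSchedule"}}
	pods := []Pod{{Name: "cni-agent", Ready: true}, {Name: "kube-proxy", Ready: true}}
	fmt.Println(shouldRemoveReadinessTaint(taints, pods)) // true
}
```

Tolerating a subset of conditions would presumably require per-gate rather than all-or-nothing evaluation, which this all-pods check does not express.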

@thockin
Member

thockin commented Apr 29, 2019

What is the timeline for this? I want to review, but am buried.

@andrewsykim
Member Author

@thockin enhancement freeze for v1.15 is tomorrow so I think this will have to wait til v1.16. I'll poke you for a review when it's at a better state and when we all have a bit more bandwidth :)

@DaiHao

DaiHao commented May 6, 2019

How do we solve the race condition? Some critical pods might not have been created on the node before the kubelet or controller judges the node "Ready".

@vishh
Contributor

vishh commented May 6, 2019

@DaiHao with the current proposal, existing node readiness will allow system pods to be scheduled, whereas application pods will be scheduled only after "meta readiness" is satisfied.


Handling of kubelet restarts can be tricky though since there is a chance that the extension pods may not reflect their connections with the kubelet soon after the kubelet restarts.

### Approach C - Use conditions
Member

Can we also contrast using taints? As I understand it, Conditions was the plan, then we realized we needed a mechanism to tolerate certain Conditions, and so we introduced taints & tolerations (which are also generally useful).

Member Author

Yes, taints will be used in the current preferred approach.


### Approach B - Use Readiness of Node extension pods (Preferred)

This approach assumes that all node level extensions will get deployed as kubernetes pods and that the health of those extension pods can be exposed via existing Readiness probes.
Member

So it sounds like the model envisaged is pod-per-Node. I do think the selector would enable selection of a shared pod, but it isn't clear that a pod could express that some subset of nodes are bad and some are good.

I think anything that programs infrastructure would end up needing a "proxy pod" per node?

Contributor

Yes, in an ideal world all critical system daemons would be run as pods (including k'let :) ). Until that happens, there can be placeholder pods that reflect the health of system daemons not run as pods for the purposes of monitoring/introspection.

...
}

type SystemExtension struct {
Member

Presumably you need a merge-key in here (e.g. name)?

@andrewsykim
Member Author

Thanks for the feedback, will be updating this KEP based on comments shortly.

@andrewsykim andrewsykim force-pushed the node-readiness-gates branch from 8bc2200 to 0cb9434 Compare July 3, 2019 18:20
@andrewsykim andrewsykim force-pushed the node-readiness-gates branch from 54190a2 to ce535ce Compare July 9, 2019 15:23
@thockin thockin self-assigned this Jul 11, 2019
The `ReadinessGates` field can be used by external components to dictate node readiness since they can apply an arbitrary readiness gate with a reserved label selector
that should not be used by any pods on the cluster. A common use case here is an external cloud provider that wants to ensure no pods are scheduled on a node until
networking for a given node is properly set up by the cloud provider. In this case, the cloud provider would apply any custom readiness gates to a node during registration
and a separate controller would remove the readiness gate when it sees that node's networking has been properly configured. No pods matching the label selector of that
Contributor

This is kind of totally a gross hack. The cloud provider tells the node "I won't be ready until event X happens", knowing full well that event X will never happen, and then just says "never mind, forget I said that" when it's ready. It's not using the defined mechanism, it's abusing it.

And if it has permission to modify the Node object anyway, couldn't it just apply and then remove a taint directly rather than fiddling with the readiness gates?

Member Author

@andrewsykim andrewsykim Jul 26, 2019

And if it has permission to modify the Node object anyway, couldn't it just apply and then remove a taint directly rather than fiddling with the readiness gates?

There's time between when the cloud provider can apply that initial taint and when the node is initialized. It's true that a cloud provider can apply custom taints to achieve the same behavior but it will always race against the scheduler for pods unless we add the gate during initialization.

Member Author

This is kind of totally a gross hack. The cloud provider tells the node "I won't be ready until event X happens", knowing full well that event X will never happen, and then just says "never mind, forget I said that" when it's ready. It's not using the defined mechanism, it's abusing it.

I have mixed feelings about this... yes it's not exactly using the feature the way it is fully intended but it satisfies most of the use-cases around readiness - which is waiting for system critical pods (usually DaemonSets) - while also allowing for the external cloud provider case.

Member Author

I'm not opposed to a cloud provider specific integration for this where we allow the cloud provider to do initialization in 2 steps - one for node addresses, topology, etc and another for node readiness - but I would rather have a generic solution that fixes this for many other use-cases.
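The two-actor flow being debated, where the cloud provider adds a gate with a reserved selector at registration and a separate controller clears it once networking is set up, can be sketched like so. The gate name here is hypothetical:

```go
package main

import "fmt"

// node is a toy stand-in tracking which readiness gates still block scheduling.
type node struct {
	gates map[string]bool // gate name -> still blocking
}

// registerNode models the cloud provider adding a reserved gate during node
// initialization, before the scheduler can race to place pods.
func registerNode() *node {
	return &node{gates: map[string]bool{"cloud.example.com/routes-configured": true}}
}

// routesConfigured models a separate controller clearing the gate once the
// node's routes exist.
func (n *node) routesConfigured() {
	delete(n.gates, "cloud.example.com/routes-configured")
}

// schedulable reports whether any gate still blocks application pods.
func (n *node) schedulable() bool { return len(n.gates) == 0 }

func main() {
	n := registerNode()
	fmt.Println(n.schedulable()) // false: the gate blocks scheduling
	n.routesConfigured()
	fmt.Println(n.schedulable()) // true
}
```

This makes the "gross hack" concern above visible: the gate's selector is never expected to match a pod; removal of the gate itself is the real signal.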

A readiness taint will be used to track the overall readiness of a node. The kubelet, node lifecycle controller, other external controllers will
manage that taint.

The Node object will be extended to include a list of readiness gates like so:
Contributor

This allows for having completely unique readiness gates for each node, but it seems much more likely that you'd have a small set of distinct node types, where each type had its own readiness gates. And each node type would probably already be distinguishable by its labels. So having something that selects Nodes to apply readiness gates to them seems like it would be easier to manage than something that requires modifying the Nodes (especially since to avoid race conditions, the thing setting the node.Spec.ReadinessGates basically has to be in kubelet itself, whereas if you do it the other way, you can create the selectors well before creating the nodes).

And in fact, it seems like really we already have a thing that selects nodes that need special initialization: DaemonSets. You could just add the BlocksReadiness field to DaemonSetSpec, defaulting to false. And then a node would be "meta-ready" when all readiness-blocking DaemonSets that selected that node had successfully scheduled their pods to that node and the pods reported being ready.

Member Author

And in fact, it seems like really we already have a thing that selects nodes that need special initialization: DaemonSets. You could just add the BlocksReadiness field to DaemonSetSpec, defaulting to false. And then a node would be "meta-ready" when all readiness-blocking DaemonSets that selected that node had successfully scheduled their pods to that node and the pods reported being ready.

I like this idea but I do think there should still be a way to set the readiness gate outside of DaemonSets also. Maybe for an initial implementation we can only allow readiness gates from cloud providers and from DaemonSets and if other use-cases come up we can add support later?

Member Author

Should something like blocksReadiness on a DaemonSet be allowed only on namespaces like kube-system? With this, any user with access to DaemonSets could prevent pods from scheduling on the entire cluster.

Contributor

Ah... yeah, there would need to be some sort of protection there, but it can't be as simple as "limited to kube-system". I'm not sure what approach would be consistent with other features. ("Limited to namespaces with the annotation X?")
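To make the DaemonSet alternative concrete, here is a self-contained Go sketch of the hypothetical `BlocksReadiness` semantics: a node is "meta-ready" once every readiness-blocking DaemonSet that selects it has a Ready pod there. Field and type names are illustrative only:

```go
package main

import "fmt"

// DaemonSet carries the hypothetical BlocksReadiness field proposed above.
type DaemonSet struct {
	Name            string
	BlocksReadiness bool
	NodeSelector    map[string]string // empty selector selects every node
}

// nodeState is a toy view of one node.
type nodeState struct {
	Labels    map[string]string
	ReadyPods map[string]bool // DaemonSet name -> its pod on this node is Ready
}

// selects reports whether the DaemonSet's node selector matches the labels.
func selects(ds DaemonSet, labels map[string]string) bool {
	for k, v := range ds.NodeSelector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// metaReady is true once every readiness-blocking DaemonSet selecting this
// node has a Ready pod on it.
func metaReady(node nodeState, sets []DaemonSet) bool {
	for _, ds := range sets {
		if ds.BlocksReadiness && selects(ds, node.Labels) && !node.ReadyPods[ds.Name] {
			return false
		}
	}
	return true
}

func main() {
	sets := []DaemonSet{
		{Name: "cni", BlocksReadiness: true},
		{Name: "gpu-driver", BlocksReadiness: true, NodeSelector: map[string]string{"gpu": "true"}},
	}
	n := nodeState{Labels: map[string]string{}, ReadyPods: map[string]bool{"cni": true}}
	fmt.Println(metaReady(n, sets)) // true: the gpu DaemonSet does not select this node
}
```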

// The name of the node readiness gate
// +optional
Name string `json:"name" protobuf:"bytes,1,opt,name=name"`
// Whether a node is ready based on it's corresponding readiness gates
Contributor

"its". But also, "gates" plural makes it sound like this is the aggregated ready state, but it's not. "Whether the node is ready based on this readiness gate".
And neither field should be +optional right?

Member Author

Good catch, updated.


One limitation in this design is that all system critical pods that should gate the readiness of a node must be updated to tolerate the readiness taint. Otherwise,
those pods cannot be scheduled on a node since the readiness taint would already be applied by the kubelet on registration. If users forget to apply this toleration,
they may accidentally gate the readiness of their nodes, preventing workloads from being scheduled there.
Contributor

It seems as though any pod that blocks node readiness would almost necessarily have to tolerate all taints... Otherwise, eg, running low on disk space would cause all the node-readiness-gating pods to be evicted, which would then cause the node to become meta-unready, which would then cause most of the other pods to be evicted (even if they explicitly indicated that they can tolerate low disk space).

Member Author

IME system critical pods tend to already tolerate all taints - but agreed this can be unexpected in many cases. If we specify readiness gate at the DaemonSet level like you suggested, we could set the right toleration when the DaemonSet is created? That doesn't sound ideal either because a DaemonSet would end up tolerating taints it shouldn't.

Contributor

If we specify readiness gate at the DaemonSet level like you suggested, we could set the right toleration when the DaemonSet is created?

Or just require the user to create it, and reject it in validation if they didn't.

That doesn't sound ideal either because a DaemonSet would end up tolerating taints it shouldn't.

Maybe there are use cases for a node-readiness-blocking pod to tolerate some but not all taints. Anyway, the validation check could just check that it tolerates at least the meta-ready taint, rather than that it tolerates all taints.

Member Author

Validation for tolerations sgtm
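The validation idea discussed above might look roughly like the following sketch: a readiness-blocking DaemonSet is rejected unless its pod template tolerates at least the meta-readiness taint (not necessarily all taints). The taint key is the KEP's placeholder:

```go
package main

import (
	"errors"
	"fmt"
)

// Toleration is a minimal stand-in; an empty Key tolerates everything,
// mirroring the core API's wildcard behavior.
type Toleration struct{ Key string }

// Placeholder key from the KEP's journey; not a real upstream taint.
const metaReadinessTaint = "kubernetes.io/MetaReadiness"

// validateReadinessBlocking sketches the admission-time check: accept only if
// the pod template tolerates the meta-readiness taint (or all taints).
func validateReadinessBlocking(tolerations []Toleration) error {
	for _, t := range tolerations {
		if t.Key == metaReadinessTaint || t.Key == "" {
			return nil
		}
	}
	return errors.New("readiness-blocking DaemonSet must tolerate " + metaReadinessTaint)
}

func main() {
	err := validateReadinessBlocking([]Toleration{{Key: "node.kubernetes.io/disk-pressure"}})
	fmt.Println(err != nil) // true: rejected in validation
}
```

Requiring only the one toleration preserves the use case raised above, where a readiness-blocking pod tolerates some but not all taints.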


The following journey attempts to illustrate this solution a bit more in detail:

1. Kubelet starts up on a node and immediately taints the node with a special “kubernetes.io/MetaReadiness” taint with an effect of “NoSchedule”, in addition to flipping its “Ready” condition to true (when appropriate).
Contributor

Maybe "ApplicationReadiness" rather than "MetaReadiness"?


and not NodeReadiness?

Signed-off-by: Andrew Sy Kim <kiman@vmware.com>
Co-authored-by: Vish Kannan <vishnuk@google.com>
@lbernail

lbernail commented Oct 4, 2019

@andrewsykim do you think the KEP could be approved for the next release?

Member

@thockin thockin left a comment

First, I agree with the premise of this. I have many thoughts about the design.

Second, it might be nice to enumerate even more concretely the use cases. Which things that we currently do would be replaced by this? Which things that people WANT to do would this enable? The scope of the list will help justify the complexity tradeoffs.

It's not quite like pod readiness, which uses a list of condition names. Why not? E.g. Every component registers a condition name in node.spec.readinessGates. Kubelet applies a taint until all conditions are present and True. This would force daemonsets to have a cluster-controller that tests whether that DS is present and ready on a given node and update the condition. Kind of bleh. As a refinement, each readiness gate could be a condition name OR a pod selector (or the name of a daemonset, which already has a selector).

I don't think modifying readiness gates on the fly for non-daemonset conditions is going to fly past API review.

As an alternative, why do we need more than taints? Let each readiness gate apply a taint on the node. NetworkNotReady, LoggingNotEnabled, etc. Why is that not sufficient?

Is this just about Node startup or are we talking about a real-time signal? E.g. is it a one-way, latched transition (node starts unready and becomes ready) or bi-directional (continuously evaluated)? If a critical daemonset becomes unready, does the taint come back? What happens to running pods which are, presumably, malfunctioning now?

It's much simpler if we assume, like pod readiness, that it's a "birth time" one-way flip. Can we get away with that?

I have to admit - you lost me with the discussion of special leases.

Is this something we want to press on in the new year? Maybe we should do a realtime discussion on it?

@lbernail

Hi @thockin thank you for having a look at this KEP. We are very interested in this feature (we are currently working on using https://github.com/uswitch/nidhogg, which works fine but requires some work to manage the scale we run at).

A few example use-cases for us:

  • initContainers failing to work because kube-proxy / node-local-dns is not working yet
  • missing cloud credentials at application startup (for kube2iam/kiam and similar solutions)
  • missing traces / metrics at application startups (for applications sending those to an agent running as a daemonset)

InitContainers failing is not a very big problem because they work after a few retries. The other two are harder, and several applications implement retry logic to address this (which of course they will have to do anyway in case the pod providing the service fails during the application lifetime). The hardest scenario is missing cloud credentials, because most SDKs will conclude the credentials don't exist and will continue without looking for them (if such a pod fails later, it's less of a problem because credentials are valid for some time and SDKs will retry).

Having a solution for node start-up only would address most of our problems

ReadinessGates []NodeReadinessGate `json:"readinessGates,omitempty" protobuf:"bytes,7,opt,name=readinessGates"`
}

// NodeReadinessGate indicates a set of pods that must be ready on a given node before the

Since the selector may select more than one pod, what is the expected behavior if not all of those selected pods are ready?
Is the gate ready if at least one of the selected pods is ready?
Or if all of the selected pods are ready?

If it's all, then this requires first calculating the set of system-critical daemonset pods that are expected to be scheduled on the node, before trying to evaluate the node readiness. And kubelet can't do that, only the scheduler can. So the all semantics would require the cooperation of the scheduler to determine node readiness.

In any case, this has to be clarified here.

}
```

For each `NodeReadinessGate`, kubelet will fetch the state of all pods that match its label selector and take the readiness state of those pods into account.

I don't think kubelet alone can determine the readiness, because readiness will depend on the set of system-critical daemonset pods that are to be scheduled on the node, and that is determined by the scheduler.
So I think this requires cooperation from the scheduler.
Otherwise, kubelet may declare the node ready even though a critical daemonset has not yet been scheduled to run on the node.
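A sketch of the "all matched pods must be Ready" semantics illustrates the limitation raised here: the check below only sees pods already bound to the node, so critical pods the scheduler has not yet placed are invisible to it. Names are illustrative:

```go
package main

import "fmt"

type Pod struct {
	Labels map[string]string
	Ready  bool
}

// matches is a simplified label-selector match (equality-based only).
func matches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// gateSatisfied implements the "all" semantics: every bound pod matching the
// selector must be Ready, and at least one matching pod must exist. It cannot
// account for a matching pod that has not been scheduled to the node yet,
// which is why scheduler cooperation is argued for above.
func gateSatisfied(selector map[string]string, boundPods []Pod) bool {
	matched := 0
	for _, p := range boundPods {
		if matches(selector, p.Labels) {
			matched++
			if !p.Ready {
				return false
			}
		}
	}
	return matched > 0
}

func main() {
	pods := []Pod{{Labels: map[string]string{"k8s-app": "kube-proxy"}, Ready: true}}
	fmt.Println(gateSatisfied(map[string]string{"k8s-app": "kube-proxy"}, pods)) // true
}
```

Note the `matched > 0` guard: without it, a gate whose DaemonSet pod had not yet been bound would be vacuously satisfied, which is exactly the race described.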

...
// Defines the list of readiness gates for a node
// +optional
ReadinessGates []NodeReadinessGate `json:"readinessGates,omitempty" protobuf:"bytes,7,opt,name=readinessGates"`
@rlenglet rlenglet Jan 6, 2020

If we ignore the implementation details for a moment, this API is not necessary.
The set of pods to wait for is already clearly specified by users: it is the set of pods with a system-cluster-critical or system-node-critical priorityClass.
We shouldn't need any additional input.

I feel that this ReadinessGates API would just be used to replicate the information already specified in priorityClass: if a node has a critical priorityClass it should be matched by a ReadinessGates, and vice-versa I can't think of a case where a pod matched by a ReadinessGates shouldn't have a critical priorityClass.

It seems that introducing ReadinessGates is driven by the desire to implement this 100% in kubelet. However, if the implementation involved the scheduler to determine the set of pods with a critical priorityClass to be scheduled on the node, that new API wouldn't be necessary.


I discussed this with @vishh. I was making the assumption that we'd have to deal with only one scheduler, because I was considering only daemonset pods. But if we want to deal with pods other than daemonsets, which are scheduled by different schedulers, then this becomes complicated because all schedulers would need to support this logic.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 8, 2020
@andrewsykim
Member Author

This needs to be fleshed out more, I'll try to get to it in the next week or so but no guarantees. If other folks want to contribute please reach out :)

@andrewsykim
Member Author

I've been steeping on this for a bit and I think a "conditions/taints hook" from the cloud controllers is sufficient for the use-cases we need (at least for the cloud providers). The most common use-case I hear is preventing nodes from accepting new pods before the cloud provider registers routes for that node. We can allow cloud providers to add a "no routes" taint (or use the existing NetworkUnavailable node condition) assuming the routes controller will remove it later. Worth noting that the GCP provider does this today but via the kubelet. We should make this functionality available to all the external providers. I will open a separate KEP for this use-case.

Cloud providers aside, it seems like there are other valid use-cases (e.g. waiting for system critical pods) worth considering but all of these can be addressed today by passing a custom taint to kubelet (via --register-with-taints) and having a custom controller (or daemonset) to remove the taint later. Although this is probably asking a lot from users, it seems like the right approach (at least for now) since expressing all the possible conditions to block readiness in Node seems unlikely at this point. There are just too many factors/scenarios to consider.

/close
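The workaround described in the closing comment can be sketched end to end: the kubelet registers with a custom taint (via --register-with-taints), and a custom controller later removes it. The taint key below is illustrative, not a reserved name:

```go
package main

import "fmt"

type Taint struct{ Key, Effect string }

// removeTaint returns the taint list with every taint matching key removed,
// modeling the patch a custom controller would apply to the Node object.
func removeTaint(taints []Taint, key string) []Taint {
	out := taints[:0]
	for _, t := range taints {
		if t.Key != key {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	// State after kubelet registration with e.g.
	// --register-with-taints=example.com/not-ready:NoSchedule (flag value illustrative).
	taints := []Taint{{Key: "example.com/not-ready", Effect: "NoSchedule"}}
	// The custom controller removes the taint once its readiness condition holds,
	// e.g. routes registered or a critical DaemonSet's pod Ready.
	taints = removeTaint(taints, "example.com/not-ready")
	fmt.Println(len(taints)) // 0: node now accepts application pods
}
```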

@k8s-ci-robot
Contributor

@andrewsykim: Closed this PR.

In response to this:

I've been steeping on this for a bit and I think a "conditions/taints hook" from the cloud controllers is sufficient for the use-cases we need (at least for the cloud providers). The most common use-case I hear is preventing nodes from accepting new pods before the cloud provider registers routes for that node. We can allow cloud providers to add a "no routes" taint (or use the existing NetworkUnavailable node condition) assuming the routes controller will remove it later. Worth noting that the GCP provider does this today but via the kubelet. We should make this functionality available to all the external providers. I will open a separate KEP for this use-case.

Cloud providers aside, it seems like there are other valid use-cases (e.g. waiting for system critical pods) worth considering but all of these can be addressed today by passing a custom taint to kubelet (via --register-with-taints) and having a custom controller (or daemonset) to remove the taint later. Although this is probably asking a lot from users, it seems like the right approach (at least for now) since expressing all the possible conditions to block readiness in Node seems unlikely at this point. There are just too many factors/scenarios to consider.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
