
Karpenter should taint nodes on startup with karpenter.sh/unregistered:NoSchedule and remove once the node is registered #1049

Closed
jonathan-innis opened this issue Feb 28, 2024 · 8 comments · Fixed by #1336 or aws/karpenter-provider-aws#6388
Labels
kind/feature (Categorizes issue or PR as related to a new feature.), operational-excellence, v1 (Issues requiring resolution by the v1 milestone)

Comments

@jonathan-innis
Member

Description

What problem are you trying to solve?

Karpenter currently doesn't taint nodes in its userData, but it does propagate some labels, annotations, etc. down to nodes when they join the cluster. There is a race between the Karpenter controller applying those labels to the node and the kube-scheduler becoming able to bind pods against it: Karpenter may not have applied the labels yet by the time the kube-scheduler decides it can start scheduling pods against the node.

This problem is more prevalent with DaemonSets, which tolerate the node.kubernetes.io/not-ready taint and can therefore be scheduled against the node as soon as it joins. It's definitely less of a problem for application pods, which have to wait for the node to go Ready, but it can still be an issue if the Karpenter controller is down.
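
For context, the DaemonSet controller adds that toleration to DaemonSet pods by default. A minimal sketch of the relevant toleration using the upstream API types (illustrative only, not Karpenter code):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// The DaemonSet controller adds this toleration (among others) to DaemonSet
	// pods automatically, which is why they can land on a node before it reports Ready.
	notReady := corev1.Toleration{
		Key:      "node.kubernetes.io/not-ready",
		Operator: corev1.TolerationOpExists,
		Effect:   corev1.TaintEffectNoExecute,
	}
	fmt.Printf("DaemonSet default toleration: %+v\n", notReady)
}
```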

This also isn't a large issue in general, since the Karpenter cloudproviders will generally pass all of the labels from a NodeClaim down into the startup script that is handed to the kubelet. That means the labels are present atomically on kubelet start, which again removes the chance of this race occurring.

Realistically, we shouldn't have to rely on this behavior. Similar to the external cloudprovider implementation and its startup taint, we should take the stance that a node gets tainted on startup with a known taint and that Karpenter removes the taint from the node once it has finished its registration process.
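
A rough sketch of the proposal using the upstream API types. The taint key comes from the issue title; the removal helper below is hypothetical, not the eventual implementation:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// Startup taint applied to every node Karpenter launches (key per the issue title).
var unregisteredTaint = corev1.Taint{
	Key:    "karpenter.sh/unregistered",
	Effect: corev1.TaintEffectNoSchedule,
}

// removeUnregisteredTaint is an illustrative helper: once registration
// (label/annotation propagation) succeeds, filter the startup taint out of
// the node's spec so the kube-scheduler can begin binding pods.
func removeUnregisteredTaint(node *corev1.Node) {
	taints := node.Spec.Taints[:0]
	for _, t := range node.Spec.Taints {
		if t.Key == unregisteredTaint.Key && t.Effect == unregisteredTaint.Effect {
			continue
		}
		taints = append(taints, t)
	}
	node.Spec.Taints = taints
}

func main() {
	node := &corev1.Node{Spec: corev1.NodeSpec{Taints: []corev1.Taint{unregisteredTaint}}}
	removeUnregisteredTaint(node)
	fmt.Println("taints remaining:", len(node.Spec.Taints)) // 0
}
```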

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@jonathan-innis added the operational-excellence, kind/feature, and v1 labels on Feb 28, 2024
@jonathan-innis changed the title from "Karpenter should taint nodes on startup with karpenter.sh/unregistered and remove once the node is registered" to "Karpenter should taint nodes on startup with karpenter.sh/unregistered:NoSchedule and remove once the node is registered" on Feb 28, 2024
@jonathan-innis
Member Author

There is some relevant conversation around this topic that I found here as I was exploring what it would look like if Karpenter were to allow users to specify the node-role.kubernetes.io label in the NodePool configuration.

@sftim

sftim commented Mar 1, 2024

  • I think it's reasonable to taint with the NoExecute form as well. Pods that want to tolerate it can do so.
  • We can emit metrics about how many nodes we launched and how many nodes made it to the point that we successfully removed this taint (e.g., per NodePool). Those counters would help make node launch failures more visible (see the sketch below).
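
For illustration, such counters could be wired up with client_golang along these lines (metric names and label sets here are placeholders, not Karpenter's actual metrics):

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Illustrative counters only; names and labels are placeholders.
var (
	nodesLaunched = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "karpenter_nodes_launched_total",
		Help: "Nodes launched, partitioned by NodePool.",
	}, []string{"nodepool"})

	nodesRegistered = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "karpenter_nodes_registered_total",
		Help: "Nodes that had the unregistered taint removed, partitioned by NodePool.",
	}, []string{"nodepool"})
)

func init() {
	prometheus.MustRegister(nodesLaunched, nodesRegistered)
}

// These would be called from the launch and registration paths respectively;
// the gap between the two series surfaces nodes that never registered.
func RecordLaunch(nodePool string)       { nodesLaunched.WithLabelValues(nodePool).Inc() }
func RecordRegistration(nodePool string) { nodesRegistered.WithLabelValues(nodePool).Inc() }
```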

@jonathan-innis
Member Author

I think it's reasonable to taint with the NoExecute form as well

Do we know why the external cloudprovider implementation uses the NoSchedule taint vs. the NoExecute taint? I'd imagine that it realistically doesn't matter much if you're adding the taint on creation of the node, since there's nothing that's going to be scheduled there anyway.

@fmuyassarov
Member

Hi @jonathan-innis. Can I offer to help work on this task?

@rschalo
Contributor

rschalo commented Jun 7, 2024

Do we know why the external cloudprovider implementation uses the NoSchedule taint vs. the NoExecute taint?

As far as I can tell, the NoSchedule taint more or less has the same behavior as NoExecute, except that NoExecute goes further and also evicts already-running pods that don't tolerate it. If racing is a concern, then doesn't the NoExecute taint make more sense because of that eviction behavior?

@jonathan-innis
Member Author

I'm not sure it really makes a huge difference because the taint exists on creation so nothing should schedule anyways.

@rschalo
Contributor

rschalo commented Jun 7, 2024

Fair enough. In that case I think NoExecute still works, because if a customer wants a pod to tolerate the taint, they have more control over how long it stays via tolerationSeconds. WDYT?
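
For reference, tolerationSeconds is only honored on NoExecute tolerations; a pod that wanted to tolerate the startup taint for a bounded window could express that roughly as follows (illustrative values, upstream API types):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// With a NoExecute taint, tolerationSeconds bounds how long the pod may
	// remain on the tainted node before being evicted.
	sixty := int64(60)
	tol := corev1.Toleration{
		Key:               "karpenter.sh/unregistered",
		Operator:          corev1.TolerationOpExists,
		Effect:            corev1.TaintEffectNoExecute,
		TolerationSeconds: &sixty,
	}
	fmt.Printf("%+v\n", tol)
}
```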

@jonathan-innis
Member Author

jonathan-innis commented Jun 8, 2024

If the pod is never scheduled to the node in the first place, I don't think it makes any difference. Seems fine either way to me.
