Karpenter not re-evaluating pending pods #4392

Closed
gomesdigital opened this issue Aug 7, 2023 · 7 comments · Fixed by kubernetes-sigs/karpenter#449
Labels
bug Something isn't working

Comments

@gomesdigital

gomesdigital commented Aug 7, 2023

Description

Observed Behavior:
We see Karpenter nominating a particular node for pending pods, but the pods are never actually scheduled there and are left pending indefinitely. There are no significant taints or selectors involved, and the stuck pods are a completely random set. It doesn't occur for all pods; we usually see the issue during scale-in/out.

We speculate there are at least two possible scenarios that could make this happen:

  1. The node that Karpenter has nominated becomes full before the pending pod is scheduled, e.g. another node is spot-interrupted and its workloads immediately move to the nominated node.
  2. The node that Karpenter has nominated is consolidated before the pending pod is scheduled (similar to Karpenter consolidate new node before actual pod is started kubernetes-sigs/karpenter#685).

In both these scenarios it seems that Karpenter needs to re-evaluate the pod.

Provisioner:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  consolidation:
    enabled: true
  kubeletConfiguration:
    systemReserved:
      cpu: 100m
      memory: 100Mi
      ephemeral-storage: 1Gi
  limits:
    resources:
      cpu: 2500
  providerRef:
    name: default
  requirements:
    - key: kubernetes.io/arch
      operator: In
      values: ['amd64', 'arm64']
    - key: karpenter.sh/capacity-type
      operator: In
      values: ['spot', 'on-demand']
    - key: karpenter.k8s.aws/instance-category
      operator: In
      values: ['c', 'm', 'r']
    - key: karpenter.k8s.aws/instance-family
      operator: NotIn
      values: ['m5', 'm5d', 'm5n', 'm5dn', 'm5a', 'm5ad', 'r5', 'r5d', 'r5n', 'r5dn', 'r5b', 'r5a', 'r5ad']
    - key: karpenter.k8s.aws/instance-generation
      operator: Gt
      values: ['4']
    - key: karpenter.k8s.aws/instance-size
      operator: In
      values: ['2xlarge','4xlarge']
    - key: topology.kubernetes.io/zone
      operator: In
      values: ['eu-central-1a']

Pod events:

  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  6m1s (x4 over 6m31s)  default-scheduler  0/13 nodes are available: 1 node(s) had untolerated taint {monitoring: }, 1 node(s) were unschedulable, 2 node(s) had untolerated taint {eks.amazonaws.com/compute-type: fargate}, 4 Insufficient memory, 8 Insufficient cpu. preemption: 0/13 nodes are available: 4 Preemption is not helpful for scheduling, 9 No preemption victims found for incoming pod..
  Warning  FailedScheduling  5m3s (x2 over 5m14s)  default-scheduler  0/12 nodes are available: 1 node(s) had untolerated taint {monitoring: }, 2 node(s) had untolerated taint {eks.amazonaws.com/compute-type: fargate}, 4 Insufficient memory, 8 Insufficient cpu. preemption: 0/12 nodes are available: 3 Preemption is not helpful for scheduling, 9 No preemption victims found for incoming pod..
  Normal   Nominated         30s (x4 over 6m30s)   karpenter          Pod should schedule on: machine/default-5v524, node/ip-X-Y-16-107.region.compute.internal
  ... FailedScheduling continues ...

Node events:

  Type     Reason                   Age                From                   Message
  ----     ------                   ----               ----                   -------
  Normal   Starting                 10m                kube-proxy             
  Normal   Starting                 11m                kubelet                Starting kubelet.
  Warning  InvalidDiskCapacity      11m                kubelet                invalid capacity 0 on image filesystem
  Normal   NodeHasSufficientMemory  11m (x2 over 11m)  kubelet                Node ip-X-Y-16-107.region.compute.internal status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    11m (x2 over 11m)  kubelet                Node ip-X-Y-16-107.region.compute.internal status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     11m (x2 over 11m)  kubelet                Node ip-X-Y-16-107.region.compute.internal status is now: NodeHasSufficientPID
  Normal   NodeAllocatableEnforced  11m                kubelet                Updated Node Allocatable limit across pods
  Normal   Synced                   10m                cloud-node-controller  Node synced successfully
  Normal   RegisteredNode           10m                node-controller        Node ip-X-Y-16-107.region.compute.internal event: Registered Node ip-X-Y-16-107.region.compute.internal in Controller
  Normal   DeprovisioningBlocked    10m                karpenter              Cannot deprovision node due to machine is not initialized
  Normal   NodeReady                10m                kubelet                Node ip-X-Y-16-107.region.compute.internal status is now: NodeReady
  Normal   DeprovisioningBlocked    40s (x5 over 10m)  karpenter              Cannot deprovision node due to machine is nominated

Karpenter logs:

2023-08-07T03:39:30.799Z DEBUG controller.machine.lifecycle registered machine {"commit": "34d50bf-dirty", "machine": "default-5v524", "provisioner": "default", "provider-id": "aws:///regiona/i-0027fc6792f2c019d", "node": "ip-X-Y-16-107.region.compute.internal"} 2023-08-07T03:39:31.207Z DEBUG controller deleted launch template {"commit": "34d50bf-dirty", "id": "lt-0adc52b170198b060", "name": "karpenter.k8s.aws/13153237635562061995"} 2023-08-07T03:39:31.297Z DEBUG controller deleted launch template {"commit": "34d50bf-dirty", "id": "lt-082261bbc5b2f3567", "name": "karpenter.k8s.aws/16969462482242265312"} 2023-08-07T03:39:31.393Z DEBUG controller deleted launch template {"commit": "34d50bf-dirty", "id": "lt-0323af4e0542c0cd3", "name": "karpenter.k8s.aws/15163484215685057987"} 2023-08-07T03:39:31.482Z DEBUG controller deleted launch template {"commit": "34d50bf-dirty", "id": "lt-0469c9ca84954ab85", "name": "karpenter.k8s.aws/8039010270740481828"} 2023-08-07T03:39:39.865Z DEBUG controller.deprovisioning relaxing soft constraints for pod since it previously failed to schedule, removing: spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0]={"weight":100,"podAffinityTerm":{"labelSelector":{"matchExpressions":[{"key":"linkerd.io/control-plane-component","operator":"In","values":["destination"]}]},"topologyKey":"failure-domain.beta.kubernetes.io/zone"}} {"commit": "34d50bf-dirty", "pod": "linkerd/linkerd-destination-5c6bc8648b-n877s"} ... 2023-08-07T03:39:41.233Z DEBUG controller.deprovisioning relaxing soft constraints for pod since it previously failed to schedule, removing: spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0]={"weight":100,"podAffinityTerm":{"labelSelector":{"matchExpressions":[{"key":"component","operator":"In","values":["metrics-api"]}]},"topologyKey":"failure-domain.beta.kubernetes.io/zone"}} {"commit": "34d50bf-dirty", "pod": "linkerd-viz/metrics-api-86f7647844-kj8k6"} 2023-08-07T03:39:43.123Z DEBUG controller.machine.lifecycle initialized machine {"commit": "34d50bf-dirty", "machine": "default-5v524", "provisioner": "default", "provider-id": "aws:///regiona/i-0027fc6792f2c019d", "node": "ip-X-Y-16-107.region.compute.internal"} 2023-08-07T03:40:07.496Z INFO controller.deprovisioning deprovisioning via consolidation delete, terminating 1 machines ip-X-Y-61-95.region.compute.internal/r6gd.2xlarge/spot {"commit": "34d50bf-dirty"} 2023-08-07T03:40:07.563Z INFO controller.termination cordoned node {"commit": "34d50bf-dirty", "node": "ip-X-Y-61-95.region.compute.internal"} 2023-08-07T03:40:59.669Z INFO controller.termination deleted node {"commit": "34d50bf-dirty", "node": "ip-X-Y-61-95.region.compute.internal"} 2023-08-07T03:40:59.981Z INFO controller.machine.termination deleted machine {"commit": "34d50bf-dirty", "machine": "default-dcsdj", "node": "ip-X-Y-61-95.region.compute.internal", "provisioner": "default", "provider-id": "aws:///regiona/i-022a9d3392d40894e"} 2023-08-07T03:41:19.221Z INFO controller.deprovisioning deprovisioning via consolidation replace, terminating 1 machines ip-X-Y-56-56.region.compute.internal/m6g.4xlarge/on-demand and replacing with spot machine from types m6gd.4xlarge, r6g.4xlarge, r6i.4xlarge, r6id.4xlarge, c5n.4xlarge and 5 other(s) {"commit": "34d50bf-dirty"} 2023-08-07T03:41:19.316Z INFO controller.deprovisioning created machine {"commit": "34d50bf-dirty", "provisioner": "default", "requests": {"cpu":"14090m","memory":"32267Mi","pods":"27"}, "instance-types": "c5n.4xlarge, m6a.4xlarge, 
m6g.4xlarge, m6gd.4xlarge, m6i.4xlarge and 5 other(s)"} 2023-08-07T03:41:19.634Z DEBUG controller.machine.lifecycle discovered subnets {"commit": "34d50bf-dirty", "machine": "default-pwqrw", "provisioner": "default", "subnets": ["subnet-0172226b966a9d7d7 (regiona)"]} 2023-08-07T03:41:22.480Z INFO controller.machine.lifecycle launched machine {"commit": "34d50bf-dirty", "machine": "default-pwqrw", "provisioner": "default", "provider-id": "aws:///regiona/i-072805f5ce8653824", "instance-type": "r6gd.4xlarge", "zone": "regiona", "capacity-type": "spot", "allocatable": {"cpu":"15790m","ephemeral-storage":"16Gi","memory":"118153Mi","pods":"234"}} 2023-08-07T03:41:44.226Z DEBUG controller.machine.lifecycle registered machine {"commit": "34d50bf-dirty", "machine": "default-pwqrw", "provisioner": "default", "provider-id": "aws:///regiona/i-072805f5ce8653824", "node": "ip-X-Y-49-9.region.compute.internal"} 2023-08-07T03:41:56.921Z DEBUG controller.machine.lifecycle initialized machine {"commit": "34d50bf-dirty", "machine": "default-pwqrw", "provisioner": "default", "provider-id": "aws:///regiona/i-072805f5ce8653824", "node": "ip-X-Y-49-9.region.compute.internal"} 2023-08-07T03:42:03.567Z INFO controller.termination cordoned node {"commit": "34d50bf-dirty", "node": "ip-X-Y-56-56.region.compute.internal"} 2023-08-07T03:43:29.335Z INFO controller.termination deleted node {"commit": "34d50bf-dirty", "node": "ip-X-Y-56-56.region.compute.internal"} 2023-08-07T03:43:29.646Z INFO controller.machine.termination deleted machine {"commit": "34d50bf-dirty", "machine": "default-2wfn5", "node": "ip-X-Y-56-56.region.compute.internal", "provisioner": "default", "provider-id": "aws:///regiona/i-07cb0d89e8ce28563"} 2023-08-07T03:43:37.909Z DEBUG controller.deprovisioning relaxing soft constraints for pod since it previously failed to schedule, removing: spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0]={"weight":100,"podAffinityTerm":{"labelSelector":{"matchExpressions":[{"key":"linkerd.io/control-plane-component","operator":"In","values":["destination"]}]},"topologyKey":"failure-domain.beta.kubernetes.io/zone"}} {"commit": "34d50bf-dirty", "pod": "linkerd/linkerd-destination-5c6bc8648b-tlslj"} ... 2023-08-07T03:44:22.832Z DEBUG controller.deprovisioning relaxing soft constraints for pod since it previously failed to schedule, removing: spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0]={"weight":100,"podAffinityTerm":{"labelSelector":{"matchExpressions":[{"key":"component","operator":"In","values":["metrics-api"]}]},"topologyKey":"failure-domain.beta.kubernetes.io/zone"}} {"commit": "34d50bf-dirty", "pod": "linkerd-viz/metrics-api-86f7647844-kj8k6"} 2023-08-07T03:44:54.706Z DEBUG controller.awsnodetemplate discovered subnets {"commit": "34d50bf-dirty", "awsnodetemplate": "default", "subnets": ["subnet-0172226b966a9d7d7 (regiona)"]} 2023-08-07T03:49:28.248Z DEBUG controller.deprovisioning relaxing soft constraints for pod since it previously failed to schedule, removing: spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0]={"weight":100,"podAffinityTerm":{"labelSelector":{"matchExpressions":[{"key":"linkerd.io/control-plane-component","operator":"In","values":["destination"]}]},"topologyKey":"failure-domain.beta.kubernetes.io/zone"}} {"commit": "34d50bf-dirty", "pod": "linkerd/linkerd-destination-5c6bc8648b-tlslj"} ... 
2023-08-07T03:49:30.217Z DEBUG controller.deprovisioning relaxing soft constraints for pod since it previously failed to schedule, removing: spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0]={"weight":100,"podAffinityTerm":{"labelSelector":{"matchExpressions":[{"key":"component","operator":"In","values":["metrics-api"]}]},"topologyKey":"failure-domain.beta.kubernetes.io/zone"}} {"commit": "34d50bf-dirty", "pod": "linkerd-viz/metrics-api-86f7647844-kj8k6"} 2023-08-07T03:49:31.205Z DEBUG controller deleted launch template {"commit": "34d50bf-dirty", "id": "lt-0774941c841d7721c", "name": "karpenter.k8s.aws/18323470363648805825"} 2023-08-07T03:49:31.297Z DEBUG controller deleted launch template {"commit": "34d50bf-dirty", "id": "lt-01eb40992bacab8b1", "name": "karpenter.k8s.aws/2781562212885794394"} 2023-08-07T03:49:31.401Z DEBUG controller deleted launch template {"commit": "34d50bf-dirty", "id": "lt-03c463d42935c6593", "name": "karpenter.k8s.aws/9323694851113280840"} 2023-08-07T03:49:31.492Z DEBUG controller deleted launch template {"commit": "34d50bf-dirty", "id": "lt-03f6cd48dc7a43192", "name": "karpenter.k8s.aws/15790409320365474551"} 2023-08-07T03:49:55.373Z DEBUG controller.awsnodetemplate discovered subnets {"commit": "34d50bf-dirty", "awsnodetemplate": "default", "subnets": ["subnet-0172226b966a9d7d7 (regiona)"]} 2023-08-07T03:54:51.436Z INFO controller.deprovisioning deprovisioning via consolidation replace, terminating 2 machines ip-X-Y-26-11.region.compute.internal/c5a.4xlarge/on-demand, ip-X-Y-7-93.region.compute.internal/r6gd.2xlarge/spot and replacing with on-demand machine from types c7g.4xlarge, c6g.4xlarge {"commit": "34d50bf-dirty"} 2023-08-07T03:54:51.558Z INFO controller.deprovisioning created machine {"commit": "34d50bf-dirty", "provisioner": "default", "requests": {"cpu":"12515m","memory":"26923Mi","pods":"26"}, "instance-types": "c6g.4xlarge, c7g.4xlarge"} 2023-08-07T03:54:51.673Z DEBUG controller.machine.lifecycle discovered subnets {"commit": "34d50bf-dirty", "machine": "default-8f5jm", "provisioner": "default", "subnets": ["subnet-0172226b966a9d7d7 (regiona)"]} 2023-08-07T03:54:52.291Z DEBUG controller.machine.lifecycle created launch template {"commit": "34d50bf-dirty", "machine": "default-8f5jm", "provisioner": "default", "launch-template-name": "karpenter.k8s.aws/16969462482242265312", "id": "lt-06b71cc326740b7d7"} 2023-08-07T03:54:52.752Z DEBUG controller.provisioner waiting on cluster sync {"commit": "34d50bf-dirty"} 2023-08-07T03:54:54.039Z INFO controller.machine.lifecycle launched machine {"commit": "34d50bf-dirty", "machine": "default-8f5jm", "provisioner": "default", "provider-id": "aws:///regiona/i-09041929855182f45", "instance-type": "c6g.4xlarge", "zone": "regiona", "capacity-type": "on-demand", "allocatable": {"cpu":"15790m","ephemeral-storage":"16Gi","memory":"27222Mi","pods":"234"}} 2023-08-07T03:55:15.579Z DEBUG controller.machine.lifecycle registered machine {"commit": "34d50bf-dirty", "machine": "default-8f5jm", "provisioner": "default", "provider-id": "aws:///regiona/i-09041929855182f45", "node": "ip-X-Y-43-164.region.compute.internal"} 2023-08-07T03:55:28.187Z DEBUG controller.machine.lifecycle initialized machine {"commit": "34d50bf-dirty", "machine": "default-8f5jm", "provisioner": "default", "provider-id": "aws:///regiona/i-09041929855182f45", "node": "ip-X-Y-43-164.region.compute.internal"} 2023-08-07T03:55:35.830Z INFO controller.termination cordoned node {"commit": "34d50bf-dirty", "node": 
"ip-X-Y-26-11.region.compute.internal"} 2023-08-07T03:55:35.855Z INFO controller.termination cordoned node {"commit": "34d50bf-dirty", "node": "ip-X-Y-7-93.region.compute.internal"} 2023-08-07T03:55:39.826Z INFO controller.provisioner found provisionable pod(s) {"commit": "34d50bf-dirty", "pods": 16} 2023-08-07T03:55:39.826Z INFO controller.provisioner computed new machine(s) to fit pod(s) {"commit": "34d50bf-dirty", "machines": 1, "pods": 13} 2023-08-07T03:55:39.826Z INFO controller.provisioner computed 3 unready node(s) will fit 3 pod(s) {"commit": "34d50bf-dirty"} 2023-08-07T03:55:39.836Z INFO controller.provisioner created machine {"commit": "34d50bf-dirty", "provisioner": "default", "requests": {"cpu":"6985m","memory":"16923Mi","pods":"19"}, "instance-types": "c5.4xlarge, c5a.4xlarge, c5ad.4xlarge, c5d.4xlarge, c5n.2xlarge and 27 other(s)"} 2023-08-07T03:55:40.027Z DEBUG controller.machine.lifecycle created launch template {"commit": "34d50bf-dirty", "machine": "default-vvjvp", "provisioner": "default", "launch-template-name": "karpenter.k8s.aws/15790409320365474551", "id": "lt-0920d5c0c501a815f"} 2023-08-07T03:55:40.159Z DEBUG controller.machine.lifecycle created launch template {"commit": "34d50bf-dirty", "machine": "default-vvjvp", "provisioner": "default", "launch-template-name": "karpenter.k8s.aws/9323694851113280840", "id": "lt-0f76e882a8c5222e7"} 2023-08-07T03:55:40.303Z DEBUG controller.machine.lifecycle created launch template {"commit": "34d50bf-dirty", "machine": "default-vvjvp", "provisioner": "default", "launch-template-name": "karpenter.k8s.aws/2781562212885794394", "id": "lt-0101dad20a4c23bea"} 2023-08-07T03:55:40.458Z DEBUG controller.machine.lifecycle created launch template {"commit": "34d50bf-dirty", "machine": "default-vvjvp", "provisioner": "default", "launch-template-name": "karpenter.k8s.aws/18323470363648805825", "id": "lt-0acec33ea13b62d15"} 2023-08-07T03:55:41.531Z DEBUG controller.provisioner waiting on cluster sync {"commit": "34d50bf-dirty"} 2023-08-07T03:55:42.466Z INFO controller.machine.lifecycle launched machine {"commit": "34d50bf-dirty", "machine": "default-vvjvp", "provisioner": "default", "provider-id": "aws:///regiona/i-0979c299232ecedaf", "instance-type": "r6gd.2xlarge", "zone": "regiona", "capacity-type": "spot", "allocatable": {"cpu":"7810m","ephemeral-storage":"16Gi","memory":"59468Mi","pods":"58"}} 2023-08-07T03:55:58.268Z INFO controller.termination deleted node {"commit": "34d50bf-dirty", "node": "ip-X-Y-7-93.region.compute.internal"} 2023-08-07T03:55:58.269Z INFO controller.termination deleted node {"commit": "34d50bf-dirty", "node": "ip-X-Y-26-11.region.compute.internal"} 2023-08-07T03:55:58.570Z INFO controller.machine.termination deleted machine {"commit": "34d50bf-dirty", "machine": "default-5zrn9", "node": "ip-X-Y-7-93.region.compute.internal", "provisioner": "default", "provider-id": "aws:///regiona/i-0aabf6f81f91f5f26"} 2023-08-07T03:55:58.572Z INFO controller.machine.termination deleted machine {"commit": "34d50bf-dirty", "machine": "default-t5wg2", "node": "ip-X-Y-26-11.region.compute.internal", "provisioner": "default", "provider-id": "aws:///regiona/i-09cd7ef42da828d3b"} 2023-08-07T03:56:00.526Z DEBUG controller.deprovisioning relaxing soft constraints for pod since it previously failed to schedule, removing: 
spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0]={"weight":100,"podAffinityTerm":{"labelSelector":{"matchExpressions":[{"key":"linkerd.io/control-plane-component","operator":"In","values":["destination"]}]},"topologyKey":"failure-domain.beta.kubernetes.io/zone"}} {"commit": "34d50bf-dirty", "pod": "linkerd/linkerd-destination-5c6bc8648b-n877s"} ... 2023-08-07T03:56:00.919Z DEBUG controller.deprovisioning relaxing soft constraints for pod since it previously failed to schedule, removing: spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0]={"weight":100,"podAffinityTerm":{"labelSelector":{"matchExpressions":[{"key":"component","operator":"In","values":["tap-injector"]}]},"topologyKey":"failure-domain.beta.kubernetes.io/zone"}} {"commit": "34d50bf-dirty", "pod": "linkerd-viz/tap-injector-84c68fc689-4fxf4"} 2023-08-07T03:56:05.009Z DEBUG controller.machine.lifecycle registered machine {"commit": "34d50bf-dirty", "machine": "default-vvjvp", "provisioner": "default", "provider-id": "aws:///regiona/i-0979c299232ecedaf", "node": "ip-X-Y-20-234.region.compute.internal"} 2023-08-07T03:56:12.333Z DEBUG controller.deprovisioning relaxing soft constraints for pod since it previously failed to schedule, removing: spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0]={"weight":100,"podAffinityTerm":{"labelSelector":{"matchExpressions":[{"key":"linkerd.io/control-plane-component","operator":"In","values":["destination"]}]},"topologyKey":"failure-domain.beta.kubernetes.io/zone"}} {"commit": "34d50bf-dirty", "pod": "linkerd/linkerd-destination-5c6bc8648b-n877s"} ... 2023-08-07T03:56:12.421Z DEBUG controller.deprovisioning relaxing soft constraints for pod since it previously failed to schedule, removing: spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0]={"weight":100,"podAffinityTerm":{"labelSelector":{"matchExpressions":[{"key":"component","operator":"In","values":["tap-injector"]}]},"topologyKey":"failure-domain.beta.kubernetes.io/zone"}} {"commit": "34d50bf-dirty", "pod": "linkerd-viz/tap-injector-84c68fc689-4fxf4"} 2023-08-07T03:56:17.506Z DEBUG controller.machine.lifecycle initialized machine {"commit": "34d50bf-dirty", "machine": "default-vvjvp", "provisioner": "default", "provider-id": "aws:///regiona/i-0979c299232ecedaf", "node": "ip-X-Y-20-234.region.compute.internal"} 2023-08-07T03:56:17.623Z INFO controller.provisioner found provisionable pod(s) {"commit": "34d50bf-dirty", "pods": 1} 2023-08-07T03:56:17.623Z INFO controller.provisioner computed new machine(s) to fit pod(s) {"commit": "34d50bf-dirty", "machines": 1, "pods": 1} 2023-08-07T03:56:17.639Z INFO controller.provisioner created machine {"commit": "34d50bf-dirty", "provisioner": "default", "requests": {"cpu":"2650m","memory":"2389Mi","pods":"7"}, "instance-types": "c5.2xlarge, c5.4xlarge, c5a.2xlarge, c5a.4xlarge, c5ad.2xlarge and 38 other(s)"} 2023-08-07T03:56:17.777Z DEBUG controller.machine.lifecycle discovered subnets {"commit": "34d50bf-dirty", "machine": "default-9bf66", "provisioner": "default", "subnets": ["subnet-0172226b966a9d7d7 (regiona)"]} 2023-08-07T03:56:19.311Z DEBUG controller.provisioner waiting on cluster sync {"commit": "34d50bf-dirty"} 2023-08-07T03:56:20.629Z INFO controller.machine.lifecycle launched machine {"commit": "34d50bf-dirty", "machine": "default-9bf66", "provisioner": "default", "provider-id": "aws:///regiona/i-0205c08f3e7cf4037", "instance-type": "r6gd.2xlarge", "zone": "regiona", 
"capacity-type": "spot", "allocatable": {"cpu":"7810m","ephemeral-storage":"16Gi","memory":"59468Mi","pods":"58"}} 2023-08-07T03:56:23.430Z DEBUG controller.deprovisioning relaxing soft constraints for pod since it previously failed to schedule, removing: spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0]={"weight":100,"podAffinityTerm":{"labelSelector":{"matchExpressions":[{"key":"linkerd.io/control-plane-component","operator":"In","values":["destination"]}]},"topologyKey":"failure-domain.beta.kubernetes.io/zone"}} {"commit": "34d50bf-dirty", "pod": "linkerd/linkerd-destination-5c6bc8648b-n877s"} ... 2023-08-07T03:56:23.437Z DEBUG controller.deprovisioning relaxing soft constraints for pod since it previously failed to schedule, removing: spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0]={"weight":100,"podAffinityTerm":{"labelSelector":{"matchExpressions":[{"key":"component","operator":"In","values":["tap-injector"]}]},"topologyKey":"failure-domain.beta.kubernetes.io/zone"}} {"commit": "34d50bf-dirty", "pod": "linkerd-viz/tap-injector-84c68fc689-4fxf4"} 2023-08-07T03:56:41.782Z DEBUG controller.machine.lifecycle registered machine {"commit": "34d50bf-dirty", "machine": "default-9bf66", "provisioner": "default", "provider-id": "aws:///regiona/i-0205c08f3e7cf4037", "node": "ip-X-Y-40-78.region.compute.internal"} 2023-08-07T03:56:44.936Z DEBUG controller.deprovisioning relaxing soft constraints for pod since it previously failed to schedule, removing: spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0]={"weight":100,"podAffinityTerm":{"labelSelector":{"matchExpressions":[{"key":"linkerd.io/control-plane-component","operator":"In","values":["destination"]}]},"topologyKey":"failure-domain.beta.kubernetes.io/zone"}} {"commit": "34d50bf-dirty", "pod": "linkerd/linkerd-destination-5c6bc8648b-n877s"} ... 
2023-08-07T03:56:45.223Z DEBUG controller.deprovisioning relaxing soft constraints for pod since it previously failed to schedule, removing: spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0]={"weight":100,"podAffinityTerm":{"labelSelector":{"matchExpressions":[{"key":"component","operator":"In","values":["metrics-api"]}]},"topologyKey":"failure-domain.beta.kubernetes.io/zone"}} {"commit": "34d50bf-dirty", "pod": "linkerd-viz/metrics-api-86f7647844-kj8k6"} 2023-08-07T03:56:54.033Z DEBUG controller.machine.lifecycle initialized machine {"commit": "34d50bf-dirty", "machine": "default-9bf66", "provisioner": "default", "provider-id": "aws:///regiona/i-0205c08f3e7cf4037", "node": "ip-X-Y-40-78.region.compute.internal"} 2023-08-07T03:57:11.919Z INFO controller.deprovisioning deprovisioning via consolidation replace, terminating 1 machines ip-X-Y-38-180.region.compute.internal/c5a.4xlarge/on-demand and replacing with on-demand machine from types m6a.2xlarge, c5n.2xlarge, m6id.2xlarge, c5d.2xlarge, c6i.2xlarge and 7 other(s) {"commit": "34d50bf-dirty"} 2023-08-07T03:57:11.949Z INFO controller.deprovisioning created machine {"commit": "34d50bf-dirty", "provisioner": "default", "requests": {"cpu":"7360m","memory":"10595Mi","pods":"13"}, "instance-types": "c5.2xlarge, c5a.2xlarge, c5ad.2xlarge, c5d.2xlarge, c5n.2xlarge and 7 other(s)"} 2023-08-07T03:57:12.468Z DEBUG controller.machine.lifecycle created launch template {"commit": "34d50bf-dirty", "machine": "default-vjfx6", "provisioner": "default", "launch-template-name": "karpenter.k8s.aws/15163484215685057987", "id": "lt-0478331164a2b8f3d"} 2023-08-07T03:57:14.173Z INFO controller.machine.lifecycle launched machine {"commit": "34d50bf-dirty", "machine": "default-vjfx6", "provisioner": "default", "provider-id": "aws:///regiona/i-0747ef27539a821e7", "instance-type": "c5a.2xlarge", "zone": "regiona", "capacity-type": "on-demand", "allocatable": {"cpu":"7810m","ephemeral-storage":"16Gi","memory":"14062Mi","pods":"58"}} 2023-08-07T03:57:33.853Z DEBUG controller.machine.lifecycle registered machine {"commit": "34d50bf-dirty", "machine": "default-vjfx6", "provisioner": "default", "provider-id": "aws:///regiona/i-0747ef27539a821e7", "node": "ip-X-Y-47-1.region.compute.internal"} 2023-08-07T03:57:46.524Z DEBUG controller.machine.lifecycle initialized machine {"commit": "34d50bf-dirty", "machine": "default-vjfx6", "provisioner": "default", "provider-id": "aws:///regiona/i-0747ef27539a821e7", "node": "ip-X-Y-47-1.region.compute.internal"} 2023-08-07T03:57:56.111Z INFO controller.termination cordoned node {"commit": "34d50bf-dirty", "node": "ip-X-Y-38-180.region.compute.internal"} 2023-08-07T03:57:59.384Z INFO controller.provisioner found provisionable pod(s) {"commit": "34d50bf-dirty", "pods": 3} 2023-08-07T03:57:59.384Z INFO controller.provisioner computed new machine(s) to fit pod(s) {"commit": "34d50bf-dirty", "machines": 1, "pods": 3} 2023-08-07T03:57:59.393Z INFO controller.provisioner created machine {"commit": "34d50bf-dirty", "provisioner": "default", "requests": {"cpu":"3860m","memory":"6275Mi","pods":"9"}, "instance-types": "c5.2xlarge, c5.4xlarge, c5a.2xlarge, c5a.4xlarge, c5ad.2xlarge and 22 other(s)"} 2023-08-07T03:57:59.522Z DEBUG controller.machine.lifecycle discovered subnets {"commit": "34d50bf-dirty", "machine": "default-v7kqj", "provisioner": "default", "subnets": ["subnet-0172226b966a9d7d7 (regiona)"]} 2023-08-07T03:58:01.976Z INFO controller.machine.lifecycle launched machine {"commit": "34d50bf-dirty", 
"machine": "default-v7kqj", "provisioner": "default", "provider-id": "aws:///regiona/i-0ab792dc0cca9b3b4", "instance-type": "c6id.2xlarge", "zone": "regiona", "capacity-type": "spot", "allocatable": {"cpu":"7810m","ephemeral-storage":"16Gi","memory":"14062Mi","pods":"58"}} 2023-08-07T03:58:20.964Z DEBUG controller.machine.lifecycle registered machine {"commit": "34d50bf-dirty", "machine": "default-v7kqj", "provisioner": "default", "provider-id": "aws:///regiona/i-0ab792dc0cca9b3b4", "node": "ip-X-Y-43-172.region.compute.internal"} 2023-08-07T03:58:25.512Z INFO controller.termination deleted node {"commit": "34d50bf-dirty", "node": "ip-X-Y-38-180.region.compute.internal"} 2023-08-07T03:58:25.774Z INFO controller.machine.termination deleted machine {"commit": "34d50bf-dirty", "machine": "default-gxz6b", "node": "ip-X-Y-38-180.region.compute.internal", "provisioner": "default", "provider-id": "aws:///regiona/i-0d29698b393dc79bd"} 2023-08-07T03:58:31.012Z DEBUG controller.machine.lifecycle initialized machine {"commit": "34d50bf-dirty", "machine": "default-v7kqj", "provisioner": "default", "provider-id": "aws:///regiona/i-0ab792dc0cca9b3b4", "node": "ip-X-Y-43-172.region.compute.internal"} 2023-08-07T03:58:46.385Z INFO controller.deprovisioning deprovisioning via consolidation delete, terminating 1 machines ip-X-Y-20-234.region.compute.internal/r6gd.2xlarge/spot {"commit": "34d50bf-dirty"} 2023-08-07T03:58:46.445Z INFO controller.termination cordoned node {"commit": "34d50bf-dirty", "node": "ip-X-Y-20-234.region.compute.internal"} 2023-08-07T03:58:49.765Z INFO controller.provisioner found provisionable pod(s) {"commit": "34d50bf-dirty", "pods": 3} 2023-08-07T03:58:49.765Z INFO controller.provisioner computed new machine(s) to fit pod(s) {"commit": "34d50bf-dirty", "machines": 1, "pods": 1} 2023-08-07T03:58:49.765Z INFO controller.provisioner computed 2 unready node(s) will fit 2 pod(s) {"commit": "34d50bf-dirty"} 2023-08-07T03:58:49.777Z INFO controller.provisioner created machine {"commit": "34d50bf-dirty", "provisioner": "default", "requests": {"cpu":"1500m","memory":"2261Mi","pods":"7"}, "instance-types": "c5.2xlarge, c5.4xlarge, c5a.2xlarge, c5a.4xlarge, c5ad.2xlarge and 38 other(s)"} 2023-08-07T03:58:52.221Z INFO controller.machine.lifecycle launched machine {"commit": "34d50bf-dirty", "machine": "default-94xdg", "provisioner": "default", "provider-id": "aws:///regiona/i-0f3dd2c31af754f6c", "instance-type": "r6gd.2xlarge", "zone": "regiona", "capacity-type": "spot", "allocatable": {"cpu":"7810m","ephemeral-storage":"16Gi","memory":"59468Mi","pods":"58"}} 2023-08-07T03:59:09.929Z INFO controller.termination deleted node {"commit": "34d50bf-dirty", "node": "ip-X-Y-20-234.region.compute.internal"} 2023-08-07T03:59:10.217Z INFO controller.machine.termination deleted machine {"commit": "34d50bf-dirty", "machine": "default-vvjvp", "node": "ip-X-Y-20-234.region.compute.internal", "provisioner": "default", "provider-id": "aws:///regiona/i-0979c299232ecedaf"} 2023-08-07T03:59:10.939Z DEBUG controller.deprovisioning relaxing soft constraints for pod since it previously failed to schedule, removing: spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0]={"weight":100,"podAffinityTerm":{"labelSelector":{"matchExpressions":[{"key":"linkerd.io/control-plane-component","operator":"In","values":["destination"]}]},"topologyKey":"failure-domain.beta.kubernetes.io/zone"}} {"commit": "34d50bf-dirty", "pod": "linkerd/linkerd-destination-5c6bc8648b-n877s"} ... 
2023-08-07T03:59:11.536Z DEBUG controller.deprovisioning relaxing soft constraints for pod since it previously failed to schedule, removing: spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0]={"weight":100,"podAffinityTerm":{"labelSelector":{"matchExpressions":[{"key":"component","operator":"In","values":["metrics-api"]}]},"topologyKey":"failure-domain.beta.kubernetes.io/zone"}} {"commit": "34d50bf-dirty", "pod": "linkerd-viz/metrics-api-86f7647844-kj8k6"} 2023-08-07T03:59:12.076Z DEBUG controller.machine.lifecycle registered machine {"commit": "34d50bf-dirty", "machine": "default-94xdg", "provisioner": "default", "provider-id": "aws:///regiona/i-0f3dd2c31af754f6c", "node": "ip-X-Y-15-143.region.compute.internal"} 2023-08-07T03:59:21.926Z DEBUG controller.deprovisioning relaxing soft constraints for pod since it previously failed to schedule, removing: spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0]={"weight":100,"podAffinityTerm":{"labelSelector":{"matchExpressions":[{"key":"linkerd.io/control-plane-component","operator":"In","values":["destination"]}]},"topologyKey":"failure-domain.beta.kubernetes.io/zone"}} {"commit": "34d50bf-dirty", "pod": "linkerd/linkerd-destination-5c6bc8648b-n877s"} ... 2023-08-07T03:59:22.512Z DEBUG controller.deprovisioning relaxing soft constraints for pod since it previously failed to schedule, removing: spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0]={"weight":100,"podAffinityTerm":{"labelSelector":{"matchExpressions":[{"key":"component","operator":"In","values":["metrics-api"]}]},"topologyKey":"failure-domain.beta.kubernetes.io/zone"}} {"commit": "34d50bf-dirty", "pod": "linkerd-viz/metrics-api-86f7647844-kj8k6"} 2023-08-07T03:59:23.793Z DEBUG controller.machine.lifecycle initialized machine {"commit": "34d50bf-dirty", "machine": "default-94xdg", "provisioner": "default", "provider-id": "aws:///regiona/i-0f3dd2c31af754f6c", "node": "ip-X-Y-15-143.region.compute.internal"} 2023-08-07T03:59:31.202Z DEBUG controller deleted launch template {"commit": "34d50bf-dirty", "id": "lt-0478331164a2b8f3d", "name": "karpenter.k8s.aws/15163484215685057987"} 2023-08-07T03:59:31.293Z DEBUG controller deleted launch template {"commit": "34d50bf-dirty", "id": "lt-06b71cc326740b7d7", "name": "karpenter.k8s.aws/16969462482242265312"} 2023-08-07T03:59:32.718Z DEBUG controller.deprovisioning relaxing soft constraints for pod since it previously failed to schedule, removing: spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0]={"weight":100,"podAffinityTerm":{"labelSelector":{"matchExpressions":[{"key":"linkerd.io/control-plane-component","operator":"In","values":["destination"]}]},"topologyKey":"failure-domain.beta.kubernetes.io/zone"}} {"commit": "34d50bf-dirty", "pod": "linkerd/linkerd-destination-5c6bc8648b-n877s"} ... 
2023-08-07T03:59:33.115Z DEBUG controller.deprovisioning relaxing soft constraints for pod since it previously failed to schedule, removing: spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0]={"weight":100,"podAffinityTerm":{"labelSelector":{"matchExpressions":[{"key":"component","operator":"In","values":["tap-injector"]}]},"topologyKey":"failure-domain.beta.kubernetes.io/zone"}} {"commit": "34d50bf-dirty", "pod": "linkerd-viz/tap-injector-84c68fc689-4fxf4"} 2023-08-07T03:59:48.509Z INFO controller.deprovisioning deprovisioning via consolidation delete, terminating 1 machines ip-X-Y-40-78.region.compute.internal/r6gd.2xlarge/spot {"commit": "34d50bf-dirty"} 2023-08-07T03:59:48.574Z INFO controller.termination cordoned node {"commit": "34d50bf-dirty", "node": "ip-X-Y-40-78.region.compute.internal"} 2023-08-07T03:59:51.577Z INFO controller.provisioner found provisionable pod(s) {"commit": "34d50bf-dirty", "pods": 3} 2023-08-07T03:59:51.577Z INFO controller.provisioner computed new machine(s) to fit pod(s) {"commit": "34d50bf-dirty", "machines": 1, "pods": 1} 2023-08-07T03:59:51.577Z INFO controller.provisioner computed 1 unready node(s) will fit 2 pod(s) {"commit": "34d50bf-dirty"} 2023-08-07T03:59:51.588Z INFO controller.provisioner created machine {"commit": "34d50bf-dirty", "provisioner": "default", "requests": {"cpu":"2650m","memory":"2389Mi","pods":"7"}, "instance-types": "c5.2xlarge, c5.4xlarge, c5a.2xlarge, c5a.4xlarge, c5ad.2xlarge and 38 other(s)"} 2023-08-07T03:59:51.723Z DEBUG controller.machine.lifecycle discovered subnets {"commit": "34d50bf-dirty", "machine": "default-4lxp8", "provisioner": "default", "subnets": ["subnet-0172226b966a9d7d7 (regiona)"]} 2023-08-07T03:59:54.435Z INFO controller.machine.lifecycle launched machine {"commit": "34d50bf-dirty", "machine": "default-4lxp8", "provisioner": "default", "provider-id": "aws:///regiona/i-05c5a13ed53b6d9a4", "instance-type": "r6gd.2xlarge", "zone": "regiona", "capacity-type": "spot", "allocatable": {"cpu":"7810m","ephemeral-storage":"16Gi","memory":"59468Mi","pods":"58"}} 2023-08-07T04:00:17.495Z DEBUG controller.machine.lifecycle registered machine {"commit": "34d50bf-dirty", "machine": "default-4lxp8", "provisioner": "default", "provider-id": "aws:///regiona/i-05c5a13ed53b6d9a4", "node": "ip-X-Y-2-238.region.compute.internal"} 2023-08-07T04:00:18.052Z INFO controller.termination deleted node {"commit": "34d50bf-dirty", "node": "ip-X-Y-40-78.region.compute.internal"} 2023-08-07T04:00:18.400Z INFO controller.machine.termination deleted machine {"commit": "34d50bf-dirty", "machine": "default-9bf66", "node": "ip-X-Y-40-78.region.compute.internal", "provisioner": "default", "provider-id": "aws:///regiona/i-0205c08f3e7cf4037"} 2023-08-07T04:00:24.222Z DEBUG controller.deprovisioning relaxing soft constraints for pod since it previously failed to schedule, removing: spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0]={"weight":100,"podAffinityTerm":{"labelSelector":{"matchExpressions":[{"key":"linkerd.io/control-plane-component","operator":"In","values":["destination"]}]},"topologyKey":"failure-domain.beta.kubernetes.io/zone"}} {"commit": "34d50bf-dirty", "pod": "linkerd/linkerd-destination-5c6bc8648b-66pk5"} ... 
2023-08-07T04:00:24.240Z DEBUG controller.deprovisioning relaxing soft constraints for pod since it previously failed to schedule, removing: spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0]={"weight":100,"podAffinityTerm":{"labelSelector":{"matchExpressions":[{"key":"component","operator":"In","values":["metrics-api"]}]},"topologyKey":"failure-domain.beta.kubernetes.io/zone"}} {"commit": "34d50bf-dirty", "pod": "linkerd-viz/metrics-api-86f7647844-kj8k6"} 2023-08-07T04:00:29.249Z DEBUG controller.machine.lifecycle initialized machine {"commit": "34d50bf-dirty", "machine": "default-4lxp8", "provisioner": "default", "provider-id": "aws:///regiona/i-05c5a13ed53b6d9a4", "node": "ip-X-Y-2-238.region.compute.internal"} 2023-08-07T04:00:50.326Z INFO controller.deprovisioning deprovisioning via consolidation replace, terminating 1 machines ip-X-Y-47-1.region.compute.internal/c5a.2xlarge/on-demand and replacing with on-demand machine from types c7g.2xlarge, c6g.2xlarge {"commit": "34d50bf-dirty"} 2023-08-07T04:00:50.359Z INFO controller.deprovisioning created machine {"commit": "34d50bf-dirty", "provisioner": "default", "requests": {"cpu":"2250m","memory":"2285Mi","pods":"8"}, "instance-types": "c6g.2xlarge, c7g.2xlarge"} 2023-08-07T04:00:50.563Z DEBUG controller.machine.lifecycle created launch template {"commit": "34d50bf-dirty", "machine": "default-2g8tb", "provisioner": "default", "launch-template-name": "karpenter.k8s.aws/13153237635562061995", "id": "lt-0e0a68a68b5a3c8f9"} 2023-08-07T04:00:52.148Z INFO controller.machine.lifecycle launched machine {"commit": "34d50bf-dirty", "machine": "default-2g8tb", "provisioner": "default", "provider-id": "aws:///regiona/i-0cc6f51751fbd27cf", "instance-type": "c6g.2xlarge", "zone": "regiona", "capacity-type": "on-demand", "allocatable": {"cpu":"7810m","ephemeral-storage":"16Gi","memory":"14003Mi","pods":"58"}} 2023-08-07T04:01:17.224Z DEBUG controller.machine.lifecycle registered machine {"commit": "34d50bf-dirty", "machine": "default-2g8tb", "provisioner": "default", "provider-id": "aws:///regiona/i-0cc6f51751fbd27cf", "node": "ip-X-Y-60-127.region.compute.internal"} 2023-08-07T04:01:29.956Z DEBUG controller.machine.lifecycle initialized machine {"commit": "34d50bf-dirty", "machine": "default-2g8tb", "provisioner": "default", "provider-id": "aws:///regiona/i-0cc6f51751fbd27cf", "node": "ip-X-Y-60-127.region.compute.internal"} 2023-08-07T04:01:34.493Z INFO controller.termination cordoned node {"commit": "34d50bf-dirty", "node": "ip-X-Y-47-1.region.compute.internal"} 2023-08-07T04:01:56.625Z INFO controller.termination deleted node {"commit": "34d50bf-dirty", "node": "ip-X-Y-47-1.region.compute.internal"} 2023-08-07T04:01:56.925Z INFO controller.machine.termination deleted machine {"commit": "34d50bf-dirty", "machine": "default-vjfx6", "node": "ip-X-Y-47-1.region.compute.internal", "provisioner": "default", "provider-id": "aws:///regiona/i-0747ef27539a821e7"} 2023-08-07T04:02:14.713Z INFO controller.deprovisioning deprovisioning via consolidation delete, terminating 1 machines ip-X-Y-15-143.region.compute.internal/r6gd.2xlarge/spot {"commit": "34d50bf-dirty"} 2023-08-07T04:02:14.772Z INFO controller.termination cordoned node {"commit": "34d50bf-dirty", "node": "ip-X-Y-15-143.region.compute.internal"} 2023-08-07T04:02:17.724Z INFO controller.provisioner found provisionable pod(s) {"commit": "34d50bf-dirty", "pods": 3} 2023-08-07T04:02:17.724Z INFO controller.provisioner computed new machine(s) to fit pod(s) {"commit": 
"34d50bf-dirty", "machines": 1, "pods": 2} 2023-08-07T04:02:17.724Z INFO controller.provisioner computed 1 unready node(s) will fit 1 pod(s) {"commit": "34d50bf-dirty"} 2023-08-07T04:02:17.821Z INFO controller.provisioner created machine {"commit": "34d50bf-dirty", "provisioner": "default", "requests": {"cpu":"2790m","memory":"4373Mi","pods":"8"}, "instance-types": "c5.2xlarge, c5.4xlarge, c5a.2xlarge, c5a.4xlarge, c5ad.2xlarge and 38 other(s)"} 2023-08-07T04:02:18.021Z DEBUG controller.machine.lifecycle discovered subnets {"commit": "34d50bf-dirty", "machine": "default-rsfsk", "provisioner": "default", "subnets": ["subnet-0172226b966a9d7d7 (regiona)"]} 2023-08-07T04:02:20.707Z INFO controller.machine.lifecycle launched machine {"commit": "34d50bf-dirty", "machine": "default-rsfsk", "provisioner": "default", "provider-id": "aws:///regiona/i-0050f6b3cba1f6d35", "instance-type": "r6gd.2xlarge", "zone": "regiona", "capacity-type": "spot", "allocatable": {"cpu":"7810m","ephemeral-storage":"16Gi","memory":"59468Mi","pods":"58"}}

Expected Behavior:
The pending pod should be scheduled onto the node that was nominated, or, if that fails, a new node should be nominated and the pod should be scheduled there instead.

Reproduction Steps (Please include YAML):
We were only able to reproduce the issue by tearing down our workloads completely and then re-applying them to the cluster, causing node churn. The issue is otherwise intermittent.

Versions:

  • Chart Version: v0.29.2
  • Kubernetes Version (kubectl version): v1.27.3-eks-a5565ad
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@gomesdigital gomesdigital added the bug Something isn't working label Aug 7, 2023
@tzneal
Contributor

tzneal commented Aug 7, 2023

When this happens, can you run kubectl describe pod pod-name on the pod, and then kubectl describe node node-name on the node that we think it will schedule to?

The only thing I'm aware of that causes this is if we launch a node that doesn't go ready, but that node should now be removed after 15 minutes of failing to register.

@gomesdigital
Author

In our case the node does go ready and it runs other workloads.

Here are the full details:

Describe pod

Name:             pod-c
Namespace:        default
Priority:         0
Service Account:  default
Node:             <none>
Labels:           app=pod-c
                  pod-template-hash=bb48cc698
                  pool.type=default
                  version.type=released
Annotations:      prometheus.io/port: 8968
                  prometheus.io/scrape: true
Status:           Pending
IP:               
IPs:              <none>
Controlled By:    ReplicaSet/pod-c-bb48cc698
Containers:
  pod-c:
    Image:       account-id.dkr.ecr.region.amazonaws.com/pod-c:redacted
    Ports:       8967/TCP, 8968/TCP, 8969/TCP
    Host Ports:  0/TCP, 0/TCP, 0/TCP
    Limits:
      memory:  1Gi
    Requests:
      cpu:      500m
      memory:   1Gi
    Readiness:  http-get http://:http/up delay=0s timeout=5s period=10s
    Environment Variables from:
      cluster-info  ConfigMap  Optional: false
    Environment:
      ...redacted...
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-v4bfl (ro)
  jaeger-agent:
    Image:       jaegertracing/jaeger-agent:1.39.0
    Ports:       5778/TCP, 6831/UDP
    Host Ports:  0/TCP, 0/UDP
    Args:
      --reporter.grpc.host-port=dns:///jaeger-collector-headless.tracing:14250
      --reporter.grpc.discovery.min-peers=10
    Limits:
      memory:  128Mi
    Requests:
      cpu:        10m
      memory:     64Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-v4bfl (ro)
Readiness Gates:
  Type                                                     Status
  target-health.elbv2.k8s.aws/k8s-default-pod-c            <none> 
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  kube-api-access-v4bfl:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  6m1s (x4 over 6m31s)  default-scheduler  0/13 nodes are available: 1 node(s) had untolerated taint {monitoring: }, 1 node(s) were unschedulable, 2 node(s) had untolerated taint {eks.amazonaws.com/compute-type: fargate}, 4 Insufficient memory, 8 Insufficient cpu. preemption: 0/13 nodes are available: 4 Preemption is not helpful for scheduling, 9 No preemption victims found for incoming pod..
  Warning  FailedScheduling  5m3s (x2 over 5m14s)  default-scheduler  0/12 nodes are available: 1 node(s) had untolerated taint {monitoring: }, 2 node(s) had untolerated taint {eks.amazonaws.com/compute-type: fargate}, 4 Insufficient memory, 8 Insufficient cpu. preemption: 0/12 nodes are available: 3 Preemption is not helpful for scheduling, 9 No preemption victims found for incoming pod..
  Normal   Nominated         30s (x4 over 6m30s)   karpenter          Pod should schedule on: machine/default-5v524, node/ip-X-Y-16-107.region.compute.internal
  ... FailedScheduling ...

Describe node

Name:               ip-X-Y-16-107.region.compute.internal
Roles:              <none>
Labels:             beta.kubernetes.io/arch=arm64
                    beta.kubernetes.io/instance-type=r6gd.2xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=region
                    failure-domain.beta.kubernetes.io/zone=zone
                    k8s.io/cloud-provider-aws=f275cbb56d826c5225e870ca953827c7
                    karpenter.k8s.aws/instance-category=r
                    karpenter.k8s.aws/instance-cpu=8
                    karpenter.k8s.aws/instance-encryption-in-transit-supported=false
                    karpenter.k8s.aws/instance-family=r6gd
                    karpenter.k8s.aws/instance-generation=6
                    karpenter.k8s.aws/instance-hypervisor=nitro
                    karpenter.k8s.aws/instance-local-nvme=474
                    karpenter.k8s.aws/instance-memory=65536
                    karpenter.k8s.aws/instance-network-bandwidth=2500
                    karpenter.k8s.aws/instance-pods=58
                    karpenter.k8s.aws/instance-size=2xlarge
                    karpenter.sh/capacity-type=spot
                    karpenter.sh/initialized=true
                    karpenter.sh/provisioner-name=default
                    karpenter.sh/registered=true
                    kubernetes.io/arch=arm64
                    kubernetes.io/hostname=ip-X-Y-16-107.region.compute.internal
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=r6gd.2xlarge
                    topology.ebs.csi.aws.com/zone=zone
                    topology.kubernetes.io/region=region
                    topology.kubernetes.io/zone=zone
Annotations:        alpha.kubernetes.io/provided-node-ip: X.Y.16.107
                    csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"redacted"}
                    karpenter.sh/managed-by: redacted
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 07 Aug 2023 15:39:30 +1200
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  ip-X-Y-16-107.region.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Mon, 07 Aug 2023 15:50:24 +1200
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Mon, 07 Aug 2023 15:45:37 +1200   Mon, 07 Aug 2023 15:39:30 +1200   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Mon, 07 Aug 2023 15:45:37 +1200   Mon, 07 Aug 2023 15:39:30 +1200   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Mon, 07 Aug 2023 15:45:37 +1200   Mon, 07 Aug 2023 15:39:30 +1200   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Mon, 07 Aug 2023 15:45:37 +1200   Mon, 07 Aug 2023 15:39:42 +1200   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:   X.Y.16.107
  ExternalIP:   redacted
  InternalDNS:  ip-X-Y-16-107.region.compute.internal
  Hostname:     ip-X-Y-16-107.region.compute.internal
  ExternalDNS:  redacted.region.compute.amazonaws.com
Capacity:
  cpu:                8
  ephemeral-storage:  20949996Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  hugepages-32Mi:     0
  hugepages-64Ki:     0
  memory:             65034728Ki
  pods:               58
Allocatable:
  cpu:                7810m
  ephemeral-storage:  17160032634
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  hugepages-32Mi:     0
  hugepages-64Ki:     0
  memory:             63915496Ki
  pods:               58
System Info:
  Machine ID:                 ec23423095f1d606d50dadd119e81dc8
  System UUID:                ec234230-95f1-d606-d50d-add119e81dc8
  Boot ID:                    db569762-b5bf-4183-99b5-ef6537d59299
  Kernel Version:             5.10.184-175.749.amzn2.aarch64
  OS Image:                   Amazon Linux 2
  Operating System:           linux
  Architecture:               arm64
  Container Runtime Version:  containerd://1.6.19
  Kubelet Version:            v1.27.3-eks-a5565ad
  Kube-Proxy Version:         v1.27.3-eks-a5565ad
ProviderID:                   aws:///region/redacted
Non-terminated Pods:          (8 in total)
  Namespace                   Name                                         CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                         ------------  ----------  ---------------  -------------  ---
  default                     pod-a                                        5510m (70%)   0 (0%)      8256Mi (13%)     8320Mi (13%)   10m
  default                     pod-b                                        2010m (25%)   0 (0%)      1984Mi (3%)      2Gi (3%)       11m
  kube-system                 aws-node-cldnd                               10m (0%)      0 (0%)      100Mi (0%)       0 (0%)         11m
  kube-system                 aws-node-termination-handler-gnplx           0 (0%)        0 (0%)      0 (0%)           0 (0%)         10m
  kube-system                 ebs-csi-node-w7zlg                           30m (0%)      0 (0%)      120Mi (0%)       768Mi (1%)     11m
  kube-system                 kube-proxy-48jh9                             100m (1%)     0 (0%)      0 (0%)           0 (0%)         11m
  logging                     fluent-bit-8vwrt                             100m (1%)     0 (0%)      32Mi (0%)        128Mi (0%)     10m
  monitoring                  prometheus-prometheus-node-exporter-nktcm    0 (0%)        0 (0%)      25Mi (0%)        0 (0%)         11m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests       Limits
  --------           --------       ------
  cpu                7760m (99%)    0 (0%)
  memory             10517Mi (16%)  11Gi (18%)
  ephemeral-storage  0 (0%)         0 (0%)
  hugepages-1Gi      0 (0%)         0 (0%)
  hugepages-2Mi      0 (0%)         0 (0%)
  hugepages-32Mi     0 (0%)         0 (0%)
  hugepages-64Ki     0 (0%)         0 (0%)
Events:
  Type     Reason                   Age                From                   Message
  ----     ------                   ----               ----                   -------
  Normal   Starting                 10m                kube-proxy             
  Normal   Starting                 11m                kubelet                Starting kubelet.
  Warning  InvalidDiskCapacity      11m                kubelet                invalid capacity 0 on image filesystem
  Normal   NodeHasSufficientMemory  11m (x2 over 11m)  kubelet                Node ip-X-Y-16-107.region.compute.internal status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    11m (x2 over 11m)  kubelet                Node ip-X-Y-16-107.region.compute.internal status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     11m (x2 over 11m)  kubelet                Node ip-X-Y-16-107.region.compute.internal status is now: NodeHasSufficientPID
  Normal   NodeAllocatableEnforced  11m                kubelet                Updated Node Allocatable limit across pods
  Normal   Synced                   10m                cloud-node-controller  Node synced successfully
  Normal   RegisteredNode           10m                node-controller        Node ip-X-Y-16-107.region.compute.internal event: Registered Node ip-X-Y-16-107.region.compute.internal in Controller
  Normal   DeprovisioningBlocked    10m                karpenter              Cannot deprovision node due to machine is not initialized
  Normal   NodeReady                10m                kubelet                Node ip-X-Y-16-107.region.compute.internal status is now: NodeReady
  Normal   DeprovisioningBlocked    40s (x5 over 10m)  karpenter              Cannot deprovision node due to machine is nominated

If you look at the Non-terminated Pods block you'll see that the kube-scheduler has added pod-a and pod-b to the node, and they have consumed most of the node's capacity. This is the same node that was nominated for the pending pod, pod-c, but it will never schedule there because there's no space for it.
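
For concreteness, the CPU arithmetic from the Allocatable and Non-terminated Pods sections above:

  allocatable CPU:                      7810m
  requested by already-bound pods:      5510m + 2010m + 10m + 30m + 100m + 100m = 7760m
  remaining:                            7810m - 7760m = 50m
  pod-c request (pod-c + jaeger-agent): 500m + 10m = 510m, which cannot fit in the remaining 50m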

So here we'd expect Karpenter to re-evaluate and nominate a new node. We don't see this, though, and pod-c remains Pending indefinitely.

@gomesdigital
Author

Looking further, it seems like Karpenter is nominating a node that simply doesn't have enough space.

We're using a RuntimeClass on some of our pods, and it looks like Karpenter doesn't take this into account when analyzing pod specs:

https://github.com/aws/karpenter-core/blob/main/pkg/utils/resources/resources.go#L25

// RequestsForPods returns the total resources of a variadic list of podspecs.
func RequestsForPods(pods ...*v1.Pod) v1.ResourceList {
	var resources []v1.ResourceList
	for _, pod := range pods {
		resources = append(resources, Ceiling(pod).Requests)
	}
	merged := Merge(resources...)
	merged[v1.ResourcePods] = *resource.NewQuantity(int64(len(pods)), resource.DecimalExponent)
	return merged
}

https://github.com/aws/karpenter-core/blob/main/pkg/utils/resources/resources.go#L97

// Ceiling calculates the max between the sum of container resources and max of initContainers
func Ceiling(pod *v1.Pod) v1.ResourceRequirements {
	var resources v1.ResourceRequirements
	for _, container := range pod.Spec.Containers {
		resources.Requests = Merge(resources.Requests, MergeResourceLimitsIntoRequests(container))
		resources.Limits = Merge(resources.Limits, container.Resources.Limits)
	}
	for _, container := range pod.Spec.InitContainers {
		resources.Requests = MaxResources(resources.Requests, MergeResourceLimitsIntoRequests(container))
		resources.Limits = MaxResources(resources.Limits, container.Resources.Limits)
	}
	return resources
}
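
Neither function looks at pod.Spec.Overhead, the field the RuntimeClass admission controller populates from a RuntimeClass's overhead.podFixed and which kube-scheduler adds on top of the container requests. As a rough sketch of the missing accounting (a hypothetical helper placed in the same package as the Ceiling function above, not the actual change that landed in kubernetes-sigs/karpenter#449):

// CeilingWithOverhead is a hypothetical variant of Ceiling that also adds
// pod.Spec.Overhead, mirroring how kube-scheduler computes a pod's effective
// resource request when its RuntimeClass declares podFixed overhead.
func CeilingWithOverhead(pod *v1.Pod) v1.ResourceRequirements {
	resources := Ceiling(pod)
	if len(pod.Spec.Overhead) == 0 {
		return resources
	}
	if resources.Requests == nil {
		resources.Requests = v1.ResourceList{}
	}
	for name, overhead := range pod.Spec.Overhead {
		quantity := resources.Requests[name] // zero Quantity if the resource is absent
		quantity.Add(overhead)
		resources.Requests[name] = quantity
	}
	return resources
}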

@gomesdigital
Author

We've found a way to replicate the issue!

Add a RuntimeClass with an overhead that does not fit on any node:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: trouble
handler: runc
overhead:
  podFixed:
    cpu: '24'  # no nodes in the provisioner that are >= 24 cores

Attach it to the pod:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: trouble
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: 'trouble'
  template:
    metadata:
      labels:
        app: 'trouble'
    spec:
      enableServiceLinks: false
      containers:
        - name: trouble
          image: public.ecr.aws/nginx/nginx:stable
      runtimeClassName: trouble

Pod describe events:

Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  2m24s                default-scheduler  0/13 nodes are available: 1 node(s) had untolerated taint {monitoring: }, 10 Insufficient cpu, 2 node(s) had untolerated taint {eks.amazonaws.com/compute-type: fargate}. preemption: 0/13 nodes are available: 10 No preemption victims found for incoming pod, 3 Preemption is not helpful for scheduling..
  Normal   Nominated         13s (x2 over 2m23s)  karpenter          Pod should schedule on: machine/default-jn8lw, node/ip-x-y-1-78.region.compute.internal
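
The arithmetic behind this reproducer, assuming the provisioner above (which caps instance sizes at 2xlarge/4xlarge, i.e. at most 16 vCPUs for the allowed c/m/r families):

  effective CPU request seen by kube-scheduler = container requests + RuntimeClass overhead ~ 0 + 24 = 24 CPUs
  largest node the provisioner can launch      = 4xlarge = 16 vCPUs < 24 CPUs -> Insufficient cpu on every node

Karpenter's Ceiling ignores pod.Spec.Overhead, so it still believes the pod fits and keeps nominating nodes that the kube-scheduler will never bind it to.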

@tzneal
Contributor

tzneal commented Aug 8, 2023

Thanks for the reproducer; not sure how we missed it, as I'm fairly certain we used to handle this correctly. It's fixed, and I added a test to prevent it from recurring.

@levanlongktmt

@tzneal so weird, I got this issue today.
Two pods with this config were scheduled to run on the same node:

initContainers:
  - name: bootstrap-data
    resources:
      requests:
        cpu: 300m
        memory: 2Gi
      limits:
        cpu: 1
        memory: 3Gi
containers:
  - name: db
    image: mysql:8.0.36
    resources:
      requests:
        cpu: 300m
        memory: 2Gi
      limits:
        cpu: 1
        memory: 3Gi

Karpenter launched the node with these labels:

beta.kubernetes.io/arch=arm64
beta.kubernetes.io/instance-type=r7gd.medium
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=eu-central-1
failure-domain.beta.kubernetes.io/zone=eu-central-1b
k8s.io/cloud-provider-aws=b62f791862f0bcb708f69b3ccd553e14
karpenter.k8s.aws/instance-category=r
**karpenter.k8s.aws/instance-cpu=1**
karpenter.k8s.aws/instance-cpu-manufacturer=aws
karpenter.k8s.aws/instance-encryption-in-transit-supported=true
**karpenter.k8s.aws/instance-family=r7gd**
karpenter.k8s.aws/instance-generation=7
karpenter.k8s.aws/instance-hypervisor=nitro
karpenter.k8s.aws/instance-local-nvme=59
**karpenter.k8s.aws/instance-memory=8192**
karpenter.k8s.aws/instance-network-bandwidth=520
karpenter.k8s.aws/instance-size=medium
karpenter.sh/capacity-type=spot
karpenter.sh/initialized=true
karpenter.sh/registered=true
kubernetes.io/arch=arm64
kubernetes.io/hostname=i-04fe8d3d19deae784.eu-central-1.compute.internal
kubernetes.io/os=linux
node.kubernetes.io/instance-type=r7gd.medium

The node shows Cannot disrupt Node: Nominated for a pending pod, but both pods are stuck with Pod should schedule on: nodeclaim/database-dlxs6, node/i-04fe8d3d19deae784.eu-central-1.compute.internal

I'm using Karpenter v0.36.1

@levanlongktmt

levanlongktmt commented May 23, 2024

Update: it might be related to tolerateAllTaints being disabled (kubernetes-sigs/aws-ebs-csi-driver#1955 (comment)); after I removed tolerateAllTaints=false, all pods came up and ran.


My bad, it's because I have special taints on the pool, so the EBS CSI driver DaemonSet can't be deployed there.
