
Karpenter cannot provision node with same instance type previously used by cluster-autoscaler #5676

Closed
dgdevops opened this issue Feb 16, 2024 · 14 comments · Fixed by #5788

@dgdevops

dgdevops commented Feb 16, 2024

Description

Observed Behavior:
After migrating from cluster-autoscaler to Karpenter, Karpenter cannot provision a worker node with the same instance type (m6a.2xlarge) that cluster-autoscaler was configured for; provisioning fails with "no instance type which had enough resources and the required offering met the scheduling requirements".

Scenario 1)
When removing the instance-family requirement (m6a) from the NodePool configuration and setting an instance-memory requirement (32768), Karpenter provisions a node from the d and g instance categories with instance type d3en.2xlarge (based on the logs it also considers g4ad.2xlarge and g4dn.2xlarge), which has exactly the same amount of vCPU and memory as m6a.2xlarge.

Scenario 2)
When removing the instance-family requirement (m6a) from the NodePool configuration and keeping the instance-cpu requirement (8), Karpenter provisions a node from the r instance category with instance type r6a.2xlarge, which has the same vCPU count as m6a.2xlarge but double the memory (64GB).

If a worker node of instance type m6a.2xlarge is already provisioned and available in the cluster, the default-scheduler places the workload on it, which confirms that an m6a.2xlarge instance has enough capacity for our workload.

GitHub issue #1306 describes similar behaviour; the same setup was also tested with vmMemoryOverheadPercent set to 0.01 (down from the 0.075 default) with no improvement.

Temporary workarounds:

  1. Decrease the memory request of our workload by 2Gi to 26Gi, wait for Karpenter to provision a worker node with the m6a.2xlarge instance type, then revert the memory request change
  2. Extend the karpenter.k8s.aws/instance-cpu requirement list with "16" (sketched below) to give Karpenter the option to (over)provision a worker node with the m6a.4xlarge instance type
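
For reference, workaround 2 amounts to a requirement change along these lines (a sketch only; everything except the added "16" matches the NodePool configuration shared further down):

        - key: karpenter.k8s.aws/instance-cpu
          operator: In
          values:
            - '8'
            - '16'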

Expected Behavior:
Karpenter should be able to use the same instance type that our workload ran on when cluster-autoscaler was in use. Since the default Kubernetes scheduler can place the workload on a worker node of instance type m6a.2xlarge, Karpenter should not refuse to provision such a node.

Reproduction Steps (Please include YAML):

  1. Configure a NodePool with the details shared below
  2. Create a workload with the resource specifications shared below (toleration is required)
  3. Observe the Karpenter logs
  4. Extend the instance-cpu list with "16"
  5. Observe Karpenter provisioning a worker node with the m6a.4xlarge instance type instead of m6a.2xlarge, where the workload previously fit when cluster-autoscaler was in use

NodePool configuration:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: <omitted>-master
spec:
  disruption:
    consolidateAfter: 60s
    consolidationPolicy: WhenEmpty
    expireAfter: Never
  template:
    metadata:
      labels:
        <omitted>: <omitted>
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values:
            - m6a
        - key: karpenter.k8s.aws/instance-cpu
          operator: In
          values:
            - '8'
        - key: karpenter.k8s.aws/instance-hypervisor
          operator: In
          values:
            - nitro
        - key: topology.kubernetes.io/zone
          operator: In
          values:
            - eu-west-1a
            - eu-west-1b
            - eu-west-1c
        - key: kubernetes.io/arch
          operator: In
          values:
            - amd64
        - key: kubernetes.io/os
          operator: In
          values:
            - linux
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - on-demand
      taints:
        - effect: NoSchedule
          key: dedicated
          value: <omitted>-master

Versions:

  • Chart Version: v0.32.1
  • Kubernetes Version (kubectl version): EKS 1.25

Further information:

  • AWS Region: eu-west-1
  • OS: BottleRocket OS (also tested AL2, no improvements)
  • Every patch version from v0.32.1 to v0.32.7 was tested; the same behaviour was observed
  • We do not hit InsufficientInstanceCapacity errors: at the time of the failure, EC2 instances of type m6a.2xlarge could be provisioned manually without issue
  • Workload resource specifications:
resources:
  limits:
    memory: 28Gi
  requests:
    cpu: '6'
    memory: 28Gi
  • Karpenter global configuration:
  aws.assumeRoleARN: ''
  aws.assumeRoleDuration: 15m
  aws.clusterCABundle: ''
  aws.clusterEndpoint: https://<omitted>
  aws.clusterName: <omitted>
  aws.enableENILimitedPodDensity: 'true'
  aws.enablePodENI: 'false'
  aws.interruptionQueueName: ''
  aws.isolatedVPC: 'true'
  aws.vmMemoryOverheadPercent: '0.075'
  batchIdleDuration: 1s
  batchMaxDuration: 10s
  featureGates.driftEnabled: 'false'

To summarise the findings: according to Karpenter, some instance types cannot fit our workload while others can, even though they have the same vCPU and memory specifications.
Can you please help us understand Karpenter's logic behind the instance choice?

@dgdevops dgdevops added bug Something isn't working needs-triage Issues that need to be triaged labels Feb 16, 2024
@dgdevops dgdevops changed the title Karpenter cannot provision node for workloads with same instance type previously used by cluster-autoscaler Karpenter cannot provision node with same instance type previously used by cluster-autoscaler Feb 16, 2024
@jmdeal
Contributor

jmdeal commented Feb 16, 2024

Could you provide logs, your pod spec, and your EC2NodeClass? When I tried to reproduce with the provided NodePool and a pod with the same resource requests (the original 28Gi, not 26Gi), Karpenter successfully provisioned an m6a.2xlarge.

@jmdeal jmdeal removed the needs-triage Issues that need to be triaged label Feb 16, 2024
@dgdevops
Author

dgdevops commented Feb 16, 2024

Hello @jmdeal,
Thank you for your quick response.
Please find the requested outputs shared below:

Logs:
{"level":"ERROR","time":"2024-02-15T11:10:12.564Z","logger":"controller.provisioner","message":"Could not schedule pod, incompatible with nodepool \"<omitted>-master\", daemonset overhead={\"cpu\":\"480m\",\"memory\":\"812700032\",\"pods\":\"5\"}, no instance type satisfied resources {\"cpu\":\"6480m\",\"memory\":\"30877471104\",\"pods\":\"6\"} and requirements <omitted>-server-role In [<omitted>], karpenter.k8s.aws/instance-cpu In [8], karpenter.k8s.aws/instance-family In [m6a], karpenter.k8s.aws/instance-hypervisor In [nitro], karpenter.sh/capacity-type In [on-demand], karpenter.sh/nodepool In [<omitted>-master], kubernetes.io/arch In [amd64], kubernetes.io/os In [linux], topology.kubernetes.io/zone In [eu-west-1a] (no instance type which had enough resources and the required offering met the scheduling requirements); incompatible with nodepool \"<omitted>-slave\", daemonset overhead={\"cpu\":\"480m\",\"memory\":\"812700032\",\"pods\":\"5\"}, did not tolerate dedicated=<omitted>slaves:NoSchedule; incompatible with nodepool \"on-demand\", daemonset overhead={\"cpu\":\"480m\",\"memory\":\"812700032\",\"pods\":\"5\"}, did not tolerate lifecycle=OnDemand:NoSchedule; incompatible with nodepool \"spot\", daemonset overhead={\"cpu\":\"480m\",\"memory\":\"812700032\",\"pods\":\"5\"}, incompatible requirements, label \"<omitted>-server-role\" does not have known values","commit":"1072d3b","pod":"<omitted>/<omitted>-master-1"}

Pod specifications:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: <omitted>-master
  namespace: <omitted>
spec:
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/instance: <omitted>-master
      app.kubernetes.io/name: <omitted>-master
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: <omitted>-master
        app.kubernetes.io/name: <omitted>-master
        <omitted>-role: master
      annotations:
    spec:
      initContainers:
        - name: volume
          image: <omitted>
          command:
            - <omitted>
          resources:
            limits:
              memory: 256Mi
            requests:
              cpu: 250m
              memory: 256Mi
      containers:
        - name: <omitted>
          image: <omitted>
          resources:
            limits:
              memory: 28Gi
            requests:
              cpu: '6'
              memory: 28Gi
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values:
                      - eu-west-1a
                      - eu-west-1b
                      - eu-west-1c
                  - key: <omitted>-server-role
                    operator: In
                    values:
                      - <omitted>master
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app.kubernetes.io/instance: <omitted>-master
                  app.kubernetes.io/name: <omitted>-master
              topologyKey: kubernetes.io/hostname
      schedulerName: default-scheduler
      tolerations:
        - key: dedicated
          operator: Equal
          value: <omitted>master
          effect: NoSchedule

EC2NodeClass:

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
status:
  instanceProfile: <omitted>
  securityGroups:
    - id: <omitted>
      name: <omitted>
    - id: <omitted>
      name: <omitted>
    - id: <omitted>
      name: <omitted>
  subnets:
    - id: <omitted>
      zone: eu-west-1c
    - id: <omitted>
      zone: eu-west-1a
    - id: <omitted>
      zone: eu-west-1b
spec:
  amiFamily: Bottlerocket
  amiSelectorTerms:
    - id: ami-0c93c9f434e3d1c5e
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        deleteOnTermination: true
        encrypted: true
        kmsKeyID: >-
          arn:aws:kms:eu-west-1:<omitted>:key/<omitted>
        volumeSize: 20Gi
        volumeType: gp3
    - deviceName: /dev/xvdb
      ebs:
        deleteOnTermination: true
        encrypted: true
        kmsKeyID: >-
          arn:aws:kms:eu-west-1:<omitted>:key/<omitted>
        volumeSize: 50Gi
        volumeType: gp3
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 1
    httpTokens: optional
  role: <omitted>
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: <omitted>
    - id: <omitted>
  subnetSelectorTerms:
    - id: <omitted>
    - id: <omitted>
    - id: <omitted>

@jonathan-innis
Contributor

From doing the conversion and looking at Karpenter's defaults, it looks like the DaemonSet requests are definitely pushing y'all over the limit of what Karpenter thinks the instance type can provide. Just doing the conversion:

30877471104 bytes ≈ 29447Mi > 29317Mi
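
(For reference, a rough sketch of that conversion in Python; the 32 GiB figure for m6a.2xlarge and the 0.075 default overhead are assumptions used for illustration.)

requested_bytes = 30877471104                 # pod + daemonset memory from the log
requested_mi = requested_bytes / (1024 ** 2)  # ~29447 Mi

reported_mi = 32 * 1024                       # 32768 Mi for m6a.2xlarge per DescribeInstanceTypes
after_overhead = reported_mi * (1 - 0.075)    # ~30310 Mi after the default vmMemoryOverheadPercent
# kubeReserved and the eviction threshold shave off a bit more,
# landing at the ~29317 Mi that 29447 Mi no longer fits into.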

Were the logs that you printed above from your setup with vmMemoryOverheadPercent set to 0.01 or from it being set to 0.075?

@jonathan-innis
Contributor

jonathan-innis commented Feb 16, 2024

Just from trying to repro this scenario, when I dropped the VM_MEMORY_OVERHEAD_PERCENT environment variable down to 0, I was able to get a successful node launch.

From looking over your configuration, you need to set --set settings.vmMemoryOverheadPercent=0.001 if you are planning to override the vmMemoryOverheadPercent in v0.32.x. We had to figure out some way to pass through the defaults while still respecting the new values as overrides for the old values, so here's the current behavior:

  1. settings.vmMemoryOverheadPercent is set by default to 0.075
  2. settings.vmMemoryOverheadPercent overrides any value set in settings.aws.vmMemoryOverheadPercent
  3. If you only override settings.aws.vmMemoryOverheadPercent, the default value of settings.vmMemoryOverheadPercent is still maintained, so overriding only the old key will not actually change anything.

Also, if you are trying to pass a "0" value through to vmMemoryOverheadPercent, helm has a weird quirk where it won't allow the value to pass through as non-null unless the value is a string (when the value is 0) so you can use --set-string settings.vmMemoryOverheadPercent=0 for that case.
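
As an illustration (assuming the chart is installed as a release named karpenter in the karpenter namespace; adjust names and chart source to your setup), the override would look something like:

helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace karpenter \
  --set settings.vmMemoryOverheadPercent=0.001

# passing a literal zero requires --set-string so helm keeps the value non-null
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace karpenter \
  --set-string settings.vmMemoryOverheadPercent=0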

@jonathan-innis jonathan-innis added question Further information is requested and removed bug Something isn't working labels Feb 16, 2024
@jonathan-innis
Contributor

Separate from that, there is an issue here: #4450 and a PR here: #4517 that we're tracking to improve the way we do instance memory discovery moving forward.

@dgdevops
Author

dgdevops commented Feb 16, 2024

Hello @jonathan-innis,
Thank you for the update.
I have updated the aws.vmMemoryOverheadPercent parameter in Karpenter's global configuration to 0.001; here are the logs from Karpenter:

{"level":"ERROR","time":"2024-02-16T20:53:52.412Z","logger":"controller.provisioner","message":"Could not schedule pod, incompatible with nodepool \"spot\", daemonset overhead={\"cpu\":\"480m\",\"memory\":\"812700032\",\"pods\":\"5\"}, incompatible requirements, label \"<omitted>-server-role\" does not have known values; incompatible with nodepool \"<omitted>-master\", daemonset overhead={\"cpu\":\"480m\",\"memory\":\"812700032\",\"pods\":\"5\"}, no instance type satisfied resources {\"cpu\":\"6480m\",\"memory\":\"30877471104\",\"pods\":\"6\"} and requirements <omitted>-server-role In [<omitted>], karpenter.k8s.aws/instance-cpu In [8], karpenter.k8s.aws/instance-family In [m6a], karpenter.k8s.aws/instance-hypervisor In [nitro], karpenter.sh/capacity-type In [on-demand], karpenter.sh/nodepool In [<omitted>-master], kubernetes.io/arch In [amd64], kubernetes.io/os In [linux], topology.kubernetes.io/zone In [eu-west-1b] (no instance type which had enough resources and the required offering met the scheduling requirements); incompatible with nodepool \"<omitted>-slave\", daemonset overhead={\"cpu\":\"480m\",\"memory\":\"812700032\",\"pods\":\"5\"}, did not tolerate dedicated=<omitted>:NoSchedule; incompatible with nodepool \"on-demand\", daemonset overhead={\"cpu\":\"480m\",\"memory\":\"812700032\",\"pods\":\"5\"}, did not tolerate lifecycle=OnDemand:NoSchedule","commit":"1072d3b","pod":"<omitted>/<omitted>-master-2"}

Ideally the worker node should be provisioned by the master NodePool (instance-family: m6a, instance-cpu: "8")

@jonathan-innis
Contributor

Sorry, did you update aws.vmMemoryOverheadPercent or settings.vmMemoryOverheadPercent? You need to update the latter.

@dgdevops
Author

Hello @jonathan-innis,
In our case Karpenter's global settings currently include the aws.vmMemoryOverheadPercent parameter, so I conducted the tests by editing that variable; I will now test Karpenter's behaviour after editing settings.vmMemoryOverheadPercent.
Thank you

@dgdevops
Author

dgdevops commented Feb 17, 2024

Hello @jonathan-innis,
I have set the settings.vmMemoryOverheadPercent parameter to 0.001 in the global settings and I see the following logs from Karpenter:

{"level":"ERROR","time":"2024-02-17T10:42:10.494Z","logger":"controller.provisioner","message":"Could not schedule pod, incompatible with nodepool \"<omitted>-slave\", daemonset overhead={\"cpu\":\"480m\",\"memory\":\"812700032\",\"pods\":\"5\"}, did not tolerate dedicated=<omitted>slaves:NoSchedule; incompatible with nodepool \"on-demand\", daemonset overhead={\"cpu\":\"480m\",\"memory\":\"812700032\",\"pods\":\"5\"}, did not tolerate lifecycle=OnDemand:NoSchedule; incompatible with nodepool \"spot\", daemonset overhead={\"cpu\":\"480m\",\"memory\":\"812700032\",\"pods\":\"5\"}, incompatible requirements, label \"<omitted>-server-role\" does not have known values; incompatible with nodepool \"<omitted>-master\", daemonset overhead={\"cpu\":\"480m\",\"memory\":\"812700032\",\"pods\":\"5\"}, no instance type satisfied resources {\"cpu\":\"6480m\",\"memory\":\"30877471104\",\"pods\":\"6\"} and requirements <omitted>-server-role In [<omitted>master], karpenter.k8s.aws/instance-cpu In [8], karpenter.k8s.aws/instance-family In [m6a], karpenter.k8s.aws/instance-hypervisor In [nitro], karpenter.sh/capacity-type In [on-demand], karpenter.sh/nodepool In [<omitted>-master], kubernetes.io/arch In [amd64], kubernetes.io/os In [linux], topology.kubernetes.io/zone In [eu-west-1b] (no instance type which had enough resources and the required offering met the scheduling requirements)","commit":"1072d3b","pod":"<omitted>/<omitted>-master-2"}

Here is the global configuration:

aws.assumeRoleARN: ''
aws.assumeRoleDuration: 15m
aws.clusterCABundle: ''
aws.clusterEndpoint: https://<omitted>
aws.clusterName: <omitted>
aws.enableENILimitedPodDensity: 'true'
aws.enablePodENI: 'false'
aws.interruptionQueueName: ''
aws.isolatedVPC: 'true'
aws.vmMemoryOverheadPercent: '0.075'
batchIdleDuration: 1s
batchMaxDuration: 10s
featureGates.driftEnabled: 'false'
vmMemoryOverheadPercent: '0.001'

Interestingly, the moment I first changed settings.vmMemoryOverheadPercent to 0.001 and restarted Karpenter, it provisioned an m6a.2xlarge instance for the workload. Afterwards I cordoned and drained this new node to see if Karpenter could schedule a replacement worker node, but the pods stayed in a Pending state and I saw the logs shared above. I have also tried setting settings.vmMemoryOverheadPercent to 0, but no node has been provisioned by Karpenter since then.

@dgdevops
Author

dgdevops commented Feb 19, 2024

Hello @jonathan-innis & @jmdeal,
I have continued testing and found the reason why Karpenter could not provision the worker node.
In Karpenter's global configuration ConfigMap the value of settings.vmMemoryOverheadPercent was set to 0.001, but the Karpenter deployment also has a VM_MEMORY_OVERHEAD_PERCENT environment variable, which was still set to the default 0.075. The environment variable and the ConfigMap value are kept in sync when the Helm chart is used, but that was not the case for us because we deploy with Kustomize.
After changing the environment variable's value to 0.001, Karpenter was able to provision the m6a.2xlarge instance type. I also tested 0.05 and 0.07; they worked fine as well.
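
For anyone else deploying with Kustomize, a hypothetical strategic-merge patch for this looks roughly as follows (the deployment, namespace, and container names are assumptions based on the default chart layout):

# patch-vm-memory-overhead.yaml — referenced from kustomization.yaml under patches
apiVersion: apps/v1
kind: Deployment
metadata:
  name: karpenter
  namespace: karpenter
spec:
  template:
    spec:
      containers:
        - name: controller
          env:
            - name: VM_MEMORY_OVERHEAD_PERCENT
              value: "0.001"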

This leaves us with the remaining question of why Karpenter (with the default settings, without modifying the memory overhead parameter) considers the m6a.2xlarge instance type unsuitable for the workload and chooses d3en.2xlarge instead.

Additionally, is there any detailed documentation about the memory overhead and its usage?
Is the value used only to calculate the proper instance type, or does it have any implications for actual memory reservation?

@jmdeal
Contributor

jmdeal commented Feb 19, 2024

The difference between the d3en.2xlarge and m6a.2xlarge memory comes down to some implicit kubeReserved calculations. I believe these are consistent across all currently supported AMI families: if kubeReserved.memory is not set in the NodePool, it defaults to 11 * max_pods + 255 (Mi). This isn't explicitly documented by Karpenter since it's an AMI default, but it probably should be, since the docs suggest that when it's unset the kubelet's defaults are used; that isn't the case for any of the supported AMI families.
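
A rough sketch of how this plays out for m6a.2xlarge (assumed values: 32 GiB reported by EC2, an ENI-limited max-pods of 58, the default 100Mi hard-eviction threshold, no system-reserved memory, and the default vmMemoryOverheadPercent of 0.075):

reported = 32 * 1024                     # 32768 Mi from DescribeInstanceTypes
after_overhead = reported * (1 - 0.075)  # ~30310 Mi after vmMemoryOverheadPercent
kube_reserved = 11 * 58 + 255            # 893 Mi, from 11*max_pods + 255
eviction = 100                           # assumed default hard-eviction threshold
allocatable = after_overhead - kube_reserved - eviction   # ~29317 Mi

requested = 30877471104 / (1024 ** 2)    # ~29447 Mi (pod + daemonset requests)
print(requested <= allocatable)          # False: 29447 Mi does not fit into 29317 Mi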

@dgdevops
Author

dgdevops commented Feb 21, 2024

@jonathan-innis & @jmdeal,
Thank you for all the details provided.
Is the value of VM_MEMORY_OVERHEAD_PERCENT used by Karpenter only to decide the proper instance type, or does it have any influence on the actual memory reservation?

@jonathan-innis
Contributor

any influence on the actual memory reservation

It is the percentage value that we lop off the top from the memory value that's presented from EC2 DescribeInstanceTypes. The issue here is that the capacity value that's presented by the kubelet is different from the memory presented by the DescribeInstanceTypes call. This is because there is some memory that is taken away and dedicated to the OS and some other memory that is utilized by the hypervisor. The percentage amount taken away from instance types is unfortunately not a smooth curve and I really think that the proper way to address this problem is just to generate data around it. If you take a look here: https://github.com/aws/karpenter-provider-aws/blob/33450d8f82ded870ce65fbde3cec14dbb2c04f50/pkg/providers/instancetype/zz_generated.memory_overhead.go you can see that I took a first-pass at trying to get the overhead details by launching nodes with instance types and then checking the difference between the DescribeInstanceType reported value and the actual node-reported value.

@dgdevops
Author

dgdevops commented Feb 27, 2024

Hello @jonathan-innis,
Thank you for your input.
I see the auto-discovery feature requested in #5161; it is likely to face the same challenges. From my point of view we can close this issue as the doubts have been cleared. Thank you for all the help, everyone.
