
Karpenter cannot provision node with same instance type previously used by cluster-autoscaler #5676

Closed
dgdevops opened this issue Feb 16, 2024 · 14 comments · Fixed by #5788

@dgdevops

dgdevops commented Feb 16, 2024

Description

Observed Behavior:
After migrating from cluster-autoscaler to Karpenter, Karpenter cannot provision a worker node with the same instance type (m6a.2xlarge) that cluster-autoscaler was configured for; provisioning fails with "no instance type which had enough resources and the required offering met the scheduling requirements".

Scenario 1)
When removing the instance-family requirement (m6a) from the NodePool configuration and setting an instance-memory requirement (32768), Karpenter provisions a node from the d and g instance categories with instance type d3en.2xlarge (based on the logs it also considers g4ad.2xlarge and g4dn.2xlarge), which has exactly the same amount of vCPU and memory as m6a.2xlarge.

Scenario 2)
When removing the instance-family requirement (m6a) from the NodePool configuration and keeping the instance-cpu requirement (8), Karpenter provisions a node from the r instance category with instance type r6a.2xlarge, which has the same vCPU count as m6a.2xlarge but double the memory (64GB).

If a worker node of instance type m6a.2xlarge is already provisioned and available in the cluster, the default-scheduler places the workload on it, which confirms that an m6a.2xlarge instance has enough capacity for our workload.

GitHub issue #1306 describes similar behaviour; the same setup was also tested with vmMemoryOverheadPercent set to 0.01 (down from the 0.075 default) with no improvement.

Temporary workarounds:

  1. Decrease the memory request of our workload by 2Gi to 26Gi, wait for Karpenter to provision a worker node with the m6a.2xlarge instance type, then revert the memory request change
  2. Extend the karpenter.k8s.aws/instance-cpu requirement list with "16" (sketched below) to give Karpenter the option to (over)provision a worker node with the m6a.4xlarge instance type
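
For reference, workaround 2 amounts to a requirement change along these lines (a sketch only; everything except the added "16" matches the NodePool configuration shared further down):

        - key: karpenter.k8s.aws/instance-cpu
          operator: In
          values:
            - '8'
            - '16'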

Expected Behavior:
Karpenter should be able to use the same instance type that our workload ran on when cluster-autoscaler was in use. Since the default Kubernetes scheduler can place the workload on a worker node of instance type m6a.2xlarge, Karpenter should not refuse to provision such a node.

Reproduction Steps (Please include YAML):

  1. Configure a NodePool with the details shared below
  2. Create a workload with the resource specifications shared below (toleration is required)
  3. Observe the Karpenter logs
  4. Extend the instance-cpu list with "16"
  5. Observe Karpenter provisioning a worker node with the m6a.4xlarge instance type instead of m6a.2xlarge, where the workload previously fit when cluster-autoscaler was in use

NodePool configuration:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: <omitted>-master
spec:
  disruption:
    consolidateAfter: 60s
    consolidationPolicy: WhenEmpty
    expireAfter: Never
  template:
    metadata:
      labels:
        <omitted>: <omitted>
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values:
            - m6a
        - key: karpenter.k8s.aws/instance-cpu
          operator: In
          values:
            - '8'
        - key: karpenter.k8s.aws/instance-hypervisor
          operator: In
          values:
            - nitro
        - key: topology.kubernetes.io/zone
          operator: In
          values:
            - eu-west-1a
            - eu-west-1b
            - eu-west-1c
        - key: kubernetes.io/arch
          operator: In
          values:
            - amd64
        - key: kubernetes.io/os
          operator: In
          values:
            - linux
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - on-demand
      taints:
        - effect: NoSchedule
          key: dedicated
          value: <omitted>-master

Versions:

  • Chart Version: v0.32.1
  • Kubernetes Version (kubectl version): EKS 1.25

Further information:

  • AWS Region: eu-west-1
  • OS: BottleRocket OS (also tested AL2, no improvements)
  • Every patch version from v0.32.1 to v0.32.7 was tested; the same behaviour was observed
  • We do not hit InsufficientInstanceCapacity errors: at the time of the failure, EC2 instances of type m6a.2xlarge could be provisioned manually without issue
  • Workload resource specifications:
resources:
  limits:
    memory: 28Gi
  requests:
    cpu: '6'
    memory: 28Gi
  • Karpenter global configuration:
  aws.assumeRoleARN: ''
  aws.assumeRoleDuration: 15m
  aws.clusterCABundle: ''
  aws.clusterEndpoint: https://<omitted>
  aws.clusterName: <omitted>
  aws.enableENILimitedPodDensity: 'true'
  aws.enablePodENI: 'false'
  aws.interruptionQueueName: ''
  aws.isolatedVPC: 'true'
  aws.vmMemoryOverheadPercent: '0.075'
  batchIdleDuration: 1s
  batchMaxDuration: 10s
  featureGates.driftEnabled: 'false'

To summarise the findings: according to Karpenter, some instance types cannot fit our workload while others can, even though they have the same vCPU and memory specifications.
Can you please help us understand Karpenter's logic behind the instance choice?

@dgdevops dgdevops added bug Something isn't working needs-triage Issues that need to be triaged labels Feb 16, 2024
@dgdevops dgdevops changed the title Karpenter cannot provision node for workloads with same instance type previously used by cluster-autoscaler Karpenter cannot provision node with same instance type previously used by cluster-autoscaler Feb 16, 2024
@jmdeal
Contributor

jmdeal commented Feb 16, 2024

Could you provide logs, your pod spec, and your EC2NodeClass? When I tried to reproduce with the provided NodePool and a pod with the same resource requests (the original 28Gi, not 26Gi), Karpenter successfully provisioned an m6a.2xlarge.

@jmdeal jmdeal removed the needs-triage Issues that need to be triaged label Feb 16, 2024
@dgdevops
Author

dgdevops commented Feb 16, 2024

Hello @jmdeal,
Thank you for your quick response.
Please find the requested outputs shared below:

Logs:
{"level":"ERROR","time":"2024-02-15T11:10:12.564Z","logger":"controller.provisioner","message":"Could not schedule pod, incompatible with nodepool \"<omitted>-master\", daemonset overhead={\"cpu\":\"480m\",\"memory\":\"812700032\",\"pods\":\"5\"}, no instance type satisfied resources {\"cpu\":\"6480m\",\"memory\":\"30877471104\",\"pods\":\"6\"} and requirements <omitted>-server-role In [<omitted>], karpenter.k8s.aws/instance-cpu In [8], karpenter.k8s.aws/instance-family In [m6a], karpenter.k8s.aws/instance-hypervisor In [nitro], karpenter.sh/capacity-type In [on-demand], karpenter.sh/nodepool In [<omitted>-master], kubernetes.io/arch In [amd64], kubernetes.io/os In [linux], topology.kubernetes.io/zone In [eu-west-1a] (no instance type which had enough resources and the required offering met the scheduling requirements); incompatible with nodepool \"<omitted>-slave\", daemonset overhead={\"cpu\":\"480m\",\"memory\":\"812700032\",\"pods\":\"5\"}, did not tolerate dedicated=<omitted>slaves:NoSchedule; incompatible with nodepool \"on-demand\", daemonset overhead={\"cpu\":\"480m\",\"memory\":\"812700032\",\"pods\":\"5\"}, did not tolerate lifecycle=OnDemand:NoSchedule; incompatible with nodepool \"spot\", daemonset overhead={\"cpu\":\"480m\",\"memory\":\"812700032\",\"pods\":\"5\"}, incompatible requirements, label \"<omitted>-server-role\" does not have known values","commit":"1072d3b","pod":"<omitted>/<omitted>-master-1"}

Pod specifications:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: <omitted>-master
  namespace: <omitted>
spec:
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/instance: <omitted>-master
      app.kubernetes.io/name: <omitted>-master
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: <omitted>-master
        app.kubernetes.io/name: <omitted>-master
        <omitted>-role: master
      annotations:
    spec:
      initContainers:
        - name: volume
          image: <omitted>
          command:
            - <omitted>
          resources:
            limits:
              memory: 256Mi
            requests:
              cpu: 250m
              memory: 256Mi
      containers:
        - name: <omitted>
          image: <omitted>
          resources:
            limits:
              memory: 28Gi
            requests:
              cpu: '6'
              memory: 28Gi
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values:
                      - eu-west-1a
                      - eu-west-1b
                      - eu-west-1c
                  - key: <omitted>-server-role
                    operator: In
                    values:
                      - <omitted>master
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app.kubernetes.io/instance: <omitted>-master
                  app.kubernetes.io/name: <omitted>-master
              topologyKey: kubernetes.io/hostname
      schedulerName: default-scheduler
      tolerations:
        - key: dedicated
          operator: Equal
          value: <omitted>master
          effect: NoSchedule

EC2NodeClass:

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
status:
  instanceProfile: <omitted>
  securityGroups:
    - id: <omitted>
      name: <omitted>
    - id: <omitted>
      name: <omitted>
    - id: <omitted>
      name: <omitted>
  subnets:
    - id: <omitted>
      zone: eu-west-1c
    - id: <omitted>
      zone: eu-west-1a
    - id: <omitted>
      zone: eu-west-1b
spec:
  amiFamily: Bottlerocket
  amiSelectorTerms:
    - id: ami-0c93c9f434e3d1c5e
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        deleteOnTermination: true
        encrypted: true
        kmsKeyID: >-
          arn:aws:kms:eu-west-1:<omitted>:key/<omitted>
        volumeSize: 20Gi
        volumeType: gp3
    - deviceName: /dev/xvdb
      ebs:
        deleteOnTermination: true
        encrypted: true
        kmsKeyID: >-
          arn:aws:kms:eu-west-1:<omitted>:key/<omitted>
        volumeSize: 50Gi
        volumeType: gp3
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 1
    httpTokens: optional
  role: <omitted>
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: <omitted>
    - id: <omitted>
  subnetSelectorTerms:
    - id: <omitted>
    - id: <omitted>
    - id: <omitted>

@jonathan-innis
Contributor

From doing the conversion and looking at Karpenter's defaults, it looks like the DaemonSet requests are definitely pushing y'all over the limit of what Karpenter thinks the instance type can provide. Just doing the conversion:

30877471104 bytes ≈ 29447Mi > 29317Mi
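
(For reference, a rough sketch of that conversion in Python; the 32 GiB figure for m6a.2xlarge and the 0.075 default overhead are assumptions used for illustration.)

requested_bytes = 30877471104                 # pod + daemonset memory from the log
requested_mi = requested_bytes / (1024 ** 2)  # ~29447 Mi

reported_mi = 32 * 1024                       # 32768 Mi for m6a.2xlarge per DescribeInstanceTypes
after_overhead = reported_mi * (1 - 0.075)    # ~30310 Mi after the default vmMemoryOverheadPercent
# kubeReserved and the eviction threshold shave off a bit more,
# landing at the ~29317 Mi that 29447 Mi no longer fits into.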

Were the logs that you printed above from your setup with vmMemoryOverheadPercent set to 0.01 or from it being set to 0.075?

@jonathan-innis
Contributor

jonathan-innis commented Feb 16, 2024

Just from trying to repro this scenario, when I dropped the VM_MEMORY_OVERHEAD_PERCENT environment variable down to 0, I was able to get a successful node launch.

From looking over your configuration, you need to set --set settings.vmMemoryOverheadPercent=0.001 if you are planning to override the vmMemoryOverheadPercent in v0.32.x. We had to figure out some way to pass through the defaults while still respecting the new values as overrides for the old values, so here's the current behavior:

  1. settings.vmMemoryOverheadPercent is set by default to 0.075
  2. settings.vmMemoryOverheadPercent overrides any value set in settings.aws.vmMemoryOverheadPercent
  3. If you only override settings.aws.vmMemoryOverheadPercent, the default value of settings.vmMemoryOverheadPercent is still maintained, so overriding only the old key will not actually change anything.

Also, if you are trying to pass a "0" value through to vmMemoryOverheadPercent, helm has a weird quirk where it won't allow the value to pass through as non-null unless the value is a string (when the value is 0) so you can use --set-string settings.vmMemoryOverheadPercent=0 for that case.
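
As an illustration (assuming the chart is installed as a release named karpenter in the karpenter namespace; adjust names and chart source to your setup), the override would look something like:

helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace karpenter \
  --set settings.vmMemoryOverheadPercent=0.001

# passing a literal zero requires --set-string so helm keeps the value non-null
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace karpenter \
  --set-string settings.vmMemoryOverheadPercent=0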

@jonathan-innis jonathan-innis added question Further information is requested and removed bug Something isn't working labels Feb 16, 2024
@jonathan-innis
Contributor

Separate from that, there is an issue here: #4450 and a PR here: #4517 that we're tracking to improve the way we do instance memory discovery moving forward.

@dgdevops
Author

dgdevops commented Feb 16, 2024

Hello @jonathan-innis,
Thank you for the update.
I have updated the aws.vmMemoryOverheadPercent parameter in Karpenter's global configuration to 0.001; here are the logs from Karpenter:

{"level":"ERROR","time":"2024-02-16T20:53:52.412Z","logger":"controller.provisioner","message":"Could not schedule pod, incompatible with nodepool \"spot\", daemonset overhead={\"cpu\":\"480m\",\"memory\":\"812700032\",\"pods\":\"5\"}, incompatible requirements, label \"<omitted>-server-role\" does not have known values; incompatible with nodepool \"<omitted>-master\", daemonset overhead={\"cpu\":\"480m\",\"memory\":\"812700032\",\"pods\":\"5\"}, no instance type satisfied resources {\"cpu\":\"6480m\",\"memory\":\"30877471104\",\"pods\":\"6\"} and requirements <omitted>-server-role In [<omitted>], karpenter.k8s.aws/instance-cpu In [8], karpenter.k8s.aws/instance-family In [m6a], karpenter.k8s.aws/instance-hypervisor In [nitro], karpenter.sh/capacity-type In [on-demand], karpenter.sh/nodepool In [<omitted>-master], kubernetes.io/arch In [amd64], kubernetes.io/os In [linux], topology.kubernetes.io/zone In [eu-west-1b] (no instance type which had enough resources and the required offering met the scheduling requirements); incompatible with nodepool \"<omitted>-slave\", daemonset overhead={\"cpu\":\"480m\",\"memory\":\"812700032\",\"pods\":\"5\"}, did not tolerate dedicated=<omitted>:NoSchedule; incompatible with nodepool \"on-demand\", daemonset overhead={\"cpu\":\"480m\",\"memory\":\"812700032\",\"pods\":\"5\"}, did not tolerate lifecycle=OnDemand:NoSchedule","commit":"1072d3b","pod":"<omitted>/<omitted>-master-2"}

Ideally the worker node should be provisioned by the master NodePool (instance-family: m6a, instance-cpu: "8")

@jonathan-innis
Contributor

Sorry, did you update aws.vmMemoryOverheadPercent or settings.vmMemoryOverheadPercent? You need to update the latter.

@dgdevops
Author

Hello @jonathan-innis,
In our case Karpenter's global settings currently include the aws.vmMemoryOverheadPercent parameter, so I conducted the tests by editing that variable; I will now test Karpenter's behaviour after editing settings.vmMemoryOverheadPercent.
Thank you

@dgdevops
Author

dgdevops commented Feb 17, 2024

Hello @jonathan-innis,
I have set the settings.vmMemoryOverheadPercent parameter to 0.001 in the global settings and I see the following logs from Karpenter:

{"level":"ERROR","time":"2024-02-17T10:42:10.494Z","logger":"controller.provisioner","message":"Could not schedule pod, incompatible with nodepool \"<omitted>-slave\", daemonset overhead={\"cpu\":\"480m\",\"memory\":\"812700032\",\"pods\":\"5\"}, did not tolerate dedicated=<omitted>slaves:NoSchedule; incompatible with nodepool \"on-demand\", daemonset overhead={\"cpu\":\"480m\",\"memory\":\"812700032\",\"pods\":\"5\"}, did not tolerate lifecycle=OnDemand:NoSchedule; incompatible with nodepool \"spot\", daemonset overhead={\"cpu\":\"480m\",\"memory\":\"812700032\",\"pods\":\"5\"}, incompatible requirements, label \"<omitted>-server-role\" does not have known values; incompatible with nodepool \"<omitted>-master\", daemonset overhead={\"cpu\":\"480m\",\"memory\":\"812700032\",\"pods\":\"5\"}, no instance type satisfied resources {\"cpu\":\"6480m\",\"memory\":\"30877471104\",\"pods\":\"6\"} and requirements <omitted>-server-role In [<omitted>master], karpenter.k8s.aws/instance-cpu In [8], karpenter.k8s.aws/instance-family In [m6a], karpenter.k8s.aws/instance-hypervisor In [nitro], karpenter.sh/capacity-type In [on-demand], karpenter.sh/nodepool In [<omitted>-master], kubernetes.io/arch In [amd64], kubernetes.io/os In [linux], topology.kubernetes.io/zone In [eu-west-1b] (no instance type which had enough resources and the required offering met the scheduling requirements)","commit":"1072d3b","pod":"<omitted>/<omitted>-master-2"}

Here is the global configuration:

aws.assumeRoleARN: ''
aws.assumeRoleDuration: 15m
aws.clusterCABundle: ''
aws.clusterEndpoint: https://<omitted>
aws.clusterName: <omitted>
aws.enableENILimitedPodDensity: 'true'
aws.enablePodENI: 'false'
aws.interruptionQueueName: ''
aws.isolatedVPC: 'true'
aws.vmMemoryOverheadPercent: '0.075'
batchIdleDuration: 1s
batchMaxDuration: 10s
featureGates.driftEnabled: 'false'
vmMemoryOverheadPercent: '0.001'

Interestingly, the moment I first changed settings.vmMemoryOverheadPercent to 0.001 and restarted Karpenter, it provisioned an m6a.2xlarge instance for the workload. Afterwards I cordoned and drained this new node to see if Karpenter could schedule a replacement worker node, but the pods stayed in a Pending state and I saw the logs shared above. I have also tried setting settings.vmMemoryOverheadPercent to 0, but no node has been provisioned by Karpenter since then.

@dgdevops
Author

dgdevops commented Feb 19, 2024

Hello @jonathan-innis & @jmdeal,
I have continued testing and found the reason why Karpenter could not provision the worker node.
In Karpenter's global configuration ConfigMap the value of settings.vmMemoryOverheadPercent was set to 0.001, but the Karpenter deployment also has a VM_MEMORY_OVERHEAD_PERCENT environment variable, which was still set to the default 0.075. The environment variable and the ConfigMap value are kept in sync when the Helm chart is used, but that was not the case for us because we deploy with Kustomize.
After changing the environment variable's value to 0.001, Karpenter was able to provision the m6a.2xlarge instance type. I also tested 0.05 and 0.07; they worked fine as well.
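
For anyone else deploying with Kustomize, a hypothetical strategic-merge patch for this looks roughly as follows (the deployment, namespace, and container names are assumptions based on the default chart layout):

# patch-vm-memory-overhead.yaml — referenced from kustomization.yaml under patches
apiVersion: apps/v1
kind: Deployment
metadata:
  name: karpenter
  namespace: karpenter
spec:
  template:
    spec:
      containers:
        - name: controller
          env:
            - name: VM_MEMORY_OVERHEAD_PERCENT
              value: "0.001"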

This leaves us with the remaining question of why Karpenter (with the default settings, without modifying the memory overhead parameter) considers the m6a.2xlarge instance type unsuitable for the workload and chooses d3en.2xlarge instead.

Additionally, is there any detailed documentation about the memory overhead and its usage?
Is the value used only to calculate the proper instance type, or does it have any implications for actual memory reservation?

@jmdeal
Contributor

jmdeal commented Feb 19, 2024

The difference between the d3en.2xlarge and m6a.2xlarge memory comes down to some implicit kubeReserved calculations. I believe these are consistent across all currently supported AMI families: if kubeReserved.memory is not set in the NodePool, it defaults to 11 * max_pods + 255 (Mi). This isn't explicitly documented by Karpenter since it's an AMI default, but it probably should be, since the docs suggest that when it's unset the kubelet's defaults are used; that isn't the case for any of the supported AMI families.
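
A rough sketch of how this plays out for m6a.2xlarge (assumed values: 32 GiB reported by EC2, an ENI-limited max-pods of 58, the default 100Mi hard-eviction threshold, no system-reserved memory, and the default vmMemoryOverheadPercent of 0.075):

reported = 32 * 1024                     # 32768 Mi from DescribeInstanceTypes
after_overhead = reported * (1 - 0.075)  # ~30310 Mi after vmMemoryOverheadPercent
kube_reserved = 11 * 58 + 255            # 893 Mi, from 11*max_pods + 255
eviction = 100                           # assumed default hard-eviction threshold
allocatable = after_overhead - kube_reserved - eviction   # ~29317 Mi

requested = 30877471104 / (1024 ** 2)    # ~29447 Mi (pod + daemonset requests)
print(requested <= allocatable)          # False: 29447 Mi does not fit into 29317 Mi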

@dgdevops
Author

dgdevops commented Feb 21, 2024

@jonathan-innis & @jmdeal,
Thank you for all the details provided.
Is the value of VM_MEMORY_OVERHEAD_PERCENT used by Karpenter only to decide the proper instance type, or does it have any influence on the actual memory reservation?

@jonathan-innis
Contributor

any influence on the actual memory reservation

It is the percentage value that we lop off the top from the memory value that's presented from EC2 DescribeInstanceTypes. The issue here is that the capacity value that's presented by the kubelet is different from the memory presented by the DescribeInstanceTypes call. This is because there is some memory that is taken away and dedicated to the OS and some other memory that is utilized by the hypervisor. The percentage amount taken away from instance types is unfortunately not a smooth curve and I really think that the proper way to address this problem is just to generate data around it. If you take a look here: https://github.com/aws/karpenter-provider-aws/blob/33450d8f82ded870ce65fbde3cec14dbb2c04f50/pkg/providers/instancetype/zz_generated.memory_overhead.go you can see that I took a first-pass at trying to get the overhead details by launching nodes with instance types and then checking the difference between the DescribeInstanceType reported value and the actual node-reported value.

@dgdevops
Author

dgdevops commented Feb 27, 2024

Hello @jonathan-innis,
Thank you for your input.
I see the auto-discovery feature requested in #5161; it is likely to face the same challenges. From my point of view we can close this issue as the doubts have been cleared. Thank you for all the help, everyone.
