
Karpenter is nominating a machine that does not have enough memory #4450

Closed
ckhelifi opened this issue Aug 17, 2023 · 8 comments
Labels: bug (Something isn't working), v1 (Issues requiring resolution by the v1 milestone)

ckhelifi commented Aug 17, 2023

Description

Observed Behavior:

The pod is still in the Pending state; Karpenter keeps nominating a machine that does not have enough memory.

```
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Normal   Nominated         37m (x3 over 41m)     karpenter          Pod should schedule on: machine/default-bjb87
  Normal   Nominated         9m51s (x13 over 34m)  karpenter          Pod should schedule on: machine/default-bjb87
  Warning  FailedScheduling  4m41s (x11 over 41m)  default-scheduler  0/4 nodes are available: 4 Insufficient memory. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod.
  Normal   Nominated         81s (x4 over 7m41s)   karpenter          Pod should schedule on: machine/default-j956g
vagrant@vagrant-500001464:~/workspace/socles/infra-kubernetes/karpenter$ kubectl get machine
NAME            TYPE         ZONE         NODE                                          READY   AGE
default-7r6vx   c5a.large    eu-west-3c   ip-10-89-176-188.eu-west-3.compute.internal   True    25h
default-j956g   r5a.large    eu-west-3a   ip-10-89-37-49.eu-west-3.compute.internal     True    5d22h
default-x2cgk   r5a.xlarge   eu-west-3b   ip-10-89-110-249.eu-west-3.compute.internal   True    5d22h
metro-lb654     c5a.large    eu-west-3a   ip-10-89-15-199.eu-west-3.compute.internal    True    24h
vagrant@vagrant-500001464:~/workspace/socles/infra-kubernetes/karpenter$ kdpo private-batch-sit-metro-app-nginx-ingress-656c5c8c8c-sms7r -n private-batch-sit-metro
Name:           private-batch-sit-metro-app-nginx-ingress-656c5c8c8c-sms7r
Namespace:      private-batch-sit-metro
Priority:       0
Node:           <none>
Labels:         app=private-batch-sit-metro-app-nginx-ingress
                pod-template-hash=656c5c8c8c
Annotations:    kubernetes.io/psp: eks.privileged
                prometheus.io/port: 9113
                prometheus.io/scheme: http
                prometheus.io/scrape: true
Status:         Pending
IP:
IPs:            <none>
Controlled By:  ReplicaSet/private-batch-sit-metro-app-nginx-ingress-656c5c8c8c
Containers:
  private-batch-sit-metro-app-nginx-ingress:
    Image:       474820181376.dkr.ecr.eu-west-3.amazonaws.com/external/docker.io/nginx/nginx-ingress:2.3.0
    Ports:       80/TCP, 443/TCP, 9113/TCP, 8081/TCP
    Host Ports:  0/TCP, 0/TCP, 0/TCP, 0/TCP
    Args:
      -nginx-plus=false
      -nginx-reload-timeout=60000
      -enable-app-protect=false
      -enable-app-protect-dos=false
      -nginx-configmaps=$(POD_NAMESPACE)/private-batch-sit-metro-app-nginx-ingress
      -default-server-tls-secret=private-batch-sit-metro/api-certificate
      -ingress-class=private-batch-sit-metro-services-nginx
      -watch-namespace=private-batch-sit-metro
      -health-status=false
      -health-status-uri=/nginx-health
      -nginx-debug=false
      -v=1
      -nginx-status=true
      -nginx-status-port=8080
      -nginx-status-allow-cidrs=127.0.0.1
      -report-ingress-status
      -external-service=private-batch-sit-metro-app-nginx-ingress
      -enable-leader-election=true
      -leader-election-lock-name=private-batch-sit-metro-app-nginx-ingress-leader-election
      -wildcard-tls-secret=private-batch-sit-metro/api-certificate
      -enable-prometheus-metrics=true
      -prometheus-metrics-listen-port=9113
      -prometheus-tls-secret=
      -enable-custom-resources=true
      -enable-snippets=false
      -enable-tls-passthrough=false
      -enable-preview-policies=false
      -enable-cert-manager=false
      -enable-oidc=false
      -enable-external-dns=false
      -ready-status=true
      -ready-status-port=8081
      -enable-latency-metrics=false
    Requests:
      cpu:      100m
      memory:   128Mi
    Readiness:  http-get http://:readiness-port/nginx-ready delay=0s timeout=1s period=1s #success=1 #failure=3
    Environment:
      POD_NAMESPACE:  private-batch-sit-metro (v1:metadata.namespace)
      POD_NAME:       private-batch-sit-metro-app-nginx-ingress-656c5c8c8c-sms7r (v1:metadata.name)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-zbsjd (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  kube-api-access-zbsjd:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              mmb.fr/domain=metro
Tolerations:                 mmb.fr/domain=metro:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Normal   Nominated         3m19s (x60 over 124m)  karpenter          Pod should schedule on: machine/metro-lb654
  Warning  FailedScheduling  3m9s (x293 over 24h)   default-scheduler  0/4 nodes are available: 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling.
vagrant@vagrant-500001464:~/workspace/socles/infra-kubernetes/karpenter$ kubectl get machine
NAME            TYPE        ZONE         NODE                                          READY   AGE
default-7r6vx   c5a.large   eu-west-3c   ip-10-89-176-188.eu-west-3.compute.internal   True    26h
default-fn5pp   m5a.large   eu-west-3a   ip-10-89-6-228.eu-west-3.compute.internal     False   53s
default-svvgn   r5a.large   eu-west-3b   ip-10-89-96-59.eu-west-3.compute.internal     True    2m35s
metro-lb654     c5a.large   eu-west-3a   ip-10-89-15-199.eu-west-3.compute.internal    True    25h
vagrant@vagrant-500001464:~/workspace/socles/infra-kubernetes/karpenter$ kgno ip-10-89-15-199.eu-west-3.compute.internal -o yaml 
Error from server (NotFound): nodes "ip-10-89-15-199.eu-west-3.compute.internal" not found
vagrant@vagrant-500001464:~/workspace/socles/infra-kubernetes/karpenter$ kgno 
NAME                                          STATUS   ROLES    AGE     VERSION
ip-10-89-118-244.eu-west-3.compute.internal   Ready    <none>   24h     v1.24.15-eks-a5565ad
ip-10-89-138-225.eu-west-3.compute.internal   Ready    <none>   24h     v1.24.15-eks-a5565ad
ip-10-89-176-188.eu-west-3.compute.internal   Ready    <none>   26h     v1.24.15-eks-a5565ad
ip-10-89-36-219.eu-west-3.compute.internal    Ready    <none>   24h     v1.24.15-eks-a5565ad
ip-10-89-6-228.eu-west-3.compute.internal     Ready    <none>   51s     v1.24.15-eks-a5565ad
ip-10-89-96-59.eu-west-3.compute.internal     Ready    <none>   2m25s   v1.24.15-eks-a5565ad
vagrant@vagrant-500001464:~/workspace/socles/infra-kubernetes/karpenter$ kubectl delete machine metro-lb654
machine.karpenter.sh "metro-lb654" deleted
vagrant@vagrant-500001464:~/workspace/socles/infra-kubernetes/karpenter$ kubectl get machine
NAME            TYPE        ZONE         NODE                                          READY   AGE
default-7r6vx   c5a.large   eu-west-3c   ip-10-89-176-188.eu-west-3.compute.internal   True    26h
default-fn5pp   m5a.large   eu-west-3a   ip-10-89-6-228.eu-west-3.compute.internal     True    106s
default-svvgn   r5a.large   eu-west-3b   ip-10-89-96-59.eu-west-3.compute.internal     True    3m28s
metro-tkh79     c5a.large   eu-west-3c                                                 False   6s
vagrant@vagrant-500001464:~/workspace/socles/infra-kubernetes/karpenter$ kubectl ge^C
vagrant@vagrant-500001464:~/workspace/socles/infra-kubernetes/karpenter$ kubectl get machine
NAME            TYPE        ZONE         NODE                                          READY   AGE
default-7r6vx   c5a.large   eu-west-3c   ip-10-89-176-188.eu-west-3.compute.internal   True    26h
default-svvgn   r5a.large   eu-west-3b   ip-10-89-96-59.eu-west-3.compute.internal     True    5m3s
metro-tkh79     c5a.large   eu-west-3c   ip-10-89-163-253.eu-west-3.compute.internal   True    101s
vagrant@vagrant-500001464:~/workspace/socles/infra-kubernetes/karpenter$ kgno ip-10-89-163-253.eu-west-3.compute.internal
NAME                                          STATUS   ROLES    AGE   VERSION
ip-10-89-163-253.eu-west-3.compute.internal   Ready    <none>   86s   v1.24.15-eks-a5565ad
```

The `machine` nominated by Karpenter is linked to a node that no longer exists in the cluster.
If I delete the `machine` manually, Karpenter creates a new one and everything goes back to normal.
It seems that Karpenter's downscaling removed only the node and not the `machine`?
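
A minimal sketch for spotting such stale machines (assuming the Machine status exposes the linked node name at `.status.nodeName`, which is what the NODE column above shows):

```
# Sketch: list Machines whose linked Node no longer exists, then optionally
# delete them so that Karpenter re-provisions the capacity.
kubectl get machines.karpenter.sh \
  -o custom-columns='MACHINE:.metadata.name,NODE:.status.nodeName' --no-headers |
while read -r machine node; do
  if [ -n "$node" ] && [ "$node" != "<none>" ] && ! kubectl get node "$node" >/dev/null 2>&1; then
    echo "Machine $machine points at missing node $node"
    # kubectl delete machine "$machine"   # uncomment to apply the manual workaround
  fi
done
```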

**Expected Behavior**:

Karpenter should start a new machine or node

**Reproduction Steps** (Please include YAML):

**Versions**:
- Kubernetes Version (`kubectl version`):  1.24



@ckhelifi ckhelifi added the bug Something isn't working label Aug 17, 2023
@bwagner5
Contributor

This bug should be fixed by kubernetes-sigs/karpenter#449 and will be released early next week.

@runningman84

@bwagner5 it looks like there is still no new release... we are also suffering from this issue.

@albertschwarzkopf

albertschwarzkopf commented Aug 29, 2023

Tested with 0.30.0-rc.0; the behaviour is the same. Karpenter provisions new nodes (in this case c5d.large), but the pending pod cannot be scheduled there. Only if I remove one of the DaemonSets (e.g. node-problem-detector) can the pending pod be scheduled.

```
dy: }, 3 Insufficient memory, 4 node(s) had untolerated taint {arch: arm64}, 6 Insufficient cpu. preemption: 0/12 nodes are available: 2 Insufficient cpu, 2 Insufficient memory, 4 No preemption victims found for incoming pod, 6 Preemption is not helpful for scheduling..
  Warning  FailedScheduling  13m (x11 over 18m)      default-scheduler  0/12 nodes are available: 1 node(s) had untolerated taint {eks.amazonaws.com/compute-type: fargate}, 4 Insufficient memory, 4 node(s) had untolerated taint {arch: arm64}, 6 Insufficient cpu. preemption: 0/12 nodes are available: 2 Insufficient cpu, 2 Insufficient memory, 5 No preemption victims found for incoming pod, 5 Preemption is not helpful for scheduling..
  Normal   Nominated         12m                     karpenter          Pod should schedule on: machine/x86-4nkg6
  Normal   Nominated         10m                     karpenter          Pod should schedule on: machine/x86-6bh65
  Normal   Nominated         8m47s                   karpenter          Pod should schedule on: machine/x86-frgp4
  Normal   Nominated         7m14s                   karpenter          Pod should schedule on: machine/x86-r9kbv
  Normal   Nominated         5m5s                    karpenter          Pod should schedule on: machine/x86-8r955
  Warning  FailedScheduling  4m16s (x15 over 9m33s)  default-scheduler  (combined from similar events): 0/13 nodes are available: 1 node(s) were unschedulable, 4 node(s) had untolerated taint {arch: arm64}, 5 Insufficient memory, 6 Insufficient cpu. preemption: 0/13 nodes are available: 2 Insufficient cpu, 2 Insufficient memory, 5 Preemption is not helpful for scheduling, 6 No preemption victims found for incoming pod..
  Normal   Nominated         3m22s                   karpenter          Pod should schedule on: machine/x86-pjm97
  Normal   Nominated         82s                     karpenter          Pod should schedule on: machine/x86-m2rbw
```

If I compare the memory resources, I can see a difference between Karpenter's calculation and the node's allocatable resources:

```
2023-08-29T07:57:33.121Z    INFO    controller.provisioner    computed new machine(s) to fit pod(s)    {"commit": "34d50bf-dirty", "machines": 1, "pods": 1}
2023-08-29T07:57:33.476Z    INFO    controller.provisioner    created machine    {"commit": "34d50bf-dirty", "provisioner": "x86", "requests": {"cpu":"1233m","memory":"3330277375","pods":"7"}, "instance-types": "c5.2xlarge, c5.4xlarge, c5.large, c5.xlarge, c5a.2xlarge and 95 other(s)"}
2023-08-29T07:57:35.964Z    DEBUG    controller.provisioner    waiting on cluster sync    {"commit": "34d50bf-dirty"}
2023-08-29T07:57:36.339Z    INFO    controller.machine.lifecycle    launched machine    {"commit": "34d50bf-dirty", "machine": "x86-mjjt6", "provisioner": "x86", "provider-id": "aws:///eu-central-1c/i-0fcb91969158c38df", "instance-type": "c5d.large", "zone": "eu-central-1c", "capacity-type": "spot", "allocatable": {"cpu":"1500m","ephemeral-storage":"43Gi","memory":"3337827123200m","pods":"29"}}

  cpu:                1500m
  ephemeral-storage:  45391280050
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             3236475699200m
  pods:               29
```
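
To reproduce this comparison on a live cluster, a minimal sketch (assuming the Machine status exposes `.status.nodeName` and `.status.allocatable`, and using the machine name from the log above):

```
# Sketch: compare the allocatable memory Karpenter recorded on the Machine
# against what the kubelet actually reports on the corresponding Node.
MACHINE=x86-mjjt6
NODE=$(kubectl get machine "$MACHINE" -o jsonpath='{.status.nodeName}')
echo "machine allocatable memory: $(kubectl get machine "$MACHINE" -o jsonpath='{.status.allocatable.memory}')"
echo "node allocatable memory:    $(kubectl get node "$NODE" -o jsonpath='{.status.allocatable.memory}')"
```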

@runningman84

In 0.30.0 the same problem exists @bwagner5 @jonathan-innis

@ellistarn ellistarn reopened this Sep 2, 2023
@jonathan-innis
Contributor

jonathan-innis commented Sep 4, 2023

It looks like this has to do with the vmMemoryOverheadPercent calculation being too small compared to the actual overhead that is taken away from the capacity that EC2 reports through its API. I quickly generated some data comparing the EC2-reported capacity value against the actual capacity of the launched node when using AL2: capacity-diff.csv.

One workaround is to raise the vmMemoryOverheadPercent that Karpenter uses when estimating node capacity for pod scheduling, so that it avoids nominating nodes that don't have enough memory. Realistically, we should have an upstream fix for this issue. There's a draft PR (#4517) intended to fix it by generating the capacity values rather than applying a heuristic percentage-based overhead as we do now.
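
A minimal sketch of that workaround, assuming a v0.2x/v0.3x installation where the setting lives in the `karpenter-global-settings` ConfigMap (the namespace and key may differ in your install):

```
# Sketch: raise vmMemoryOverheadPercent (default 0.075 at the time) so Karpenter
# assumes less allocatable memory per instance, then restart the controller.
kubectl -n karpenter patch configmap karpenter-global-settings \
  --type merge -p '{"data":{"aws.vmMemoryOverheadPercent":"0.1"}}'
kubectl -n karpenter rollout restart deployment karpenter
```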

@St0rmRage

> It looks like this has to do with the vmMemoryOverheadPercent calculation being too small compared to the actual overhead that is taken away from the capacity that EC2 reports through its API. I quickly generated some data comparing the EC2-reported capacity value against the actual capacity of the launched node when using AL2: capacity-diff.csv.
>
> One workaround is to raise the vmMemoryOverheadPercent that Karpenter uses when estimating node capacity for pod scheduling, so that it avoids nominating nodes that don't have enough memory. Realistically, we should have an upstream fix for this issue. There's a draft PR (#4517) intended to fix it by generating the capacity values rather than applying a heuristic percentage-based overhead as we do now.

So following this, it seems that if we are using another OS (Bottlerocket), there will be a difference in the nodes' allocatable resources, and the list in #4517 will not apply to Bottlerocket; it will need to be regenerated specifically for it.
As for a current workaround (we have been struggling with this issue), by how much should we increase the vmMemoryOverheadPercent?
Thanks

@jonathan-innis
Contributor

> by how much should we increase the vmMemoryOverheadPercent?

From the looks of the generated data, you may need to raise it to 0.11 for the c5d.large.
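
For anyone tuning this for a different instance type or AMI family, a rough back-of-the-envelope sketch (the `reported * (1 - overhead)` relationship is assumed from this discussion, and the node capacity below is a placeholder to replace with your own measurement):

```
# Sketch: estimate the minimum vmMemoryOverheadPercent needed so that
# EC2-reported memory * (1 - overhead) <= memory the launched node actually reports.
EC2_REPORTED_MIB=4096    # c5d.large as reported by the EC2 API
NODE_CAPACITY_MIB=3700   # placeholder: take this from `kubectl get node -o yaml` (status.capacity.memory)
awk -v r="$EC2_REPORTED_MIB" -v c="$NODE_CAPACITY_MIB" \
  'BEGIN { printf "minimum vmMemoryOverheadPercent: %.3f\n", 1 - c / r }'
```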

> there will be a difference in the nodes' allocatable resources

We are planning to generate a list that considers all of our supported OSes and then either:

  1. Differentiate between them by keying on the AMIFamily name, or
  2. Take the maximum across all AMIFamilies and use that as the overhead for all of them.

@njtran
Contributor

njtran commented Aug 12, 2024

Closing in favor of #5161

@njtran njtran closed this as completed Aug 12, 2024