
Support handling custom kube reserved values #1803

Open
stevehipwell opened this issue May 13, 2022 · 15 comments
Labels: feature New feature or request

@stevehipwell
Contributor

Tell us about your request
I'd like Karpenter to work correctly when not using the EKS optimised AMI calculations for --kube-reserved (as already discussed in #1490). The current logic in Karpenter is incorrect, as it assumes that pod count is directly related to the number of ENIs on an instance and that the kube-reserved memory calculation is a function of pod count rather than memory.

I'd like Karpenter to use the GKE memory calculation when AWS_ENI_LIMITED_POD_DENSITY=false as the CPU limit calculation already comes from this source.

Karpenter should probably set the value for ephemeral storage too.
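
For reference, the GKE memory calculation mentioned above is deterministic and easy to reproduce; below is a minimal sketch (the function is illustrative, not Karpenter code) using the percentages from the public GKE documentation: 255Mi for machines under 1GiB, 25% of the first 4GiB, 20% of the next 4GiB, 10% of the next 8GiB, 6% of the next 112GiB and 2% of anything above 128GiB.

package main

import (
    "fmt"
    "math"
)

// gkeReservedMemoryMiB is an illustrative implementation of the GKE
// kube-reserved memory bands described above.
func gkeReservedMemoryMiB(totalMiB float64) float64 {
    if totalMiB < 1024 {
        return 255
    }
    // Upper bound of each band (MiB) and the fraction reserved within it.
    bands := []struct {
        upperMiB float64
        fraction float64
    }{
        {4 * 1024, 0.25},
        {8 * 1024, 0.20},
        {16 * 1024, 0.10},
        {128 * 1024, 0.06},
    }
    reserved, prev := 0.0, 0.0
    for _, b := range bands {
        if totalMiB <= prev {
            break
        }
        reserved += (math.Min(totalMiB, b.upperMiB) - prev) * b.fraction
        prev = b.upperMiB
    }
    if totalMiB > 128*1024 {
        reserved += (totalMiB - 128*1024) * 0.02
    }
    return reserved
}

func main() {
    // An 8GiB instance reserves 4096*0.25 + 4096*0.20 ≈ 1843Mi.
    fmt.Printf("%.0fMi\n", gkeReservedMemoryMiB(8*1024))
}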

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
We run EKS nodes using either custom networking with IPv4 prefixes or IPv6 prefixes, which means that IP addresses aren't scarce. We follow the K8s large clusters guide, so we have a maximum of 110 pods per node with custom kube-reserved values based on the GKE calculations, and we need Karpenter to respect these settings.

Are you currently working around this issue?
I'm hoping that Karpenter doesn't break with custom values set; if it does, we'll move back to Cluster Autoscaler.

Additional context
I've been using this configuration in production ever since the AWS VPC CNI added support for IP prefixes, and it has solved a number of issues we had with nodes before making this change.

Attachments
n/a

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@stevehipwell stevehipwell added the feature New feature or request label May 13, 2022
@stevehipwell
Contributor Author

CC @bwagner5

@bwagner5
Contributor

We'll be setting ephemeral storage accurately soon.

But agreed on Karpenter respecting a custom kube-reserved and, in general, overhead values for nodes.

@stevehipwell
Contributor Author

Would you take a PR to use the GKE logic when AWS_ENI_LIMITED_POD_DENSITY=false as a stopgap?

@suket22
Contributor

suket22 commented May 13, 2022

We're going to respect custom kube-reserved values. We might just do it as part of spec.kubeletConfiguration so you don't need to write a custom UserData script for it. That way it'd help our binpacking logic too which would need to account for it.
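
For illustration, a static value on the Provisioner might look something like the sketch below, assuming the values end up under spec.kubeletConfiguration as suggested (the field placement and names here are speculative, not a final API):

spec:
  kubeletConfiguration:
    kubeReserved:
      cpu: 100m
      memory: 1Gi
      ephemeral-storage: 1Gi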

Changing the memoryReserved logic might surprise our other users so I'm hesitant to make that change.

@bwagner5
Contributor

@stevehipwell I guess respecting a static value wouldn't help you much since you still need it to be a dynamic calc?

@stevehipwell
Contributor Author

@suket22 having it available as part of spec.kubeletConfiguration would be great.

Changing the memoryReserved logic might surprise our other users so I'm hesitant to make that change.

I've yet to see any documentation on how the current value was calculated, and the goalposts have moved since then, so I'm not sure why you're sticking with it as a calculation. Karpenter is pre-v1, so each release is technically a breaking change.

But new logic could be added under a new env variable to act as a pair to AWS_ENI_LIMITED_POD_DENSITY.

@bwagner5 doesn't Karpenter know the instance type when it calculates the reserved values? I think it currently uses the number of CPUs.

@bwagner5
Contributor

Yes, Karpenter knows the instance type and all specs of it, but it would need to be a calculation within the controller. spec.kubeletConfiguration would be used to set a static value.

@stevehipwell
Contributor Author

@bwagner5 you're absolutely right! So I need the alternative algorithm and to make it match my user data.

@suket22 suket22 self-assigned this May 23, 2022
@suket22
Contributor

suket22 commented May 24, 2022

Good point on a static value in the provisioner not helping here. But does a calculation inside Karpenter help either?
Karpenter doesn't know which instance type Fleet is going to select, so any value we set inside the controller might not be correct (unless of course you've restricted Karpenter to a set of homogeneous instance types, but we probably can't rely on that).

It sounds to me like the right place for this alternative calculation is within the EKS AMI itself. In the short term, I think the right approach for Karpenter is probably to let you set these values within a UserData script.

@stevehipwell
Contributor Author

@suket22 if Karpenter can calculate the CPU reservation from the instance CPUs then it can calculate the memory reservation too, as it's a function of instance memory. Whatever calculation is used, you're going to need either a deterministic calculation based on CPUs & memory that can be run on both sides, or a way to pass the calculated values back from the instance.
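
For illustration, the CPU half of such a deterministic calculation using the GKE fractions (6% of the first core, 1% of the second, 0.5% of cores three and four, 0.25% of anything above four); again a sketch, not Karpenter code:

package main

import "fmt"

// gkeReservedCPUMillicores is an illustrative implementation of the GKE
// CPU reservation bands; the same logic could run in the controller and in user data.
func gkeReservedCPUMillicores(cores int) float64 {
    reserved := 0.0
    for core := 1; core <= cores; core++ {
        switch {
        case core == 1:
            reserved += 1000 * 0.06 // 6% of the first core
        case core == 2:
            reserved += 1000 * 0.01 // 1% of the second core
        case core <= 4:
            reserved += 1000 * 0.005 // 0.5% of cores three and four
        default:
            reserved += 1000 * 0.0025 // 0.25% of any core above four
        }
    }
    return reserved
}

func main() {
    // A 4-core instance reserves 60 + 10 + 5 + 5 = 80 millicores.
    fmt.Printf("%.0fm\n", gkeReservedCPUMillicores(4))
}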

@stevehipwell
Contributor Author

@bwagner5 where have we got to with this? Not being able to configure our nodes to use the GKE calculation for kube-reserved memory is blocking our GA adoption of Karpenter.

@stevehipwell
Contributor Author

stevehipwell commented Nov 13, 2023

After seeing the new spec for AKS reservations I've realised that this implementation should be flexible. I've got a proposed API for handling kube-reserved configuration as part of the NodePool. I also think this should be extended to support ephemeral-storage in the future, as it's also dependent on system resources.

Current

Only supports a single node size per pool.

kubelet:
  kubeReserved:
    cpu: 200m
    memory: 100Mi

CPU Core Based

Supports the current EKS algorithm.

  • Set a reserved value for each core (1000m), with the last value being used for any remaining cores (can be 0m)

kubelet:
  kubeReserved:
    cpuCalc:
      reserved: ["600m", "100m", "50m", "50m", "25m"]

Memory Pod Based

Supports the current EKS algorithm.

  • Both max and min can be a percentage like 25%

kubelet:
  kubeReserved:
    memoryCalc:
      mode: pods
      multiple: 11Mi
      constant: 255Mi
      max:
      min:
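
For example, a node allowing 110 pods would reserve 255Mi + 11Mi × 110 = 1465Mi with these values, matching the current EKS formula.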

New AKS pattern.

kubelet:
  kubeReserved:
    memoryCalc:
      mode: pods
      multiple: 20Mi
      constant: 50Mi
      max: 25%
      min:

Memory System Memory Based

Supports the GKE algorithm (the example shows values for up to 16GB of memory).

  • Set a reserved value for each GB of memory (1024Mi), with the last value being used for any remaining GBs (can be 0Mi)
  • Both max and min can be a percentage like 25%

kubelet:
  kubeReserved:
    memoryCalc:
      mode: memory
      reserved: ["256Mi", "256Mi", "256Mi", "256Mi", "205Mi", "205Mi", "205Mi", "205Mi", "102Mi", "102Mi", "102Mi", "102Mi", "102Mi", "102Mi", "102Mi", "102Mi"]
      max:
      min: 255Mi
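
For example, with the array above a 16GiB instance would reserve 4 × 256Mi + 4 × 205Mi + 8 × 102Mi = 2660Mi, which tracks the 25%/20%/10% bands of the GKE calculation.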

CC @bwagner5 @jonathan-innis

NOTE - I updated this comment to fix the API to avoid ambiguous types.

@jonathan-innis
Contributor

I updated this comment to fix the API to avoid ambiguous types

Just doing a cursory look over the proposed API, this strikes me as really difficult for someone to reason about how Karpenter is actually going to go about calculating these values.

I still find myself leaning towards a static value approach (something similar to, but maybe not looking exactly like, the instance type overrides approach) where some other component is responsible for reconciling those static values in a way that allows scenarios like the ones you have called out above.

@stevehipwell
Contributor Author

Just doing a cursory look over the proposed API, this strikes me as really difficult for someone to reason about how Karpenter is actually going to go about calculating these values.

@jonathan-innis the problem here is that we need to model a complex system, so the API will be complex; if you need this functionality then I think you'd be able to understand it.

I still find myself leaning towards a static value approach (something similar to, but maybe not looking exactly like, the instance type overrides approach) where some other component is responsible for reconciling those static values in a way that allows scenarios like the ones you have called out above.

Yes, that could work; are you thinking of named algorithms? In a dynamic reserved scenario it'd be pretty easy to get this working in AL2, but for Bottlerocket we'd need a bootstrap container (bottlerocket-os/bottlerocket#2010). Ideally it would be published by Bottlerocket, but Karpenter could create its own; within AWS this would have trivial cost, as it should just be the public ECR image with an alternate entrypoint.


All of that said, I've had some more thoughts on this whole area, and I think the following logic describes the whole system.

  • Kube reserved should be a function of the maximum number of pods
    • Additional functionality (such as Containerd plugins) could change the actual value required
  • Pod overhead could be used to limit the impact of kube reserved on small nodes
    • This isn't perfect as there is no principle of default overheads
    • This would require a mutating admission controller
  • System reserved memory should be a function of the SKU or large enough to accommodate the largest SKU
    • Greater node memory results in the kernel using more memory
  • System reserved CPU is likely to be fixed

I think the current EKS memory calculation is probably the worst of all worlds, as it's based on ENI limits for pod density and on Docker as the container runtime. I think the GKE memory calculation covers most of the above logic but isn't perfect: it's based on Docker, all dynamic reservations are in kube reserved, and small nodes may still not have enough memory reserved. I think the current CPU calculation (all providers use the same one) might be able to be simplified, as I don't think the overhead is really a function of available cores.

TL;DR - Maybe static values would work on nodes over a certain size but we'd still need a pattern for small nodes unless using pod overhead.

@stevehipwell
Contributor Author

We're currently experimenting with running static kubeReserved and systemReserved values across all node sizes, as all of our nodes support a maximum of 110 pods. We're using the following values, based on some experimentation we ran.

systemReserved:
  cpu: 100m
  memory: 100Mi
  ephemeral-storage: 1Gi

kubeReserved:
  cpu: 100m
  memory: 1465Mi
  ephemeral-storage: 1Gi
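
(For reference, 255Mi + 11Mi × 110 = 1465Mi, so the kube-reserved memory above happens to match the pod-based EKS formula for 110 pods.)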
