
Can nodes provision faster than 2-4 minutes? Also, can Karpenter provision additional nodes based on node resource metrics? #2906

Closed
toddatapiture opened this issue Nov 21, 2022 · 9 comments
Labels
lifecycle/closed lifecycle/stale question Further information is requested

Comments

@toddatapiture

Is an existing page relevant?

N/A

What karpenter features are relevant?

We have had great success with Karpenter and really enjoy this project for our platforms. I want to ask a question about node provisioning times. Also, can nodes be provisioned based on node resource metrics (CPU and Memory)?

For the services we have created, this hasn't been an issue. We create a Helm chart for our services and deploy them onto the EKS cluster as a Deployment. We also use HPA (horizontal pod autoscaling), so once a Deployment's Pods reach a threshold (resource metric), more replicas are created. This triggers a scaling event in Karpenter, which provisions new nodes so those Pods can be scheduled. It also works smoothly because we can start the provisioning process at, say, 70% of the Pod's resources, so there's still a little cushion while Pods and nodes are scaling. This works great and as expected.
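For context, the replica count HPA arrives at in the flow above follows the documented HPA scaling algorithm: `desired = ceil(currentReplicas * currentMetric / targetMetric)`, with no change while the ratio is within a tolerance band (0.1 by default). A minimal sketch:

```python
import math

def hpa_desired_replicas(current_replicas: int,
                         current_metric: float,
                         target_metric: float,
                         tolerance: float = 0.1) -> int:
    """Simplified sketch of the Kubernetes HPA scaling formula:
    desired = ceil(currentReplicas * currentMetric / targetMetric),
    holding steady while the ratio is within the tolerance band."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    return math.ceil(current_replicas * ratio)

# 3 replicas averaging 140% of a 70% CPU target -> scale to 6
print(hpa_desired_replicas(3, 140, 70))  # -> 6
```

This is why a 70% target leaves headroom: scaling kicks in well before the Pods are saturated, which in turn gives Karpenter a head start on node provisioning.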

The issue we are encountering is that we are migrating our Jenkins agents to K8s and using Karpenter to scale them. We don't have Deployments for the agents, so we don't have HPA. Each Jenkins job creates a Pod (via a Pod definition), and those Pods are scheduled onto Karpenter-provisioned nodes. This isn't too bad during the day under high traffic. But first thing in the morning, or late at night during releases, we have no nodes running. So when a Jenkins job starts, it creates a build (Pod) for that job and a Karpenter scaling event happens. During this event it takes 2-4 minutes to provision a node, and some of the Jenkins jobs used for releases only take 30 seconds to run. There has been pushback from stakeholders because the provisioning takes longer than the actual release task (the run of the Jenkins job). It's not a lot of time, but when you're watching the console output, waiting 2-4 minutes can feel like a long time.

The other thing is that with our Jenkins setup, new nodes don't get provisioned until Pods are unschedulable. As mentioned above, if we could use something like HPA, we could start scaling Pods ahead of time to trigger node provisioning in advance; that would be a fluid, seamless flow. But the way it currently works, we have to hit our limit first and schedule second. Is there any way to have Karpenter provision new nodes in unique situations like this based on node resource metrics (CPU/memory)? If the node is at x%, go ahead and scale.

Any help or feedback would be greatly appreciated!

How should the docs be improved?

N/A

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@bwagner5
Contributor

Are you using a custom AMI with user-data? We usually see nodes provisioned in at most 55 seconds today, and often in 45-50 seconds.

We are working on improving the time it takes nodes to provision which we will be releasing soon. More info about the AMI, user-data, and node initialization (what instance types, CNI, region, etc) may help us provide recommendations to expedite the provisioning.

@bwagner5 bwagner5 added the question Further information is requested label Nov 21, 2022
@FernandoMiguel
Contributor

A hack some folks use is running a low-priority deployment to keep warm nodes around; it then gets evicted when those Jenkins pods come in, if there is not enough capacity.
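That warm-node hack is commonly built from a low-value PriorityClass plus a Deployment of pause containers whose resource requests hold capacity open. A minimal sketch, where the names, replica count, and request sizes are placeholders to adapt to your Jenkins agents' typical requests:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning            # placeholder name
value: -10                          # lower than the default (0), so real pods preempt these
globalDefault: false
description: "Placeholder pods that hold warm capacity and yield to real workloads"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: warm-pool                   # placeholder name
spec:
  replicas: 2                       # tune to the amount of headroom you want
  selector:
    matchLabels:
      app: warm-pool
  template:
    metadata:
      labels:
        app: warm-pool
    spec:
      priorityClassName: overprovisioning
      terminationGracePeriodSeconds: 0   # evict instantly when preempted
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "2"              # size roughly to one Jenkins agent pod
              memory: 4Gi
```

When a Jenkins build Pod arrives and capacity is tight, the scheduler preempts the pause Pods, so the build starts immediately; the evicted pause Pods then go pending, and Karpenter provisions replacement capacity in the background.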

@toddatapiture
Author

@bwagner5 We haven't seen provisioning times that fast; we are on Karpenter version 0.16.2. As for custom AMIs, we don't currently use any. We are trying to keep our setup as simple as possible while remaining reliable, secure, and low-maintenance, so we consume the EKS-optimized AMIs from AWS for the AL2 amiFamily.

The instance types vary depending on the Jenkins jobs. Some require a lot of resources while others don't require much at all. I will say the more common instances are c6a.4xlarge and t3.medium. We are provisioning these in the us-east-1 region and for CNI we use the EKS add-on VPC-CNI version v1.11.3-eksbuild.1.

@toddatapiture
Author

toddatapiture commented Nov 22, 2022

@FernandoMiguel Thanks for the feedback! We have contemplated some similar ideas, like a Jenkins job that runs on a schedule to warm the nodes before a scheduled release. That way, stakeholders go into a release and things feel quick and responsive. It could get a bit cumbersome, though, since release schedules can change while the warm-up job sits on a cron schedule; unless we updated that schedule each time, the two could drift apart.
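For what it's worth, the cron-scheduled warm-up could be expressed in-cluster rather than in Jenkins, e.g. as a Kubernetes CronJob that scales up a warm-pool Deployment shortly before a release window (all names, the schedule, and the RBAC setup here are hypothetical and would need to be created for your cluster):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: warm-pool-scale-up          # placeholder name
spec:
  schedule: "45 7 * * 1-5"          # e.g. 15 minutes before an 08:00 weekday release window
  jobTemplate:
    spec:
      template:
        spec:
          # Hypothetical ServiceAccount that must be granted RBAC
          # permission to scale the target Deployment.
          serviceAccountName: warm-pool-scaler
          restartPolicy: Never
          containers:
            - name: scale
              image: bitnami/kubectl:latest
              command: ["kubectl", "scale", "deployment/warm-pool", "--replicas=4"]
```

A matching CronJob scaling back to 0 after the window would avoid paying for idle capacity, though it shares the drift problem you describe: if the release schedule moves, the cron expressions have to move with it.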

I could add it at the beginning of a release, but then it feels like an extra layer with the same outcome: they would still have to wait for the nodes to warm during that stage of the release. I am trying to abstract away any delay they feel, since that is their point of concern. I welcome any and all feedback, though, in case I am misunderstanding.

@bwagner5
Contributor

It sounds like you're facing a longer pod startup time. I suspect the node actually launches and gets to a Ready state, at least within 1 minute. I'd be very interested if that was not the case for the instance types you are using.

Check out this issue where I'm tracking some node launch latency improvements awslabs/amazon-eks-ami#1099

@toddatapiture
Author

@bwagner5 I am going to spend some time monitoring our Pods/nodes. I thought when I looked in the past, it usually took only a few seconds for the Pods to start.

That's impressive with the EKS AMI updates - 31 seconds provisioning time 😍

@github-actions
Contributor

Labeled for closure due to inactivity in 10 days.

@jorotg

jorotg commented Aug 11, 2023

a hack some folks use is having a low priority deployment to keep warm nodes, and then gets evicted when those jenkins pods come in, if there is not enough capacity.

@FernandoMiguel Can you share how you actually achieved it? If you did, of course.

@FernandoMiguel
Contributor

I don't have one at hand; if you search this repo, you will find descriptions from other folks doing exactly that.
