
Can nodes provision faster than 2-4 minutes? Also, can Karpenter provision additional nodes based on node resource metrics? #2906

Closed
toddatapiture opened this issue Nov 21, 2022 · 9 comments
Labels
lifecycle/closed lifecycle/stale question Further information is requested

Comments

@toddatapiture

Is an existing page relevant?

N/A

What karpenter features are relevant?

We have had great success with Karpenter and really enjoy this project for our platforms. I want to ask a question about node provisioning times. Also, can nodes be provisioned based on node resource metrics (CPU and Memory)?

For the services we have created, this hasn't been an issue. We create a Helm chart for our services and deploy them onto the EKS cluster as a Deployment. We also use HPA (horizontal pod autoscaling), so once a Deployment's Pods reach a threshold (resource metric), more replicas are created. This triggers a scaling event in Karpenter, which provisions new nodes so those Pods can be scheduled. It also works smoothly because we can start the provisioning process at, say, 70% of the Pod's resources, so there's still a little cushion while Pods and nodes are scaling. This works great and as expected.
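For context, the replica count HPA arrives at in the flow above follows the documented HPA scaling algorithm: `desired = ceil(currentReplicas * currentMetric / targetMetric)`, with no change while the ratio is within a tolerance band (0.1 by default). A minimal sketch:

```python
import math

def hpa_desired_replicas(current_replicas: int,
                         current_metric: float,
                         target_metric: float,
                         tolerance: float = 0.1) -> int:
    """Simplified sketch of the Kubernetes HPA scaling formula:
    desired = ceil(currentReplicas * currentMetric / targetMetric),
    holding steady while the ratio is within the tolerance band."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    return math.ceil(current_replicas * ratio)

# 3 replicas averaging 140% of a 70% CPU target -> scale to 6
print(hpa_desired_replicas(3, 140, 70))  # -> 6
```

This is why a 70% target leaves headroom: scaling kicks in well before the Pods are saturated, which in turn gives Karpenter a head start on node provisioning.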

The issue we are encountering is that we are migrating our Jenkins agents to K8s and using Karpenter to scale them. We don't have Deployments for the agents, so we don't have HPA. Each Jenkins job creates a Pod (via a Pod definition), and those Pods are scheduled onto Karpenter-provisioned nodes. This isn't too bad during the day under high traffic. But first thing in the morning, or late at night during releases, we have no nodes running. So when a Jenkins job starts, it creates a build (Pod) for that job and a Karpenter scaling event happens. During this event it takes 2-4 minutes to provision a node, and some of the Jenkins jobs used for releases only take 30 seconds to run. There has been pushback from stakeholders because the provisioning takes longer than the actual release task (the run of the Jenkins job). It's not a lot of time, but when you're watching the console output, waiting 2-4 minutes can feel like a long time.

The other thing is that with our Jenkins setup, new nodes don't get provisioned until Pods are unschedulable. As mentioned above, if we could use something like HPA, we could start scaling Pods ahead of time to trigger node provisioning in advance; that would be a fluid, seamless flow. But the way it currently works, we have to hit our limit first and schedule second. Is there any way to have Karpenter provision new nodes in unique situations like this based on node resource metrics (CPU/memory)? If the node is at x%, go ahead and scale.

Any help or feedback would be greatly appreciated!

How should the docs be improved?

N/A

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@bwagner5
Contributor

Are you using a custom AMI with user-data? We usually see nodes provisioned in at most 55 seconds today, and often in 45-50 seconds.

We are working on improving the time it takes nodes to provision which we will be releasing soon. More info about the AMI, user-data, and node initialization (what instance types, CNI, region, etc) may help us provide recommendations to expedite the provisioning.

@bwagner5 bwagner5 added the question Further information is requested label Nov 21, 2022
@FernandoMiguel
Contributor

A hack some folks use is running a low-priority deployment to keep warm nodes around; it then gets evicted when those Jenkins pods come in, if there is not enough capacity.
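That warm-node hack is commonly built from a low-value PriorityClass plus a Deployment of pause containers whose resource requests hold capacity open. A minimal sketch, where the names, replica count, and request sizes are placeholders to adapt to your Jenkins agents' typical requests:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning            # placeholder name
value: -10                          # lower than the default (0), so real pods preempt these
globalDefault: false
description: "Placeholder pods that hold warm capacity and yield to real workloads"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: warm-pool                   # placeholder name
spec:
  replicas: 2                       # tune to the amount of headroom you want
  selector:
    matchLabels:
      app: warm-pool
  template:
    metadata:
      labels:
        app: warm-pool
    spec:
      priorityClassName: overprovisioning
      terminationGracePeriodSeconds: 0   # evict instantly when preempted
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "2"              # size roughly to one Jenkins agent pod
              memory: 4Gi
```

When a Jenkins build Pod arrives and capacity is tight, the scheduler preempts the pause Pods, so the build starts immediately; the evicted pause Pods then go pending, and Karpenter provisions replacement capacity in the background.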

@toddatapiture
Author

@bwagner5 We haven't seen provisioning times that fast; we are on Karpenter version 0.16.2. As for custom AMIs, we don't currently use any. We are trying to keep our setup as simple as possible while remaining reliable, secure, and low-maintenance, so we consume the EKS-optimized AMIs from AWS for the AL2 amiFamily.

The instance types vary depending on the Jenkins jobs. Some require a lot of resources while others don't require much at all. I will say the more common instances are c6a.4xlarge and t3.medium. We are provisioning these in the us-east-1 region and for CNI we use the EKS add-on VPC-CNI version v1.11.3-eksbuild.1.

@toddatapiture
Author

toddatapiture commented Nov 22, 2022

@FernandoMiguel Thanks for the feedback! We have contemplated some similar ideas, like a Jenkins job that runs on a schedule to warm the nodes before a scheduled release. That way, stakeholders go into a release and things feel quick and responsive. It could get a bit cumbersome, though, since release schedules can change while the warm-up job sits on a cron schedule; unless we updated that schedule each time, the two could drift apart.
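For what it's worth, the cron-scheduled warm-up could be expressed in-cluster rather than in Jenkins, e.g. as a Kubernetes CronJob that scales up a warm-pool Deployment shortly before a release window (all names, the schedule, and the RBAC setup here are hypothetical and would need to be created for your cluster):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: warm-pool-scale-up          # placeholder name
spec:
  schedule: "45 7 * * 1-5"          # e.g. 15 minutes before an 08:00 weekday release window
  jobTemplate:
    spec:
      template:
        spec:
          # Hypothetical ServiceAccount that must be granted RBAC
          # permission to scale the target Deployment.
          serviceAccountName: warm-pool-scaler
          restartPolicy: Never
          containers:
            - name: scale
              image: bitnami/kubectl:latest
              command: ["kubectl", "scale", "deployment/warm-pool", "--replicas=4"]
```

A matching CronJob scaling back to 0 after the window would avoid paying for idle capacity, though it shares the drift problem you describe: if the release schedule moves, the cron expressions have to move with it.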

I could add it at the beginning of a release, but then it feels like an extra layer with the same outcome: they would still have to wait for the nodes to warm during that stage of the release. I am trying to abstract away any delay they feel, since that is their point of concern. I welcome any and all feedback, though, in case I am misunderstanding.

@bwagner5
Contributor

It sounds like you're facing a longer pod startup time. I suspect the node actually launches and gets to a Ready state, at least within 1 minute. I'd be very interested if that was not the case for the instance types you are using.

Check out this issue where I'm tracking some node launch latency improvements awslabs/amazon-eks-ami#1099

@toddatapiture
Author

@bwagner5 I am going to spend some time monitoring our Pods/nodes. I thought when I looked in the past, it usually took only a few seconds for the Pods to start.

That's impressive with the EKS AMI updates - 31 seconds provisioning time 😍

@github-actions
Contributor

Labeled for closure due to inactivity in 10 days.

@jorotg

jorotg commented Aug 11, 2023

a hack some folks use is having a low priority deployment to keep warm nodes, and then gets evicted when those jenkins pods come in, if there is not enough capacity.

@FernandoMiguel Can you share how you actually achieved it? If you did, of course.

@FernandoMiguel
Contributor

I don't have one at hand; if you search this repo, you will find descriptions from other folks doing exactly that.
