Controller CPU Utilization #3752
-
I'm evaluating Argo Rollouts for my organization. If we go through with the full migration, it would likely amount to around 2,000 Rollout objects in a single cluster. In running some scaling tests, I found some concerns with CPU throttling on the Argo Rollouts controller. When running
I appreciate any context here. Thank you!
-
Hello,
The first 2 questions can only be answered by looking at the source code.
For the third one, I added some clarifications here: #3529
-
I just explained here that the recommendation for a short release duration wasn't about resources: #3753
-
The Prometheus code is here: https://github.com/argoproj/argo-rollouts/blob/master/metricproviders/prometheus/prometheus.go
But frankly, I think running Argo Rollouts with a profiler might be a better idea, as the bottleneck might be somewhere else and not in metrics.
Out of curiosity, do you really need 2000 Rollout objects in a single cluster? Are these 2000 unique applications that need progressive delivery, with your developers creating new versions all the time? Are they in the same namespace or in different namespaces? How many rollouts are actually under deployment at any given time?
-
Thank you for all the clarifications on question 3.
Great question. Our two largest clusters currently run about 1,200 deployments at any given time, with more teams expected to migrate their workloads. We are currently working through other scaling issues, which will involve using multiple smaller clusters; in the short term, however, we need more teams using our current clusters.
We run large multi-tenant clusters with unique applications per namespace. All of those applications need to migrate to some form of progressive delivery. For the most part, application teams aren't going to be constantly releasing new versions. We don't have hard numbers, but I would expect around 50 rollouts to be occurring at any given time. My goal in running the larger mass rollouts was to load test whether the controller and our Prometheus instances can withstand mass rollouts in any form, and what impact that has on application teams.
-
I added a future enhancement here: #3757. No solid ETA at the moment, though.
-
@johnmwood Also, feel free to reach out to me in CNCF Slack if you are interested in adding pprof support or anything like that. My username:
I am the author of that recommendation and it has nothing to do with resource constraints.