enable concurrency for pp and cpp #3511
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3511 +/- ##
=======================================
Coverage 52.64% 52.64%
=======================================
Files 213 213
Lines 19581 19583 +2
=======================================
+ Hits 10308 10310 +2
Misses 8721 8721
Partials 552 552
Flags with carried forward coverage won't be shown.
@zach593, thanks for your feedback, I think it's a reasonable update.
Signed-off-by: zach593 <zach_li@outlook.com>
/lgtm
Hi @zach593, can you help add a release note to describe the update?
/assign
I'll take a look tomorrow.
@@ -202,6 +206,8 @@ func (o *Options) AddFlags(flags *pflag.FlagSet, allControllers, disabledByDefau
flags.IntVar(&o.ConcurrentResourceBindingSyncs, "concurrent-resourcebinding-syncs", 5, "The number of ResourceBindings that are allowed to sync concurrently.")
flags.IntVar(&o.ConcurrentWorkSyncs, "concurrent-work-syncs", 5, "The number of Works that are allowed to sync concurrently.")
flags.IntVar(&o.ConcurrentNamespaceSyncs, "concurrent-namespace-syncs", 1, "The number of Namespaces that are allowed to sync concurrently.")
flags.IntVar(&o.ConcurrentPropagationPolicySyncs, "concurrent-propagation-policy-syncs", 1, "The number of PropagationPolicy that are allowed to sync concurrently.")
flags.IntVar(&o.ConcurrentClusterPropagationPolicySyncs, "concurrent-cluster-propagation-policy-syncs", 1, "The number of ClusterPropagationPolicy that are allowed to sync concurrently.")
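For context, here is a minimal, hypothetical sketch of how a concurrency option like this typically turns into reconciler parallelism in a controller-runtime style setup. It is only an analogy: Karmada's pp/cpp handling lives in the resource detector's own workers, and the helper name `setupPolicyController` and the import path for `policyv1alpha1` are assumptions for illustration.

```go
// Hypothetical sketch, not the code in this PR: shows how a configured
// concurrency value maps to parallel reconciles in controller-runtime.
package setup

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	policyv1alpha1 "github.com/karmada-io/karmada/pkg/apis/policy/v1alpha1"
)

// setupPolicyController wires a PropagationPolicy reconciler with the
// worker count that a flag such as --concurrent-propagation-policy-syncs
// would provide.
func setupPolicyController(mgr ctrl.Manager, r reconcile.Reconciler, concurrency int) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&policyv1alpha1.PropagationPolicy{}).
		WithOptions(controller.Options{
			// One reconcile goroutine per unit of concurrency.
			MaxConcurrentReconciles: concurrency,
		}).
		Complete(r)
}
```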
Do CPP and PP need to be configured separately? @XiShanYongYe-Chang
For users, the usage of cpp and pp may be different, so separate control may be a little more fine-grained.
flags.IntVar(&o.ConcurrentPropagationPolicySyncs, "concurrent-propagation-policy-syncs", 1, "The number of PropagationPolicy that are allowed to sync concurrently.")
If it's slow by default, I'd prefer to set the default value higher, like 3?
Before this pull request, the concurrency was hardcoded as "1". I believe it would be better to keep the default value the same as before the change, so that users who did not modify this flag do not need to consider this change. If you think it's necessary, we can increase this value.
I'd say it makes sense to add a flag to specify the concurrency. Before looking into the code, I'd like to ask some questions to better understand this issue.
Recently, we encountered a slow restart issue where it took 15 minutes to clear the items in queues.
I wonder which queue we are talking about here. And how do you know it is the queue that blocks the startup? Are there any logs or metrics?
How many resource templates are in your system?
It's good to add a concurrency configuration for pp and cpp.
Hi @zach593, you can get metrics from karmada-controller-manager to show that the work queue is heavily loaded with items:
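For illustration, here is a minimal sketch of pulling the workqueue metrics over plain HTTP. The endpoint address and the exact metric names (workqueue_depth, workqueue_adds_total) are assumptions; check how the metrics endpoint is exposed in your karmada-controller-manager deployment.

```go
// Illustrative only: dump workqueue-related metrics from the controller
// manager's Prometheus endpoint. Address and metric names are assumptions.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Assumed address; adjust to wherever your controller manager serves /metrics.
	resp, err := http.Get("http://127.0.0.1:8080/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		// Queue depth and add rate show how backed up each worker queue is.
		if strings.HasPrefix(line, "workqueue_depth") || strings.HasPrefix(line, "workqueue_adds_total") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
}
```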
Yes, I totally agree with allowing people to specify the concurrency. But I'm curious about the root cause of the issue:
We need to identify the root cause of the problem, develop solutions to fix it, and ultimately be able to verify it.
This is monitoring data on Grafana for the last time we restarted the karmada-controller-manager. You can see the pp reconciler took about 3-4 minutes to clear the queue items. The pp/cpp reconciler is the upstream reconciler of the resource detector, and the output (workload template) of the upstream reconciler (pp/cpp) is added to the downstream (resource detector) queue. Therefore, if the upstream reconciler works slowly, the downstream will need to reconcile some items twice. The same situation can also occur between detector -> rb/crb and rb/crb -> work, so the slow processing speed of the upstream amplifies the number of reconciles the downstream needs to do. Therefore, adding concurrency for pp/cpp is reasonable: it reduces the total number of items that karmada-controller-manager needs to reconcile when restarted.
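As a side note, here is a minimal, self-contained sketch of the pattern under discussion: several workers draining one rate-limited workqueue, where the worker count plays the role of the new --concurrent-propagation-policy-syncs value. It is illustrative only, not Karmada's actual detector code; the keys and timings are made up.

```go
// Illustrative only: N workers draining a single rate-limited workqueue.
// This mirrors what a "concurrent-*-syncs" value controls; it is not
// Karmada's actual detector code.
package main

import (
	"fmt"
	"sync"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

	// Enqueue some policy keys, as informer event handlers would on restart.
	for i := 0; i < 10; i++ {
		queue.Add(fmt.Sprintf("default/pp-%d", i))
	}
	// No further adds in this toy example, so workers drain the queue and exit.
	queue.ShutDown()

	concurrency := 3 // the knob --concurrent-propagation-policy-syncs would expose
	var wg sync.WaitGroup
	for w := 0; w < concurrency; w++ {
		wg.Add(1)
		go func(worker int) {
			defer wg.Done()
			for {
				key, shutdown := queue.Get()
				if shutdown {
					return
				}
				// Stand-in for the pp/cpp reconcile; with a single worker this
				// serial step is what keeps the downstream queues waiting.
				time.Sleep(100 * time.Millisecond)
				fmt.Printf("worker %d reconciled %v\n", worker, key)
				queue.Done(key)
			}
		}(w)
	}
	wg.Wait()
}
```

With these made-up numbers, one worker would take roughly a second to drain the ten items while three workers take about 400 ms; at the scale described above, that is the kind of difference the extra concurrency is meant to provide.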
Done @XiShanYongYe-Chang.
The Grafana graph is awesome! Have you tested this patch? Has the consumption speed of the item queue improved? It would be great if we had two graphs so we could see the difference in their effects.
How about updating it like this:
Yes, we have tested it, but the monitoring data is not convincing enough yet. It seems reconciles per minute have increased from 3-4k to 5.6k; because the CPU was already saturated, the effect of this patch may not have been fully realized.
Cool, done.
/approve
Many thanks to @zach593, we can go with the default concurrency (1). I appreciate all your effort on it. If you have any evidence that we need to increase the concurrency, please feel free to send a PR. Thanks in advance. :)
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: RainbowMango
The full list of commands accepted by this bot can be found here. The pull request process is described here.
What type of PR is this?
/kind feature
What this PR does / why we need it:
Recently, we encountered a slow restart issue where it took 15 minutes to clear the items in queues. We suspect that the number of workers in pp/cpp could be one of the reasons for this problem. Therefore, it would be reasonable to add a concurrency configuration for this.
Although the pp/cpp reconcile function appears to be capable of handling concurrency, the number of workers is hardcoded as 1. Please let me know if there is any particular reason for this.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?: