
feat: use rate limited queue #15480

Merged: 21 commits into argoproj:master on Oct 18, 2023

Conversation

gdsoumya
Member

@gdsoumya gdsoumya commented Sep 13, 2023

Fixes #15233

This PR adds a custom rate limiter that combines a BucketRateLimiter and an ItemExponentialRateLimiter, just like the previously used default limiter. However, the custom limiter's exponential backoff auto-resets for an item once that item's coolDown period is over. This is needed because the default exponential limiter requires a manual Forget() call to reset the backoff time, but the situation in issue #15233 is caused by external factors, so we cannot determine when Forget() should be called; instead, we delegate to the limiter to auto-forget once the coolDown period is reached. (A minimal sketch of this limiter follows the ENV list below.)

ENV to configure rate limits:

  1. WORKQUEUE_BUCKET_SIZE: default 500
  2. WORKQUEUE_BUCKET_QPS: default 50
  3. WORKQUEUE_FAILURE_COOLDOWN_NS: default 0s; if the duration is set to 0, per-item rate limiting is disabled (the default)
  4. WORKQUEUE_BASE_DELAY_NS: default 1000 (1μs)
  5. WORKQUEUE_MAX_DELAY_NS: default 3s (3 * 10^9 ns)
  6. WORKQUEUE_BACKOFF_FACTOR: default 1.5
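
A minimal Go sketch of the limiter described above, assuming the standard k8s.io/client-go/util/workqueue RateLimiter interface; the type and constructor names here are illustrative, not necessarily the ones merged in this PR:

```go
package ratelimiter

import (
	"math"
	"sync"
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

// itemFailure tracks per-item backoff state.
type itemFailure struct {
	failures    int
	lastFailure time.Time
}

// autoResetItemExponentialRateLimiter behaves like client-go's
// ItemExponentialFailureRateLimiter, except that an item's backoff is
// forgotten automatically once coolDown has elapsed since its last
// failure, instead of requiring an explicit Forget() call.
type autoResetItemExponentialRateLimiter struct {
	mu        sync.Mutex
	failures  map[interface{}]itemFailure
	baseDelay time.Duration
	maxDelay  time.Duration
	coolDown  time.Duration
	factor    float64
}

func (r *autoResetItemExponentialRateLimiter) When(item interface{}) time.Duration {
	r.mu.Lock()
	defer r.mu.Unlock()
	f := r.failures[item]
	// Auto-forget: if the cool-down window has passed, start over. With
	// coolDown=0 this resets every time, i.e. per-item limiting is
	// effectively disabled, matching WORKQUEUE_FAILURE_COOLDOWN_NS=0 above.
	if time.Since(f.lastFailure) > r.coolDown {
		f = itemFailure{}
	}
	delay := time.Duration(float64(r.baseDelay.Nanoseconds()) * math.Pow(r.factor, float64(f.failures)))
	if delay > r.maxDelay {
		delay = r.maxDelay
	}
	f.failures++
	f.lastFailure = time.Now()
	r.failures[item] = f
	return delay
}

func (r *autoResetItemExponentialRateLimiter) Forget(item interface{}) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.failures, item)
}

func (r *autoResetItemExponentialRateLimiter) NumRequeues(item interface{}) int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.failures[item].failures
}

// NewCustomRateLimiter combines the overall bucket limiter with the per-item
// auto-resetting exponential limiter, mirroring the ENV defaults listed above.
func NewCustomRateLimiter() workqueue.RateLimiter {
	return workqueue.NewMaxOfRateLimiter(
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(50), 500)},
		&autoResetItemExponentialRateLimiter{
			failures:  map[interface{}]itemFailure{},
			baseDelay: 1000 * time.Nanosecond, // WORKQUEUE_BASE_DELAY_NS
			maxDelay:  3 * time.Second,        // WORKQUEUE_MAX_DELAY_NS
			coolDown:  0,                      // WORKQUEUE_FAILURE_COOLDOWN_NS
			factor:    1.5,                    // WORKQUEUE_BACKOFF_FACTOR
		},
	)
}
```

The only behavioral difference from client-go's stock ItemExponentialFailureRateLimiter is the cool-down check inside When(), which takes the place of the manual Forget() call described above.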

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this does not need to be in the release notes.
  • The title of the PR states what changed and the related issues number (used for the release note).
  • The title of the PR conforms to the Toolchain Guide
  • I've included "Closes [ISSUE #]" or "Fixes [ISSUE #]" in the description to automatically close the associated issue.
  • I've updated both the CLI and UI to expose my feature, or I plan to submit a second PR with them.
  • Does this PR require documentation updates?
  • I've updated documentation as required by this PR.
  • Optional. My organization is added to USERS.md.
  • I have signed off all my commits as required by DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My build is green (troubleshooting builds).
  • My new feature complies with the feature status guidelines.
  • I have added a brief description of why this PR is necessary and/or what this PR solves.

Signed-off-by: Soumya Ghosh Dastidar <gdsoumya@gmail.com>
@codecov

codecov bot commented Sep 13, 2023

Codecov Report

Attention: 22 lines in your changes are missing coverage. Please review.

Comparison is base (c9aa373) 49.68% compared to head (4403308) 49.67%.
Report is 1 commit behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #15480      +/-   ##
==========================================
- Coverage   49.68%   49.67%   -0.02%     
==========================================
  Files         267      267              
  Lines       46362    46387      +25     
==========================================
+ Hits        23036    23043       +7     
- Misses      21065    21084      +19     
+ Partials     2261     2260       -1     
Files                          Coverage Δ
controller/appcontroller.go    54.18% <82.35%> (+0.04%) ⬆️
util/env/env.go                80.99% <0.00%> (-15.09%) ⬇️

... and 2 files with indirect coverage changes


@gdsoumya gdsoumya marked this pull request as draft September 13, 2023 12:24
pkg/ratelimiter/ratelimiter.go (2 fixed code-scanning annotations)
Signed-off-by: Soumya Ghosh Dastidar <gdsoumya@gmail.com>
@gdsoumya gdsoumya marked this pull request as ready for review September 15, 2023 05:00
Signed-off-by: Soumya Ghosh Dastidar <gdsoumya@gmail.com>
test/e2e/cluster_test.go (outdated review comment, resolved)
@alexmt
Collaborator

alexmt commented Sep 18, 2023

UI has a notable difference:

Without rate limiting:

Screen.Recording.2023-09-18.at.8.25.11.AM.mov

With rate limiting:

Screen.Recording.2023-09-18.at.8.26.19.AM.mov

Can you please check if we can tweak rate limiting params to avoid it?

@gdsoumya
Member Author

@alexmt we can change WORKQUEUE_MAX_DELAY_NS to 10^9 (1 second), or even lower it down to milliseconds if we want.

@jessesuen
Member

jessesuen commented Sep 18, 2023

In addition to reducing the max delay, I think the backoff can also be less aggressive, so that it only throttles apps with continuous, sustained reconciles, as opposed to apps with sudden spikes in activity.

@gdsoumya
Member Author

gdsoumya commented Sep 18, 2023

@jessesuen I don't think that can be configured directly. One field we can tune for similar results is WORKQUEUE_BASE_DELAY_NS: the formula uses float64(r.baseDelay.Nanoseconds()) * math.Pow(2, float64(exp.failures)), so with a smaller base delay a small spike in syncs will not be throttled too much. But even with 1ns as the base, we will hit the max delay (1s) in about 30 quick re-queues.
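
A quick way to sanity-check that figure (a standalone sketch, assuming the formula quoted above with a 1ns base delay and the proposed 1s max delay):

```go
package main

import (
	"fmt"
	"math"
	"time"
)

func main() {
	baseDelay := time.Nanosecond
	maxDelay := time.Second
	for failures := 0; ; failures++ {
		// The per-item delay doubles on every failed/re-queued reconcile.
		delay := time.Duration(float64(baseDelay.Nanoseconds()) * math.Pow(2, float64(failures)))
		if delay >= maxDelay {
			// 2^30 ns ≈ 1.07s, so the cap is hit on the 30th re-queue.
			fmt.Printf("max delay %v reached after %d failures\n", maxDelay, failures)
			return
		}
	}
}
```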

gdsoumya and others added 3 commits September 18, 2023 22:54
Signed-off-by: Soumya Ghosh Dastidar <gdsoumya@gmail.com>
Signed-off-by: Soumya Ghosh Dastidar <gdsoumya@gmail.com>
@gdsoumya
Member Author

@jannfis your concerns are very valid, but one thing to note: the bucket limiter is there by default and is big enough (default bucket size 500) to handle almost all existing Argo CD users without changing their experience. The per-item limiter (the exponential one) is disabled by default and can be enabled per user requirements.

@gdsoumya gdsoumya requested a review from a team as a code owner September 25, 2023 17:06
Signed-off-by: Soumya Ghosh Dastidar <gdsoumya@gmail.com>
gdsoumya and others added 2 commits September 26, 2023 10:15
Signed-off-by: Soumya Ghosh Dastidar <gdsoumya@gmail.com>
@gdsoumya gdsoumya requested a review from a team as a code owner September 26, 2023 05:23
Makefile (outdated review comment, resolved)
Member

@jessesuen jessesuen left a comment

LGTM!

To summarize this PR: what is being introduced are new controller tuning options for a per-item rate-limited workqueue. This is disabled by default (WORKQUEUE_FAILURE_COOLDOWN_NS=0) and, unless enabled, behaves exactly the same as before. But at least now we have the ability to protect the controller when we hit situations like the one in #15233, where a misbehaving third-party controller could negatively impact the argo-cd application controller through unthrottled reconciliations.

Now that we will have this in place, the way I see this moving forward is for people to experiment with values in real-world setups in order to come up with a good default. Once we are comfortable with that number, we will turn this on by default in future releases with proven values.

@jannfis are your concerns met?

Comment on lines +72 to +77
--wq-backoff-factor float Set Workqueue Per Item Rate Limiter Backoff Factor, default is 1.5 (default 1.5)
--wq-basedelay-ns duration Set Workqueue Per Item Rate Limiter Base Delay duration in nanoseconds, default 1000000 (1ms) (default 1ms)
--wq-bucket-qps int Set Workqueue Rate Limiter Bucket QPS, default 50 (default 50)
--wq-bucket-size int Set Workqueue Rate Limiter Bucket Size, default 500 (default 500)
--wq-cooldown-ns duration Set Workqueue Per Item Rate Limiter Cooldown duration in ns, default 0(per item rate limiter disabled)
--wq-maxdelay-ns duration Set Workqueue Per Item Rate Limiter Max Delay duration in nanoseconds, default 1000000000 (1s) (default 1s)
Member

imo we could really use a paragraph or two in docs explaining these options and how to experiment with them. If we'd like to solicit feedback from the community, we should make it super easy for them to understand how to tune these values so they can report back what they found to be effective.

Member Author

Where do you think would be a good place to put the doc? A separate file under the operator manual, or in some existing file?

Member

@jessesuen jessesuen Oct 11, 2023

I think the best place would be along where we document other controller tuning variables:

### argocd-application-controller

I know the file is called "high availability," but there are already performance-oriented tuning variables documented and so it wouldn't be totally out of place.

Member Author

Makes sense, will add the docs for rate limiting there.

Member Author

Updated the PR to include the docs, PTAL.
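
To complement those docs, here is a hedged sketch of the call-site wiring the flags above feed into. NewRateLimitingQueue and AddRateLimited are standard client-go APIs; newAppRefreshQueue and appKey are illustrative names, and NewCustomRateLimiter refers to the sketch earlier in this thread:

```go
package ratelimiter

import "k8s.io/client-go/util/workqueue"

// newAppRefreshQueue shows the standard client-go wiring: the queue consults
// the RateLimiter on every AddRateLimited call to decide an item's delay.
func newAppRefreshQueue() workqueue.RateLimitingInterface {
	return workqueue.NewRateLimitingQueue(NewCustomRateLimiter())
}
```

At the call sites the controller then enqueues apps with queue.AddRateLimited(appKey) rather than queue.Add(appKey), which is the change the linked issue asks for; with the per-item limiter disabled (cool-down 0), this degrades to the bucket limiter alone.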

Member

@jannfis jannfis left a comment

Sorry for coming back to this so late. The change overall looks good to me.

@jannfis
Member

jannfis commented Oct 17, 2023

@jessesuen Thanks, that makes sense!

@alexmt alexmt merged commit a9f03aa into argoproj:master Oct 18, 2023
25 checks passed
ymktmk pushed a commit to ymktmk/argo-cd that referenced this pull request Oct 29, 2023
* feat: use rate limited queue

Signed-off-by: Soumya Ghosh Dastidar <gdsoumya@gmail.com>
jmilic1 pushed a commit to jmilic1/argo-cd that referenced this pull request Nov 13, 2023
* feat: use rate limited queue

Signed-off-by: Soumya Ghosh Dastidar <gdsoumya@gmail.com>
Signed-off-by: jmilic1 <70441727+jmilic1@users.noreply.github.com>
vladfr pushed a commit to vladfr/argo-cd that referenced this pull request Dec 13, 2023
* feat: use rate limited queue

Signed-off-by: Soumya Ghosh Dastidar <gdsoumya@gmail.com>
tesla59 pushed a commit to tesla59/argo-cd that referenced this pull request Dec 16, 2023
* feat: use rate limited queue

Signed-off-by: Soumya Ghosh Dastidar <gdsoumya@gmail.com>
Hariharasuthan99 pushed a commit to AmadeusITGroup/argo-cd that referenced this pull request Jun 16, 2024
* feat: use rate limited queue

Signed-off-by: Soumya Ghosh Dastidar <gdsoumya@gmail.com>
Successfully merging this pull request may close these issues:

Argo CD should enqueue apps with AddRateLimited instead of Add