ENHANCE_YOUR_CALM error for plugins that take >45 seconds #15656

Closed
BhavikaSharma opened this issue Sep 25, 2023 · 9 comments · Fixed by #15806
Labels
bug Something isn't working

Comments

@BhavikaSharma
Contributor

Checklist:

  • [x] I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • [x] I've included steps to reproduce the bug.
  • [x] I've pasted the output of argocd version.

Describe the bug

When creating an application that uses a custom config management plugin (CMP) that runs for more than ~45 seconds, the user gets an ENHANCE_YOUR_CALM error.

This happens on occasion for plugins > 45s and consistently for plugins > 80s.

To Reproduce

Create a CMP that runs for a long time (e.g. run a script that includes a sleep 90).

Run:

argocd app create test-app --file create.json --grpc-web

where create.json is the definition of the app that uses the long-running CMP.
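
For reproduction, the long-running step can be as simple as a script that waits before printing a manifest. The following is a minimal sketch, assuming the plugin's generate command is set up to call a Python script; the script name `slow_generate.py`, the function names, and the ConfigMap contents are hypothetical and not taken from the original report.

```python
#!/usr/bin/env python3
"""Hypothetical slow generate step for reproducing the timeout."""
import sys
import time

# Placeholder manifest; any valid Kubernetes YAML would do here.
MANIFEST = """\
apiVersion: v1
kind: ConfigMap
metadata:
  name: slow-plugin-output
data:
  note: generated after a deliberate delay
"""


def generate(delay_seconds, sleep=time.sleep, out=sys.stdout):
    """Wait for delay_seconds, then write the manifest to out."""
    sleep(delay_seconds)
    out.write(MANIFEST)
    return MANIFEST


def main(argv):
    """Read the delay from the first argument; return a process-style status code."""
    try:
        delay = float(argv[0])
    except (IndexError, ValueError):
        print("usage: slow_generate.py DELAY_SECONDS", file=sys.stderr)
        return 2
    generate(delay)
    return 0


if __name__ == "__main__":
    main(sys.argv[1:])
```

Running it as `python3 slow_generate.py 90` would mimic the `sleep 90` case described above; a smaller value such as 50 might be useful for probing the intermittent range.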

Expected behavior

App is created with no error.

Screenshots

N/A

Version

argocd: v2.8.4+c279299.dirty
  BuildDate: 2023-09-13T22:00:14Z
  GitCommit: c27929928104dc37b937764baf65f38b78930e59
  GitTreeState: dirty
  GoVersion: go1.21.1
  Compiler: gc
  Platform: darwin/amd64

Logs

2023/09/23 16:45:34 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
FATA[0080] rpc error: code = Unavailable desc = closing transport due to: connection error: desc = "error reading from server: EOF", received prior goaway: code: ENHANCE_YOUR_CALM, debug data: "too_many_pings"
@msuthar-splunk
Contributor

msuthar-splunk commented Sep 27, 2023

This is a feature conflict. The CMP timeout is configurable with a higher default value (60s). The argocd app create/patch commands run validation which will run CMP before applying the change.

The GRPCKeepAliveEnforcementMinimum for argocd CLI isn't configurable and it is lower than the CMP timeout.

The CLI commands fail intermittently when the CMP takes more than 30 seconds.
The CLI commands always fail when the CMP takes more than 30 seconds.

The GRPCKeepAliveEnforcementMinimum value here should be configurable as a CLI parameter and/or an environment variable.

Happy to make a contribution to this if the community agrees on the use-case and the solution.

@crenshaw-dev
Member

That makes sense to me!

@terrytangyuan
Member

Making it an env variable sounds good! 👍🏻

@renperez-cpi

@crenshaw-dev is there any update on this?

@phanama
Contributor

phanama commented Oct 12, 2023

> The CMP timeout is configurable with a higher default value (60s).
> GRPCKeepAliveEnforcementMinimum for argocd CLI isn't configurable and it is lower than the CMP timeout.

I think this is not related to GRPCKeepAliveEnforcementMinimum being lower than the CMP timeout. The grpc keepalive settings govern how clients send pings and how servers expect clients to send their pings in a grpc connection (ref).
As long as the client-side grpc keepalive setting is in agreement with the server-side settings (i.e. the client's ping interval shouldn't be lower than the server's ping receive interval ref), the connection should be fine and clients shouldn't get the http2 GOAWAY error.

This seems to be related to #15707 instead. The CLI is also using the same apiclient package, likely suffering from the same issue.
In argocd, the grpc keepalive setting has been hardcoded into client-server agreement since #9922. However, the server setting is still missing from the local grpc-proxy if we use GrpcWeb=true. This means clients use the hardcoded setting 20s, while the local grpcweb's proxy server uses the package default setting of 5m, making clients' ping interval way lower than the server's which potentially results in the GOAWAY error.
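
To make the mismatch concrete, here is a rough Python model of server-side keepalive enforcement. It is a sketch under stated assumptions, not the actual grpc-go code: the class and function names are hypothetical, the strike limit of 2 is an assumption chosen for illustration, and the model ignores details such as strikes being reset when the server sends data and clients only pinging on an otherwise idle connection.

```python
from dataclasses import dataclass

# Assumption: the server tolerates this many early pings before sending GOAWAY.
MAX_PING_STRIKES = 2


@dataclass
class KeepaliveEnforcer:
    """Simplified server-side check of how often a client may ping."""

    min_interval: float        # shortest allowed gap between pings, in seconds
    last_ping_at: float = 0.0  # connection start is the first reference point
    strikes: int = 0

    def on_ping(self, now: float) -> bool:
        """Record a ping at time `now`; return True if GOAWAY would be sent."""
        if now - self.last_ping_at < self.min_interval:
            self.strikes += 1
        self.last_ping_at = now
        return self.strikes > MAX_PING_STRIKES


def seconds_until_goaway(client_interval, server_min, horizon=600.0):
    """Return when GOAWAY would be sent, or None if none occurs within `horizon`."""
    enforcer = KeepaliveEnforcer(min_interval=server_min)
    now = client_interval
    while now <= horizon:
        if enforcer.on_ping(now):
            return now
        now += client_interval
    return None


# Client pings every 20s, server expects at least 5m between pings (mismatch).
print(seconds_until_goaway(client_interval=20, server_min=300))
# Client pings every 20s, server accepts pings as often as every 10s (agreement).
print(seconds_until_goaway(client_interval=20, server_min=10))
```

In this model the mismatched pair trips the limit on the third early ping, while the agreeing pair never accumulates a strike.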

...

> GRPCKeepAliveEnforcementMinimum value here should be configurable as a CLI parameter and/or an environment variable

In theory, making it configurable may prolong the time for the GOAWAY error to be thrown (for example using 30s translates to 30s * 4 = 120s), thus alleviating the problem. However, it might not directly address the root cause. 🤔

Making it configurable does offer more options, for example reducing grpc keepalive pings to the ArgoCD server if the pings start to impair server performance (which I think is unlikely, as ArgoCD is not serving millions of mobile clients). However, it may actually make the system more complex for users, as they need to correctly coordinate the values between the client (including external 3rd party API clients) and the server (and the work behind the server, like cmp plugins) according to the grpc keepalive spec.

I think keeping it hardcoded provides the simpler approach.

ishitasequeira added a commit that referenced this issue Oct 18, 2023
)

* Add ENV variables to configure GRPC Keep Alive Time

Signed-off-by: Bhavika Sharma <bsharma@splunk.com>

* Retrigger CI pipeline

Signed-off-by: Bhavika Sharma <bsharma@splunk.com>

* Resolve conflict with master

Signed-off-by: Bhavika Sharma <bsharma@splunk.com>

* Update docs/user-guide/environment-variables.md

Co-authored-by: Ishita Sequeira <46771830+ishitasequeira@users.noreply.github.com>
Signed-off-by: BhavikaSharma <BhavikaSharma@users.noreply.github.com>

---------

Signed-off-by: Bhavika Sharma <bsharma@splunk.com>
Signed-off-by: BhavikaSharma <BhavikaSharma@users.noreply.github.com>
Co-authored-by: Ishita Sequeira <46771830+ishitasequeira@users.noreply.github.com>
@robertLichtnow

For anyone lurking around this issue, like I was: upgrading the server to anything over 2.9.0 and the client to anything over 2.10.0 (even an RC works here) and setting the ARGOCD_GRPC_KEEP_ALIVE_MIN environment variable to the same value on both ends is what fixed it.
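
Since the fix depends on both ends using the same value, a small pre-flight check can catch a mismatch before running the CLI. This is only a sketch: it assumes the variable holds a Go-style duration string such as "30s", the helper names are hypothetical, and the expected server value has to be supplied by hand (for example from your deployment manifest) because nothing here queries the server.

```python
import os
import re

# Seconds per unit for the subset of Go-style duration units handled here.
_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)(ms|s|m|h)")


def parse_duration(text):
    """Parse a duration such as "30s" or "1m30s" into seconds."""
    pos, total = 0, 0.0
    for match in _TOKEN.finditer(text):
        if match.start() != pos:
            raise ValueError(f"invalid duration: {text!r}")
        total += float(match.group(1)) * _UNITS[match.group(2)]
        pos = match.end()
    if pos == 0 or pos != len(text):
        raise ValueError(f"invalid duration: {text!r}")
    return total


def keepalive_settings_agree(client_value, server_value):
    """Return True when both duration strings describe the same length of time."""
    return parse_duration(client_value) == parse_duration(server_value)


if __name__ == "__main__":
    expected = "30s"  # hypothetical: whatever the server pods are configured with
    local = os.environ.get("ARGOCD_GRPC_KEEP_ALIVE_MIN")
    if local is None:
        print("ARGOCD_GRPC_KEEP_ALIVE_MIN is not set in this shell")
    elif keepalive_settings_agree(local, expected):
        print("client and server keepalive values match")
    else:
        print(f"mismatch: client={local} server={expected}")
```

Comparing parsed durations rather than raw strings means that equivalent spellings of the same duration are treated as matching.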

@audrey-mux

> For anyone lurking around this issue, like I was: upgrading the server to anything over 2.9.0 and the client to anything over 2.10.0 (even an RC works here) and setting the ARGOCD_GRPC_KEEP_ALIVE_MIN environment variable to the same value on both ends is what fixed it.

You're setting that var on the argocd-server and argocd-repo-server pods?

@audrey-mux

meh, this still doesn't work with client 2.10.rc and a 2.9.5 server. Command times out at 62 seconds each time.

@ArieLevs

ArieLevs commented Feb 17, 2024

@audrey-mux I had exactly the same issue: API calls to the Argo CD server failed with this issue's error (too_many_pings) after exactly 60 seconds.
After trying the solution proposed by @robertLichtnow, setting the server env with

- name: ARGOCD_GRPC_KEEP_ALIVE_MIN
  value: "30s"

and running the local terminal session with ARGOCD_GRPC_KEEP_ALIVE_MIN="30s" solved it completely. Try checking the server pod logs, as you might be getting a different error.
I'm still not sure why this step is even needed rather than being preconfigured, as it just adds extra complexity.

server: v2.8.9
client: v2.10.1

  • Update
    It seems that the too_many_pings error might be a generic response when the server cannot complete whatever it was asked to do within its configured timeout.
    While the client logs showed received prior goaway: code: ENHANCE_YOUR_CALM, debug data: "too_many_pings", the server side was actually returning a 403 error while I was trying to perform a cluster add operation. After setting the server and client to the same keepalive value, the client at least got a "normal" response: POST https://argocd.server.addr:443/cluster.ClusterService/Update failed with status code 504 (which is the 403 error on the server side).

hope this helps somehow
