Develop mechanism to overcome 429 "hits the concurrent operations quota" #15579

Closed
tikolsky opened this issue Aug 21, 2023 · 11 comments · Fixed by GoogleCloudPlatform/magic-modules#8828, hashicorp/terraform-provider-google-beta#6254 or #15820

Comments

@tikolsky

tikolsky commented Aug 21, 2023

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment. If the issue is assigned to the "modular-magician" user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If the issue is assigned to a user, that user is claiming responsibility for the issue. If the issue is assigned to "hashibot", a community member has claimed the issue already.

Description

While using Terraform ("~> 1.4.6") and the Google Terraform provider (v4.78.0), I hit the following error many times:
│ Error: error creating NodePool: googleapi: Error 429: Too many operations: cluster "clustername" hits the concurrent operations quota 5, please try again later.
│ Details:
│ [
│   {
│     "@type": "type.googleapis.com/google.rpc.RequestInfo",
│     "requestId": "*********"
│   }
│ ]
│ , rateLimitExceeded
│
│   with google_container_node_pool.workload-node-pool[2],
│   on main.tf line 122, in resource "google_container_node_pool" "workload-node-pool":
│  122: resource "google_container_node_pool" "workload-node-pool" {
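
The node pools come from a count-based resource along the lines of the sketch below. This is a minimal reconstruction from the error above and the audit log below, so the count, naming scheme, and cluster/location references are hypothetical:

resource "google_container_node_pool" "workload-node-pool" {
  # Several node pools on the same cluster are created in one apply, so more
  # than 5 node-pool operations can hit the cluster at once.
  count    = 6                                   # hypothetical; enough to exceed the quota of 5
  name     = "workload-pool-${count.index}"      # hypothetical naming scheme
  cluster  = google_container_cluster.primary.name  # hypothetical cluster reference
  location = var.location                        # hypothetical variable

  initial_node_count = 1

  node_config {
    machine_type = "n2-standard-32"
    disk_type    = "pd-ssd"
    disk_size_gb = 100
    image_type   = "UBUNTU_CONTAINERD"
    oauth_scopes = ["https://www.googleapis.com/auth/devstorage.read_only"]
  }
}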

I can see this error a lot in my Audit logs:
{ "protoPayload": { "@type": "type.googleapis.com/google.cloud.audit.AuditLog", "status": { "code": 8, "message": "Too many operations: cluster \"****\" hits the concurrent operations quota 5, please try again later." }, "authenticationInfo": { "principalEmail": "****", "serviceAccountKeyName": "****", "principalSubject": "****" }, "requestMetadata": { "callerIp": "****", "callerSuppliedUserAgent": "google-api-go-client/0.5 Terraform/1.3.7 (+https://www.terraform.io) Terraform-Plugin-SDK/2.10.1 terraform-provider-google/4.78.0,gzip(gfe)", "requestAttributes": { "time": "2023-08-20T16:49:31.313550903Z", "auth": {} }, "destinationAttributes": {} }, "serviceName": "container.googleapis.com", "methodName": "google.container.v1.ClusterManager.CreateNodePool", "authorizationInfo": [ { "permission": "container.clusters.update", "granted": true, "resourceAttributes": {} } ], "resourceName": "projects/****/zones/****/clusters/****/nodePools/****", "request": { "parent": "projects/****/locations/****/clusters/****", "@type": "type.googleapis.com/google.container.v1alpha1.CreateNodePoolRequest", "nodePool": { "name": "****", "config": { "imageType": "UBUNTU_CONTAINERD", "labels": { "aquanode": "****" }, "diskType": "pd-ssd", "loggingConfig": {}, "machineType": "n2-standard-32", "diskSizeGb": 100, "oauthScopes": [ "https://www.googleapis.com/auth/devstorage.read_only" ] }, "initialNodeCount": 1, "networkConfig": {} } }, "response": { "@type": "type.googleapis.com/google.container.v1alpha1.Operation" }, "resourceLocation": { "currentLocations": [ "****" ] }, "policyViolationInfo": { "orgPolicyViolationInfo": {} } }, "insertId": "****", "resource": { "type": "gke_nodepool", "labels": { "location": "****", "cluster_name": "****", "nodepool_name": "****", "project_id": "****" } }, "timestamp": "2023-08-20T16:49:31.923885772Z", "severity": "ERROR", "logName": "projects/****logs/cloudaudit.googleapis.com%2Factivity", "receiveTimestamp": "2023-08-20T16:49:32.282783167Z" }

New or Affected Resource(s)

  • google_container_cluster
  • google_container_node_pool

References

(#15205)

b/298050622

@github-actions github-actions bot added forward/review In review; remove label to forward service/container labels Aug 21, 2023
@dragon4eg

dragon4eg commented Aug 23, 2023

+1. As the error says, there is a limit of 5 concurrent operations (and it turned out the first 5 changes did go through, at least in my case). Applying several times worked for me... this time.
The issue really is new, though; I hadn't observed this behaviour before (I've been using the module for a few months now and have definitely made more than 5 changes at once).
│ Error: googleapi: Error 429: Too many operations: cluster "my-cluster" hits the concurrent operations quota 5, please try again later.
│ , rateLimitExceeded
│
│   with module.gke.google_container_node_pool.pools["my-pool"],
│   on .terraform/modules/gke/modules/private-cluster/cluster.tf line 383, in resource "google_container_node_pool" "pools":
│  383: resource "google_container_node_pool" "pools" {
╵

@mzylowski

+1
Same issue on my side. It just started appearing, and re-applying after a failure helps. Does anyone have an idea how to mitigate this so that terraform apply completes in one run like before? Is there any quota value that can be raised from the GCP console?

@ScottSuarez
Collaborator

Note for the service team:

We might need to introduce some sort of mutex or batching if we can only have 5 concurrent operations.

@ScottSuarez ScottSuarez added bug and removed enhancement forward/review In review; remove label to forward labels Aug 25, 2023
@KatrinaHoffert

KatrinaHoffert commented Aug 28, 2023

Hi, replying here from the team behind this change. The retrying of operations is intentional, as we've recently made improvements such that most node pool operations can now be performed concurrently. But rather than try to rate limit client side or duplicate the complicated logic for determining which operations can be concurrent with other operations, we opted to have Terraform simply retry the creation over and over until it eventually succeeds. In other words, it's like there is a mutex but it's server side.

This isn't supposed to cause overall TF apply failures, though. I believe what's happening is that it's hitting the timeout for overall creation, which defaults to 30 minutes but can be overridden in your config with a standard field (https://developer.hashicorp.com/terraform/language/resources/syntax#operation-timeouts). That said, there is a bug for us to fix here (or at least an unintentional change for us to decide how to handle). Previously (with the one-operation-per-cluster limit), the creation deadline only started once the client-side mutex had been acquired; now it starts as soon as the request is first made, which means some of the deadline is consumed by this period of waiting for other operations, and that was unintentional. Until we can address this, I suggest increasing timeouts.create (etc.) so that this won't time out.
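
For example, raising the deadlines on the node pool resource from the original report would look roughly like this (a minimal sketch; the 60m values are only illustrative):

resource "google_container_node_pool" "workload-node-pool" {
  # ... existing node pool configuration ...

  # Give each operation more headroom, since time spent queued behind the
  # server-side concurrent-operations quota now counts against the deadline.
  timeouts {
    create = "60m"
    update = "60m"
    delete = "60m"
  }
}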

In some cases, this error may be a red herring caused by a different resource failing. For example, with the current quota of 5 operations per cluster, you could create 10 node pools: only 5 would actually start creating, and the other 5 would be blocked waiting for the first 5 to finish. If those first 5 take too long, the pending 5 could time out first, so the error would mention the pending 5 even though the 5 that were actually being created (but taking too long) are the real problem. Sufficiently large node pools failing to become healthy, or stockouts, could cause this. Bumping the timeout would also help in this case.

For most users, Terraform apply shouldn't fail. If you simply see these HTTP 429 errors in the logs but everything overall succeeds, that's working as intended. Since the operations can now be done concurrently instead of having to be sequential (the previous status quo), the overall time to completion should be lower.

We also intend to increase the default concurrency quota soon (which is one reason we didn't limit this client side), with a September ETA. That should significantly help with this and further improve time to completion.

As an aside, the standard Terraform --parallelism flag may also be helpful (it defaults to 10; setting it to 5 is likely an alternative workaround for this issue). That flag applies across all providers, so it is unfortunately broader than necessary (that is, it also limits reconciliation of resources other than GKE node pools).

@ScottSuarez
Collaborator

@KatrinaHoffert, for what it's worth, we can increase the default timeout on these resources if you think that is a good interim fix, but I'm not sure what a good value would be. Could someone from your team help make this change?

@KatrinaHoffert

KatrinaHoffert commented Aug 28, 2023

Actually, while the above is an issue when there are operation conflicts, after getting more details about an instance of this I've realized the more immediate problem is that the error checking in https://github.com/hashicorp/terraform-provider-google/blob/main/google/services/container/resource_container_node_pool.go#L520 handles only the case of operation conflicts (which surface as "failed precondition" errors). The quota errors, on the other hand, are "resource exhausted" (HTTP 429), so they are not being retried when they are meant to be.

I have someone on my team looking into this and have escalated our internal ticket. The --parallelism 5 (or less if you want to account for the possibility of ops that aren't TF initiated) workaround is expected to work, but the timeout workaround I previously mentioned won't work for this case.

@ShaiShalevSQream

ShaiShalevSQream commented Sep 3, 2023

We have a quota of 1. How can we increase it, so that we don't have to run Terraform with parallelism=1?

For example:
deleting NodePool: googleapi: Error 429: Too many operations: cluster "cluster-name" hits the concurrent operations quota 1, please try again later.

@matthew-fawcett

I have someone on my team looking into this and have escalated our internal ticket. The --parallelism 5 (or less if you want to account for the possibility of ops that aren't TF initiated) workaround is expected to work, but the timeout workaround I previously mentioned won't work for this case.

5 doesn't work; only 1 does, which slows deployments down to an unacceptable level.

Is there any movement on the linked PR?

@KatrinaHoffert

We have a quota of 1

My apologies, that shouldn't have happened. A bad flag rollout that was intended to increase the quota resulted in a temporary period of falling back to an unintended value, which unfortunately makes this issue worse. We'll have a fix for that rolling out shortly; it is expected to finish before EOD. This is likely also why --parallelism 1 was reported as necessary (it shouldn't be needed once the fix is out).

@slimatic

Hi @KatrinaHoffert, has the fix for this issue been rolled out yet?

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 13, 2023