Develop mechanism to overcome 429 "hits the concurrent operations quota" #15579

Closed
tikolsky opened this issue Aug 21, 2023 · 11 comments · Fixed by GoogleCloudPlatform/magic-modules#8828, hashicorp/terraform-provider-google-beta#6254 or #15820

Comments

@tikolsky

tikolsky commented Aug 21, 2023

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment. If the issue is assigned to the "modular-magician" user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If the issue is assigned to a user, that user is claiming responsibility for the issue. If the issue is assigned to "hashibot", a community member has claimed the issue already.

Description

While using Terraform ("~> 1.4.6") and the Google Terraform provider (v4.78.0), I hit the following error many times:
│ Error: error creating NodePool: googleapi: Error 429: Too many operations: cluster "clustername" hits the concurrent operations quota 5, please try again later.
│ Details:
│ [
│   {
│     "@type": "type.googleapis.com/google.rpc.RequestInfo",
│     "requestId": "*********"
│   }
│ ]
│ , rateLimitExceeded
│
│   with google_container_node_pool.workload-node-pool[2],
│   on main.tf line 122, in resource "google_container_node_pool" "workload-node-pool":
│  122: resource "google_container_node_pool" "workload-node-pool" {
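
The node pools come from a count-based resource along the lines of the sketch below. This is a minimal reconstruction from the error above and the audit log below, so the count, naming scheme, and cluster/location references are hypothetical:

resource "google_container_node_pool" "workload-node-pool" {
  # Several node pools on the same cluster are created in one apply, so more
  # than 5 node-pool operations can hit the cluster at once.
  count    = 6                                   # hypothetical; enough to exceed the quota of 5
  name     = "workload-pool-${count.index}"      # hypothetical naming scheme
  cluster  = google_container_cluster.primary.name  # hypothetical cluster reference
  location = var.location                        # hypothetical variable

  initial_node_count = 1

  node_config {
    machine_type = "n2-standard-32"
    disk_type    = "pd-ssd"
    disk_size_gb = 100
    image_type   = "UBUNTU_CONTAINERD"
    oauth_scopes = ["https://www.googleapis.com/auth/devstorage.read_only"]
  }
}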

I can see this error a lot in my Audit logs:
{ "protoPayload": { "@type": "type.googleapis.com/google.cloud.audit.AuditLog", "status": { "code": 8, "message": "Too many operations: cluster \"****\" hits the concurrent operations quota 5, please try again later." }, "authenticationInfo": { "principalEmail": "****", "serviceAccountKeyName": "****", "principalSubject": "****" }, "requestMetadata": { "callerIp": "****", "callerSuppliedUserAgent": "google-api-go-client/0.5 Terraform/1.3.7 (+https://www.terraform.io) Terraform-Plugin-SDK/2.10.1 terraform-provider-google/4.78.0,gzip(gfe)", "requestAttributes": { "time": "2023-08-20T16:49:31.313550903Z", "auth": {} }, "destinationAttributes": {} }, "serviceName": "container.googleapis.com", "methodName": "google.container.v1.ClusterManager.CreateNodePool", "authorizationInfo": [ { "permission": "container.clusters.update", "granted": true, "resourceAttributes": {} } ], "resourceName": "projects/****/zones/****/clusters/****/nodePools/****", "request": { "parent": "projects/****/locations/****/clusters/****", "@type": "type.googleapis.com/google.container.v1alpha1.CreateNodePoolRequest", "nodePool": { "name": "****", "config": { "imageType": "UBUNTU_CONTAINERD", "labels": { "aquanode": "****" }, "diskType": "pd-ssd", "loggingConfig": {}, "machineType": "n2-standard-32", "diskSizeGb": 100, "oauthScopes": [ "https://www.googleapis.com/auth/devstorage.read_only" ] }, "initialNodeCount": 1, "networkConfig": {} } }, "response": { "@type": "type.googleapis.com/google.container.v1alpha1.Operation" }, "resourceLocation": { "currentLocations": [ "****" ] }, "policyViolationInfo": { "orgPolicyViolationInfo": {} } }, "insertId": "****", "resource": { "type": "gke_nodepool", "labels": { "location": "****", "cluster_name": "****", "nodepool_name": "****", "project_id": "****" } }, "timestamp": "2023-08-20T16:49:31.923885772Z", "severity": "ERROR", "logName": "projects/****logs/cloudaudit.googleapis.com%2Factivity", "receiveTimestamp": "2023-08-20T16:49:32.282783167Z" }

New or Affected Resource(s)

  • google_container_cluster
  • google_container_node_pool

References

(#15205)

b/298050622

@github-actions github-actions bot added forward/review In review; remove label to forward service/container labels Aug 21, 2023
@dragon4eg

dragon4eg commented Aug 23, 2023

+1. As the error says, there is a limit of 5 concurrent operations (and it turned out the first 5 changes did go through, at least in my case). Applying several times worked for me... this time.
The issue really is new, though; I hadn't observed this behaviour before (I've been using the module for a few months now and have definitely made more than 5 changes at once).
│ Error: googleapi: Error 429: Too many operations: cluster "my-cluster" hits the concurrent operations quota 5, please try again later.
│ , rateLimitExceeded
│
│   with module.gke.google_container_node_pool.pools["my-pool"],
│   on .terraform/modules/gke/modules/private-cluster/cluster.tf line 383, in resource "google_container_node_pool" "pools":
│  383: resource "google_container_node_pool" "pools" {
╵

@mzylowski

+1
Same issue on my side. It just started appearing, and re-applying after a failure helps. Does anyone have an idea how to mitigate this so that terraform apply completes in one run like before? Is there any quota value that can be raised from the GCP console?

@ScottSuarez
Collaborator

Note for the service team:

We might need to introduce some sort of mutex or batching if we can only have 5 concurrent operations.

@ScottSuarez ScottSuarez added bug and removed enhancement forward/review In review; remove label to forward labels Aug 25, 2023
@KatrinaHoffert

KatrinaHoffert commented Aug 28, 2023

Hi, replying here from the team behind this change. The retrying of operations is intentional, as we've recently made improvements such that most node pool operations can now be performed concurrently. But rather than try to rate limit client side or duplicate the complicated logic for determining which operations can be concurrent with other operations, we opted to have Terraform simply retry the creation over and over until it eventually succeeds. In other words, it's like there is a mutex but it's server side.

This isn't supposed to cause overall TF apply failures, though. I believe what's happening is that it's hitting the timeout for overall creation, which defaults to 30 minutes but can be overridden in your config with a standard field (https://developer.hashicorp.com/terraform/language/resources/syntax#operation-timeouts). That said, there is a bug for us to fix here (or at least an unintentional change for us to decide how to handle). Previously (with the one-operation-per-cluster limit), the creation deadline only started once the client-side mutex had been acquired; now it starts as soon as the request is first made, which means some of the deadline is consumed by this period of waiting for other operations, and that was unintentional. Until we can address this, I suggest increasing timeouts.create (etc.) so that this won't time out.
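
For example, raising the deadlines on the node pool resource from the original report would look roughly like this (a minimal sketch; the 60m values are only illustrative):

resource "google_container_node_pool" "workload-node-pool" {
  # ... existing node pool configuration ...

  # Give each operation more headroom, since time spent queued behind the
  # server-side concurrent-operations quota now counts against the deadline.
  timeouts {
    create = "60m"
    update = "60m"
    delete = "60m"
  }
}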

In some cases, this error may be a red herring caused by a different resource failing. For example, with the current quota of 5 operations per cluster, you could create 10 node pools: only 5 would actually start creating, and the other 5 would be blocked waiting for the first 5 to finish. If those first 5 take too long, the pending 5 could time out first, so the error would mention the pending 5 even though the 5 that were actually being created (but taking too long) are the real problem. Sufficiently large node pools failing to become healthy, or stockouts, could cause this. Bumping the timeout would also help in this case.

For most users, Terraform apply shouldn't fail. If you simply see these HTTP 429 errors in the logs but everything overall succeeds, that's working as intended. Since the operations can now be done concurrently instead of having to be sequential (the previous status quo), the overall time to completion should be lower.

We also intend to increase the default concurrency quota soon (which is one reason we didn't limit this client side), with a September ETA. That should significantly help with this and further improve time to completion.

As an aside, the standard Terraform --parallelism flag may also be helpful (it defaults to 10; setting it to 5 is likely an alternative workaround for this issue). That flag applies across all providers, so it is unfortunately broader than necessary (that is, it also limits reconciliation of resources other than GKE node pools).

@ScottSuarez
Collaborator

@KatrinaHoffert, for what it's worth, we can increase the default timeout on these resources if you think that is a good interim fix, but I'm not sure what a good value would be. Could someone from your team help make this change?

@KatrinaHoffert

KatrinaHoffert commented Aug 28, 2023

Actually, while the above is an issue when there are operation conflicts, after getting more details about an instance of this I've realized the more immediate problem is that the error checking in https://github.com/hashicorp/terraform-provider-google/blob/main/google/services/container/resource_container_node_pool.go#L520 handles only the case of operation conflicts (which surface as "failed precondition" errors). The quota errors, on the other hand, are "resource exhausted" (HTTP 429), so they are not being retried when they are meant to be.

I have someone on my team looking into this and have escalated our internal ticket. The --parallelism 5 (or less if you want to account for the possibility of ops that aren't TF initiated) workaround is expected to work, but the timeout workaround I previously mentioned won't work for this case.

@ShaiShalevSQream

ShaiShalevSQream commented Sep 3, 2023

We have a quota of 1. How can we increase it, so that we don't have to run Terraform with parallelism=1?

For example:
deleting NodePool: googleapi: Error 429: Too many operations: cluster "cluster-name" hits the concurrent operations quota 1, please try again later.

@matthew-fawcett

I have someone on my team looking into this and have escalated our internal ticket. The --parallelism 5 (or less if you want to account for the possibility of ops that aren't TF initiated) workaround is expected to work, but the timeout workaround I previously mentioned won't work for this case.

5 doesn't work; only 1 does, which slows deployments down to an unacceptable level.

Is there any movement on the linked PR?

@KatrinaHoffert

We have a quota of 1

My apologies, that shouldn't have happened. A bad flag rollout that was intended to increase the quota resulted in a temporary period of falling back to an unintended value, which unfortunately makes this issue worse. We'll have a fix for that rolling out shortly; it is expected to finish before EOD. This is likely also why --parallelism 1 was reported as necessary (it shouldn't be needed once the fix is out).

@slimatic

Hi @KatrinaHoffert, has the fix for this issue been rolled out yet?

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 13, 2023