
GKE node auto-provisioning doesn't work in versions higher than v4.43.0-v4.63.1+ #14465

Closed


@stevekoskie

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.

Terraform Version

Terraform v1.4.6
on darwin_amd64
provider registry.terraform.io/hashicorp/google v4.63.1

Affected Resource(s)

  • google_container_cluster

Terraform Configuration Files

resource "google_service_account" "default" {
  account_id   = "service-account-id-test"
  display_name = "Service Account GKE Testing"
}

resource "google_container_cluster" "primary" {
  name     = "my-gke-cluster-test"
  location = "us-central1"
  remove_default_node_pool = true
  initial_node_count       = 1
  cluster_autoscaling {
    enabled = true
    resource_limits {
      resource_type = "cpu"
      minimum       = 6
      maximum       = 50
    }
    resource_limits {
      resource_type = "memory"
      minimum       = 12
      maximum       = 100
    }
    auto_provisioning_defaults {
      service_account = google_service_account.default.email
      oauth_scopes = [
        "https://www.googleapis.com/auth/cloud-platform"
      ]
    }
  }
}

resource "google_container_node_pool" "primary_preemptible_nodes" {
  name       = "my-node-pool"
  location   = "us-central1"
  cluster    = google_container_cluster.primary.name
  node_count = 1

  node_config {
    preemptible  = true
    machine_type = "e2-medium"

    service_account = google_service_account.default.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]
  }
}

Expected Behavior

Terraform should create a working node auto-provisioning configuration that adds more nodes and node pools as demand increases.

Actual Behavior

Using the supplied configuration with a provider version higher than v4.43.0 fails to create a working node auto-provisioning configuration. Everything appears to be configured correctly, but when you create resource demand on the cluster, it does not add any nodes or node pools.

Steps to Reproduce

  1. Run the supplied code using provider version v4.43.0
  2. Create a Kubernetes pod manifest that requests 16 CPUs (shared in a comment below) and apply it to the default namespace
  3. Verify that the node pool and nodes are created as expected
  4. Run terraform destroy to remove what you have created
  5. Change the provider version to v4.63.1 or any version higher than v4.43.0
  6. Create the same cluster again
  7. After the cluster is built, create the same 16-CPU pod manifest and apply it to the default namespace
  8. Observe that no new nodes or node pools are created

Important Factoids

The issue seems to stem from the use of the auto_provisioning_defaults block. If that block is omitted, the bug does not appear; a minimal sketch of that variant follows.
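For comparison, a hedged sketch of the cluster_autoscaling block with auto_provisioning_defaults omitted (the rest of the cluster config is unchanged from above); with no service account specified, auto-provisioned node pools fall back to the default Compute Engine service account:

cluster_autoscaling {
  enabled = true
  resource_limits {
    resource_type = "cpu"
    minimum       = 6
    maximum       = 50
  }
  resource_limits {
    resource_type = "memory"
    minimum       = 12
    maximum       = 100
  }
  # auto_provisioning_defaults intentionally omitted; with this
  # variant, node auto-provisioning scales up on newer provider
  # versions as well, per the observation above.
}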

@stevekoskie stevekoskie added the bug label May 1, 2023
@stevekoskie stevekoskie changed the title GKE cluster_autoscaling doesn't work in versions than v4.43.0-v4.63.1+ GKE node auto-provisioning doesn't work in versions than v4.43.0-v4.63.1+ May 1, 2023
@stevekoskie stevekoskie changed the title GKE node auto-provisioning doesn't work in versions than v4.43.0-v4.63.1+ GKE node auto-provisioning doesn't work in versions higher than v4.43.0-v4.63.1+ May 1, 2023
@edwardmedia edwardmedia self-assigned this May 1, 2023
@edwardmedia
Contributor

edwardmedia commented May 2, 2023

@stevekoskie trying the deployment below, I don't see a difference between clusters on v4.43.0 and v4.63.1 created with your config above. In both cases, the pod seems to work fine.

kubectl create deployment hello-server --image=us-docker.pkg.dev/google-samples/containers/gke/hello-app:1.0

What did you do for steps 2 and 7? Can you share the details? How did you verify that the node pool and nodes were created as expected?

@stevekoskie
Author

stevekoskie commented May 2, 2023

@edwardmedia
Here is the pod I created to force the CPU request so that a new node pool and node are created automatically.

apiVersion: v1
kind: Pod
metadata:
  name: ubuntu
  labels:
    app: ubuntu
spec:
  containers:
  - name: ubuntu
    image: ubuntu:latest
    command: ["/bin/sleep", "3650d"]
    imagePullPolicy: IfNotPresent
    resources:
      requests:
        cpu: "16"
  restartPolicy: Always

Save that file, and then run kubectl apply -f <filename> -n default
The 16-CPU request should force the cluster to add an additional node pool that can handle it. It works in the older provider version but not the newer one.

@edwardmedia
Contributor

edwardmedia commented May 3, 2023

@stevekoskie thanks for the yaml file. I am able to repro the issue now. As a temporary workaround, you may set auto_repair to true (see the sketch after the payload below).

The management block was introduced in v4.44.0. Below is the partial request payload when management is not provided in the config; with that, both Auto upgrade and Auto repair are set to disabled:

"autoscaling": {
"autoprovisioningNodePoolDefaults": {
"diskSizeGb": 100,
"diskType": "pd-standard",
"imageType": "COS_CONTAINERD",
"management": {}, <----
...
},

When management is omitted from the payload entirely, both Auto upgrade and Auto repair are set to enabled.
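A minimal sketch of the suggested workaround, assuming the provider's standard management arguments (auto_repair and auto_upgrade); setting both flags explicitly makes the request payload match the enabled-by-default behavior seen before v4.44.0, though the comment above suggests at least auto_repair:

auto_provisioning_defaults {
  service_account = google_service_account.default.email
  oauth_scopes = [
    "https://www.googleapis.com/auth/cloud-platform"
  ]
  # With management present but empty, the provider sends
  # "management": {} and GKE disables both settings, so set
  # them explicitly.
  management {
    auto_repair  = true
    auto_upgrade = true
  }
}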

@stevekoskie
Author

Hello @edwardmedia, thank you for your efforts.
I used the following Terraform code to create a new cluster from scratch:

resource "google_container_cluster" "primary" {
  name     = "my-gke-cluster-test"
  location = "us-central1"

  remove_default_node_pool = true
  initial_node_count       = 1
  cluster_autoscaling {
    enabled = true
    resource_limits {
      resource_type = "cpu"
      minimum       = 6
      maximum       = 50
    }
    resource_limits {
      resource_type = "memory"
      minimum       = 12
      maximum       = 100
    }
    auto_provisioning_defaults {
      service_account = google_service_account.default.email
      oauth_scopes = [
        "https://www.googleapis.com/auth/cloud-platform"
      ]
      management {
        auto_repair = true
      }
    }
  }
}

Sadly this did not work. Once I included the management block, it set max_surge to 0, which means it won't scale up. I then added max_surge = 1 in an upgrade_settings block; that did not work either. When I apply the 16-CPU request pod from the comment above, the node pool is never added. The only fix was to go into the UI, edit the node auto-provisioning settings, change nothing, and hit save.
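For reference, a sketch of what that attempt presumably looked like; the placement of upgrade_settings follows the description above rather than a verified provider schema:

auto_provisioning_defaults {
  service_account = google_service_account.default.email
  oauth_scopes = [
    "https://www.googleapis.com/auth/cloud-platform"
  ]
  management {
    auto_repair = true
  }
  # Attempt to undo the max_surge = 0 default that appeared
  # once the management block was included.
  upgrade_settings {
    max_surge = 1
  }
}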

@github-actions

github-actions bot commented Jun 8, 2023

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 8, 2023