
GKE node auto-provisioning doesn't work in versions higher than v4.43.0-v4.63.1+ #14465

Closed


@stevekoskie

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.

Terraform Version

Terraform v1.4.6
on darwin_amd64
provider registry.terraform.io/hashicorp/google v4.63.1

Affected Resource(s)

  • google_container_cluster

Terraform Configuration Files

resource "google_service_account" "default" {
  account_id   = "service-account-id-test"
  display_name = "Service Account GKE Testing"
}

resource "google_container_cluster" "primary" {
  name     = "my-gke-cluster-test"
  location = "us-central1"
  remove_default_node_pool = true
  initial_node_count       = 1
  cluster_autoscaling {
    enabled = true
    resource_limits {
      resource_type = "cpu"
      minimum       = 6
      maximum       = 50
    }
    resource_limits {
      resource_type = "memory"
      minimum       = 12
      maximum       = 100
    }
    auto_provisioning_defaults {
      service_account = google_service_account.default.email
      oauth_scopes = [
        "https://www.googleapis.com/auth/cloud-platform"
      ]
    }
  }
}

resource "google_container_node_pool" "primary_preemptible_nodes" {
  name       = "my-node-pool"
  location   = "us-central1"
  cluster    = google_container_cluster.primary.name
  node_count = 1

  node_config {
    preemptible  = true
    machine_type = "e2-medium"

    service_account = google_service_account.default.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]
  }
}

Expected Behavior

Terraform should create a working node auto-provisioning configuration that adds more nodes and node pools as demand increases.

Actual Behavior

Using the supplied configuration with a provider version higher than v4.43.0 fails to create a working node auto-provisioning configuration. Everything appears to be configured correctly, but when you create resource demand on the cluster, it does not add any nodes or node pools.

Steps to Reproduce

  1. Run the supplied code using provider version v4.43.0
  2. Create a Kubernetes pod manifest that requests 16 CPUs (shared in a comment below) and apply it to the default namespace
  3. Verify that the node pool and nodes are created as expected
  4. Run terraform destroy to remove what you have created
  5. Change the provider version to v4.63.1 or any version higher than v4.43.0
  6. Create the same cluster again
  7. After the cluster is built, create the same 16-CPU pod manifest and apply it to the default namespace
  8. Observe that no new nodes or node pools are created

Important Factoids

The issue seems to stem from the use of the auto_provisioning_defaults block. If that block is omitted, the bug does not appear; a minimal sketch of that variant follows.
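For comparison, a hedged sketch of the cluster_autoscaling block with auto_provisioning_defaults omitted (the rest of the cluster config is unchanged from above); with no service account specified, auto-provisioned node pools fall back to the default Compute Engine service account:

cluster_autoscaling {
  enabled = true
  resource_limits {
    resource_type = "cpu"
    minimum       = 6
    maximum       = 50
  }
  resource_limits {
    resource_type = "memory"
    minimum       = 12
    maximum       = 100
  }
  # auto_provisioning_defaults intentionally omitted; with this
  # variant, node auto-provisioning scales up on newer provider
  # versions as well, per the observation above.
}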

@stevekoskie stevekoskie added the bug label May 1, 2023
@stevekoskie stevekoskie changed the title GKE cluster_autoscaling doesn't work in versions than v4.43.0-v4.63.1+ GKE node auto-provisioning doesn't work in versions than v4.43.0-v4.63.1+ May 1, 2023
@stevekoskie stevekoskie changed the title GKE node auto-provisioning doesn't work in versions than v4.43.0-v4.63.1+ GKE node auto-provisioning doesn't work in versions higher than v4.43.0-v4.63.1+ May 1, 2023
@edwardmedia edwardmedia self-assigned this May 1, 2023
@edwardmedia
Contributor

edwardmedia commented May 2, 2023

@stevekoskie trying the deployment below, I don't see a difference between clusters on v4.43.0 and v4.63.1 created with your config above. In both cases, the pod seems to work fine.

kubectl create deployment hello-server --image=us-docker.pkg.dev/google-samples/containers/gke/hello-app:1.0

What did you do for steps 2 and 7? Can you share the details? How did you verify that the node pool and nodes were created as expected?

@stevekoskie
Author

stevekoskie commented May 2, 2023

@edwardmedia
Here is the pod I created to force the CPU request so that a new node pool and node are created automatically.

apiVersion: v1
kind: Pod
metadata:
  name: ubuntu
  labels:
    app: ubuntu
spec:
  containers:
  - name: ubuntu
    image: ubuntu:latest
    command: ["/bin/sleep", "3650d"]
    imagePullPolicy: IfNotPresent
    resources:
      requests:
        cpu: "16"
  restartPolicy: Always

Save that file, and then run kubectl apply -f <filename> -n default
The 16-CPU request should force the cluster to add an additional node pool that can handle it. It works in the older provider version but not the newer one.

@edwardmedia
Contributor

edwardmedia commented May 3, 2023

@stevekoskie thanks for the yaml file. I am able to repro the issue now. As a temporary workaround, you may set auto_repair to true (see the sketch after the payload below).

The management block was introduced in v4.44.0. Below is the partial request payload when management is not provided in the config; with that, both Auto upgrade and Auto repair are set to disabled:

"autoscaling": {
"autoprovisioningNodePoolDefaults": {
"diskSizeGb": 100,
"diskType": "pd-standard",
"imageType": "COS_CONTAINERD",
"management": {}, <----
...
},

When management is omitted from the payload entirely, both Auto upgrade and Auto repair are set to enabled.
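A minimal sketch of the suggested workaround, assuming the provider's standard management arguments (auto_repair and auto_upgrade); setting both flags explicitly makes the request payload match the enabled-by-default behavior seen before v4.44.0, though the comment above suggests at least auto_repair:

auto_provisioning_defaults {
  service_account = google_service_account.default.email
  oauth_scopes = [
    "https://www.googleapis.com/auth/cloud-platform"
  ]
  # With management present but empty, the provider sends
  # "management": {} and GKE disables both settings, so set
  # them explicitly.
  management {
    auto_repair  = true
    auto_upgrade = true
  }
}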

@stevekoskie
Author

Hello @edwardmedia, thank you for your efforts.
I used the following Terraform code to create a new cluster from scratch:

resource "google_container_cluster" "primary" {
  name     = "my-gke-cluster-test"
  location = "us-central1"

  remove_default_node_pool = true
  initial_node_count       = 1
  cluster_autoscaling {
    enabled = true
    resource_limits {
      resource_type = "cpu"
      minimum       = 6
      maximum       = 50
    }
    resource_limits {
      resource_type = "memory"
      minimum       = 12
      maximum       = 100
    }
    auto_provisioning_defaults {
      service_account = google_service_account.default.email
      oauth_scopes = [
        "https://www.googleapis.com/auth/cloud-platform"
      ]
      management {
        auto_repair = true
      }
    }
  }
}

Sadly this did not work. Once I included the management block, it set max_surge to 0, which means it won't scale up. I then added max_surge = 1 in an upgrade_settings block; that did not work either. When I apply the 16-CPU request pod from the comment above, the node pool is never added. The only fix was to go into the UI, edit the node auto-provisioning settings, change nothing, and hit save.
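For reference, a sketch of what that attempt presumably looked like; the placement of upgrade_settings follows the description above rather than a verified provider schema:

auto_provisioning_defaults {
  service_account = google_service_account.default.email
  oauth_scopes = [
    "https://www.googleapis.com/auth/cloud-platform"
  ]
  management {
    auto_repair = true
  }
  # Attempt to undo the max_surge = 0 default that appeared
  # once the management block was included.
  upgrade_settings {
    max_surge = 1
  }
}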

@github-actions

github-actions bot commented Jun 8, 2023

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 8, 2023