
nvidia taint along custom taints in google_container_node_pool #7928

Closed
andre-lx opened this issue Dec 3, 2020 · 21 comments
Labels: breaking-change, forward/linked, persistent-bug, service/container, size/s

Comments

@andre-lx

andre-lx commented Dec 3, 2020

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.

Terraform Version

terraform -v

Terraform v0.13.5
+ provider registry.terraform.io/hashicorp/google v3.49.0
+ provider registry.terraform.io/hashicorp/google-beta v3.49.0

Affected Resource(s)

google_container_node_pool

Terraform Configuration Files

resource "google_container_node_pool" "gpu_pool_test" {
  ...

    taint = [
      {
        effect = "NO_SCHEDULE"
        key    = "nvidia.com/gpu"
        value  = "present"
      },
      {
        key    = "another_taint"
        value  = "true"
        effect = "NO_SCHEDULE"
      },
    ]

....
}

Debug Output

Right now we have a lot of pools, and our GPU pools have our own taints, but we need to comment out this taint in the first deploy:

{
  effect = "NO_SCHEDULE"
  key    = "nvidia.com/gpu"
  value  = "present"
}

Otherwise, terraform will output the error:

Error: error creating NodePool: googleapi: Error 400: Found more than one taint with key nvidia.com/gpu and effect NO_SCHEDULE., badRequest

After the first deploy, we need to uncomment it for the subsequent deploys (terraform apply), or terraform will replace the node_pool each time we run the apply command:

          ~ taint             = [ # forces replacement
                {
                    effect = "NO_SCHEDULE"
                    key    = "another_taint"
                    value  = "true"
                },
              - {
                  - effect = "NO_SCHEDULE"
                  - key    = "nvidia.com/gpu"
                  - value  = "present"
                },
            ]
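To make the workaround concrete, this is a sketch of the first-apply configuration (attribute values are illustrative), with the nvidia taint commented out until the pool exists:

```hcl
resource "google_container_node_pool" "gpu_pool_test" {
  # ...

  node_config {
    # ...

    taint = [
      # Commented out for the first apply: GKE adds this taint itself,
      # and the API rejects a duplicate at creation time.
      # {
      #   effect = "NO_SCHEDULE"
      #   key    = "nvidia.com/gpu"
      #   value  = "present"
      # },
      {
        key    = "another_taint"
        value  = "true"
        effect = "NO_SCHEDULE"
      },
    ]
  }
}
```

Once the pool exists, the commented block is restored so the configuration matches the taint GKE added.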

Important Factoids

Authenticating as a service account instead of a user.

b/299312479

@ghost ghost added the bug label Dec 3, 2020
@edwardmedia edwardmedia self-assigned this Dec 4, 2020
@edwardmedia
Contributor

@andre-lx help me understand how it should work after you uncomment the block?

@andre-lx
Author

andre-lx commented Dec 4, 2020

@andre-lx help me understand how it should work after you uncomment the block?

Hi @edwardmedia. I'm not sure I understand your question correctly.

After uncommenting the nvidia taint, everything works correctly on updates.

The problem is with the first deploy using terraform apply, if the GPU pool has more than one taint.

I will provide a more extensive example:

First terraform apply:

gke-cluster.tf

resource "google_container_cluster" "gke_cluster" {
....
}

resource "google_container_node_pool" "gpu_pool" {
  name     = "gpu-pool"
  project  = project.id
  location = zone

  ...

  cluster            = google_container_cluster.gke_cluster.name

  ...

  node_config {
    machine_type = machine_type

    taint = [
      {
        key    = "my_own_taint"
        value  = "true"
        effect = "NO_SCHEDULE"
      },
    ]
  }

  ...

}

This configuration works, and the pool is created correctly.
If I want to use my own taint in a GPU pool, I need to create the pool without the GPU taint, or terraform will output the error:

Error: error creating NodePool: googleapi: Error 400: Found more than one taint with key nvidia.com/gpu and effect NO_SCHEDULE., badRequest

The next terraform apply:

gke-cluster.tf

resource "google_container_cluster" "gke_cluster" {
....
}

resource "google_container_node_pool" "gpu_pool" {
  name     = "gpu-pool"
  project  = project.id
  location = zone

  ...

  cluster            = google_container_cluster.gke_cluster.name

  ...

  node_config {
    machine_type = machine_type

    taint = [
      {
        key    = "my_own_taint"
        value  = "true"
        effect = "NO_SCHEDULE"
      },
      {
        effect = "NO_SCHEDULE"
        key    = "nvidia.com/gpu"
        value  = "present"
      },
    ]
  }

  ...

}

If I don't include the gpu taint together with our own taints as in the previous file, terraform will "force replace" my pools every time, since the taint is not present in the configuration file.

  # google_container_node_pool.gpu_pool must be replaced
-/+ resource "google_container_node_pool" "gpu_pool" {

        ......

          ~ taint             = [ # forces replacement
                {
                    effect = "NO_SCHEDULE"
                    key    = "another_taint"
                    value  = "true"
                },
              - {
                  - effect = "NO_SCHEDULE"
                  - key    = "nvidia.com/gpu"
                  - value  = "present"
                },
            ]

           .....

That's why I need to comment it out in the first deploy, and uncomment it in the subsequent deploys.

An image with the terraform plan output (with the taint commented out):

[Screenshot 2020-12-04 at 18:02:58]

@edwardmedia
Contributor

edwardmedia commented Dec 6, 2020

@andre-lx I have tested by providing either one of the taints below or both together. All tests pass for me on the first tf apply; I can't hit your error. Changing any taint afterward does show a forced replacement in the following tf apply, which is expected. I noticed the error "more than one taint with key nvidia.com/gpu". Are you aware the key is already in place? Do you provide any other settings in the config that might affect this?

Found more than one taint with key nvidia.com/gpu and effect NO_SCHEDULE., badRequest
resource "google_container_node_pool" "gpu_pool_test" {
  ...

    taint = [
      {
        effect = "NO_SCHEDULE"
        key    = "nvidia.com/gpu"
        value  = "present"
      },
      {
        key    = "another_taint"
        value  = "true"
        effect = "NO_SCHEDULE"
      },
    ]

....
}

@andre-lx
Author

andre-lx commented Dec 6, 2020

Hi @edwardmedia .

Thanks for the quick response.

Since the nvidia taint is the default for GPU node pools created by GKE itself (even if you create the node pools manually), the only configuration missing from my examples that could actually affect this is the guest_accelerator, as in the following example:

  node_config {
    machine_type = ....

    taint = [
      {
        key    = "another_taint"
        value  = "true"
        effect = "NO_SCHEDULE"
      },
      {
        effect = "NO_SCHEDULE"
        key    = "nvidia.com/gpu"
        value  = "present"
      },
    ]

    guest_accelerator = [
      {
        count = 1
        type  = "nvidia-tesla-k80"
      },
    ]
  }

Thanks!

@ghost ghost removed the waiting-response label Dec 6, 2020
@edwardmedia
Contributor

edwardmedia commented Dec 7, 2020

@andre-lx below is the state from my first run. Did I miss anything? There are many incompatible configs, but that seems beyond what the Terraform provider can control. If you see other cases, can you share your FULL terraform code so I can repro the issue? Another thing you may want to try is to see if you can create the pools using the gcloud container ... command.

resource "google_container_node_pool" "primary_preemptible_nodes" {
    cluster             = "issue7928-gke-cluster"
    id                  = "projects/myproject/locations/asia-east1-a/clusters/issue7928-gke-cluster/nodePools/issue7928-node-pool"
    initial_node_count  = 1
    instance_group_urls = [
        "https://www.googleapis.com/compute/v1/projects/myproject/zones/asia-east1-a/instanceGroupManagers/gke-issue7928-gke-cl-issue7928-node-p-8fea93f4-grp",
    ]
    location            = "asia-east1-a"
    name                = "issue7928-node-pool"
    node_count          = 1
    node_locations      = [
        "asia-east1-a",
    ]
    project             = "sunedward-1-autotest"
    version             = "1.16.15-gke.4300"
    management {
        auto_repair  = true
        auto_upgrade = true
    }
    node_config {
        disk_size_gb      = 100
        disk_type         = "pd-standard"
        guest_accelerator = [
            {
                count = 1
                type  = "nvidia-tesla-t4"
            },
        ]
        image_type        = "COS"
        labels            = {}
        local_ssd_count   = 0
        machine_type      = "n1-standard-1"
        metadata          = {
            "disable-legacy-endpoints" = "true"
        }
        oauth_scopes      = [
            "https://www.googleapis.com/auth/logging.write",
            "https://www.googleapis.com/auth/monitoring",
        ]
        preemptible       = true
        service_account   = "default"
        taint             = [
            {
                effect = "NO_SCHEDULE"
                key    = "nvidia.com/gpu"
                value  = "present"
            },
        ]
        shielded_instance_config {
            enable_integrity_monitoring = true
            enable_secure_boot          = false
        }
    }
    upgrade_settings {
        max_surge       = 1
        max_unavailable = 0
    }
}

@andre-lx
Author

andre-lx commented Dec 7, 2020

Hi @edwardmedia.

You didn't miss anything. Below is my full config:

resource "google_container_cluster" "gke_cluster" {
  provider = google-beta
  name     = "my-cluster"
  project  = "my-project"
  location = "europe-west1-b"

  min_master_version = "1.16.15-gke.4300"
  network            = google_compute_network.vpc_gke_cluster.name
  subnetwork         = google_compute_subnetwork.subnet_gke_cluster.name
  networking_mode    = "VPC_NATIVE"

  remove_default_node_pool = true
  initial_node_count       = 1

  logging_service    = "logging.googleapis.com/kubernetes"
  monitoring_service = "monitoring.googleapis.com/kubernetes"

  ip_allocation_policy {
    cluster_ipv4_cidr_block  = "/20"
    services_ipv4_cidr_block = "/20"
  }

  resource_labels = {
    "application" = "my_platform"
  }

  master_auth {

    username = ""
    password = ""

    client_certificate_config {
      issue_client_certificate = false
    }
  }
}

resource "google_container_node_pool" "primary_preemptible_nodes" {
    cluster             = google_container_cluster.gke_cluster.name
    initial_node_count  = 1

    location            = "europe-west1-b"
    name                = "issue7928-node-pool"

    project             = "my-project"
    version             = "1.16.15-gke.4300"
    management {
        auto_repair  = true
        auto_upgrade = true
    }
    node_config {
        disk_size_gb      = 100
        disk_type         = "pd-standard"
        guest_accelerator = [
            {
                count = 1
                type  = "nvidia-tesla-k80"
            },
        ]
        image_type        = "COS"
        labels            = {}
        local_ssd_count   = 0
        machine_type      = "n1-standard-1"
        metadata          = {
            "disable-legacy-endpoints" = "true"
        }
        oauth_scopes      = [
            "https://www.googleapis.com/auth/logging.write",
            "https://www.googleapis.com/auth/monitoring",
        ]
        preemptible       = true
        service_account   = "default"
        taint             = [
            {
                effect = "NO_SCHEDULE"
                key    = "nvidia.com/gpu"
                value  = "present"
            },
        ]
        shielded_instance_config {
            enable_integrity_monitoring = true
            enable_secure_boot          = false
        }
    }
    upgrade_settings {
        max_surge       = 1
        max_unavailable = 0
    }
}

I just copied and pasted your google_container_node_pool into my files and ran tf apply. The following error occurred:

Error: error creating NodePool: googleapi: Error 400: Found more than one taint with key nvidia.com/gpu and effect NO_SCHEDULE., badRequest

The full tf apply output:

Terraform will perform the following actions:

  # google_container_node_pool.primary_preemptible_nodes will be created
  + resource "google_container_node_pool" "primary_preemptible_nodes" {
      + cluster             = "my-cluster"
      + id                  = (known after apply)
      + initial_node_count  = 1
      + instance_group_urls = (known after apply)
      + location            = "europe-west1-b"
      + max_pods_per_node   = (known after apply)
      + name                = "issue7928-node-pool"
      + name_prefix         = (known after apply)
      + node_count          = (known after apply)
      + node_locations      = (known after apply)
      + project             = "my-project"
      + version             = "1.16.15-gke.4300"

      + management {
          + auto_repair  = true
          + auto_upgrade = true
        }

      + node_config {
          + disk_size_gb      = 100
          + disk_type         = "pd-standard"
          + guest_accelerator = [
              + {
                  + count = 1
                  + type  = "nvidia-tesla-k80"
                },
            ]
          + image_type        = "COS"
          + labels            = (known after apply)
          + local_ssd_count   = 0
          + machine_type      = "n1-standard-1"
          + metadata          = {
              + "disable-legacy-endpoints" = "true"
            }
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/logging.write",
              + "https://www.googleapis.com/auth/monitoring",
            ]
          + preemptible       = true
          + service_account   = "default"
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "nvidia.com/gpu"
                  + value  = "present"
                },
            ]

          + shielded_instance_config {
              + enable_integrity_monitoring = true
              + enable_secure_boot          = false
            }

          + workload_metadata_config {
              + node_metadata = (known after apply)
            }
        }

      + upgrade_settings {
          + max_surge       = 1
          + max_unavailable = 0
        }
    }

Plan: 1 to add, 0 to change, 0 to destroy.

Do you want to perform these actions in workspace "my-workspace"?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

google_container_node_pool.primary_preemptible_nodes: Creating...

Error: error creating NodePool: googleapi: Error 400: Found more than one taint with key nvidia.com/gpu and effect NO_SCHEDULE., badRequest

Creating the pool using the gcloud container command, with the same service account as terraform (also tested with my admin account using email):

gcloud container node-pools create issue7928-node-pool --accelerator type=nvidia-tesla-t4,count=1 --cluster my-cluster --machine-type n1-standard-1 --zone europe-west1-b --node-taints nvidia.com/gpu=present:NoSchedule

Output:

ERROR: (gcloud.container.node-pools.create) ResponseError: code=400, message=Found more than one taint with key nvidia.com/gpu and effect NO_SCHEDULE.

This makes sense, since the nvidia taint is already added by default on GPU node pools by GKE itself.

On the terraform side, if you don't add this taint, the GPU pool is created successfully. The problem, as I already described, is on updates, since terraform always shows the "forces replacement".

It's important to note that if you don't need custom taints (that is, without specifying the taint block in the config file), creation and updates work fine at the moment, and the nvidia taint is added by terraform to the state file, as shown below.

First tf apply:

Terraform will perform the following actions:

  # google_container_node_pool.primary_preemptible_nodes will be created
  + resource "google_container_node_pool" "primary_preemptible_nodes" {
      + cluster             = "my-cluster"
      + id                  = (known after apply)
      + initial_node_count  = 1
      + instance_group_urls = (known after apply)
      + location            = "europe-west1-b"
      + max_pods_per_node   = (known after apply)
      + name                = "issue7928-node-pool"
      + name_prefix         = (known after apply)
      + node_count          = (known after apply)
      + node_locations      = (known after apply)
      + project             = "my-project"
      + version             = "1.16.15-gke.4300"

      + management {
          + auto_repair  = true
          + auto_upgrade = true
        }

      + node_config {
          + disk_size_gb      = 100
          + disk_type         = "pd-standard"
          + guest_accelerator = [
              + {
                  + count = 1
                  + type  = "nvidia-tesla-k80"
                },
            ]
          + image_type        = "COS"
          + labels            = (known after apply)
          + local_ssd_count   = 0
          + machine_type      = "n1-standard-1"
          + metadata          = {
              + "disable-legacy-endpoints" = "true"
            }
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/logging.write",
              + "https://www.googleapis.com/auth/monitoring",
            ]
          + preemptible       = true
          + service_account   = "default"
          + taint             = (known after apply)

          + shielded_instance_config {
              + enable_integrity_monitoring = true
              + enable_secure_boot          = false
            }

          + workload_metadata_config {
              + node_metadata = (known after apply)
            }
        }

      + upgrade_settings {
          + max_surge       = 1
          + max_unavailable = 0
        }
    }

Plan: 1 to add, 0 to change, 0 to destroy.

Do you want to perform these actions in workspace "my-workspace"?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

google_container_node_pool.primary_preemptible_nodes: Creating...
....
google_container_node_pool.primary_preemptible_nodes: Still creating... [1m20s elapsed]
google_container_node_pool.primary_preemptible_nodes: Creation complete after 1m24s [id=projects/my-project/locations/europe-west1-b/clusters/my-cluster/nodePools/issue7928-node-pool]

Subsequent tf apply (with the taint block comment or uncomment):

Apply complete! Resources: 0 added, 0 changed, 0 destroyed.

Running terraform state show google_container_node_pool.primary_preemptible_nodes for a pool without the taint block, you can see that the nvidia taint was added to the state file:

        taint             = [
            {
                effect = "NO_SCHEDULE"
                key    = "nvidia.com/gpu"
                value  = "present"
            },
        ]

On the next tf apply, terraform checks that the resource in GKE is equal to the state file, and no replacement is needed.

...

Running terraform state show google_container_node_pool.primary_preemptible_nodes3 for a pool with the taint block, but only with the custom taint, you can also see the nvidia taint being added to the state file along with the custom one:

        taint             = [
            {
                effect = "NO_SCHEDULE"
                key    = "another_taint"
                value  = "true"
            },
            {
                effect = "NO_SCHEDULE"
                key    = "nvidia.com/gpu"
                value  = "present"
            },
        ]

So it's really strange that terraform thinks the GPU pool needs a replacement:

          ~ taint             = [ # forces replacement
                {
                    effect = "NO_SCHEDULE"
                    key    = "another_taint"
                    value  = "true"
                },
              - {
                  - effect = "NO_SCHEDULE"
                  - key    = "nvidia.com/gpu"
                  - value  = "present"
                },
            ]

The question is: why does terraform force replacement of an array that is equal to the same resource in the state file when custom taints are used? With only the nvidia taint, the taint array is successfully added to the state file, and on the subsequent tf apply they match perfectly, so no replacement is needed.

Thanks!

@ghost ghost removed the waiting-response label Dec 7, 2020
@andre-lx andre-lx changed the title nvidia taint in node_pool nvidia taint along custom taints in google_container_node_pool Dec 7, 2020
@edwardmedia
Contributor

edwardmedia commented Dec 7, 2020

@andre-lx forceReplacement on taint is by design. Can you explain why it should not trigger node pool recreation?
Do you still have questions regarding Found more than one taint with key nvidia.com/gpu...? I think running the gcloud command has explained why.

@andre-lx
Author

andre-lx commented Dec 7, 2020

Hi @edwardmedia.

In short:
Since I can't create the node pool with the nvidia taint (it's a default from GKE), how can I prevent the pool recreation each time I run tf apply? How can I set custom taints at the same time as the nvidia taint? Right now, as I said, I need to comment out the nvidia taint on pool creation and uncomment it in the subsequent apply to ensure that the pool is not recreated. After these two steps, I can run tf apply forever and the pool is never recreated.

Why is the pool recreated if the nvidia taint is the default from GKE?

And why is the pool not recreated if no custom taints are used (or rather, if only the nvidia taint exists)?

@ghost ghost removed waiting-response labels Dec 7, 2020
@edwardmedia
Contributor

@andre-lx I am not sure I understand what you said correctly. In my tests, I tried putting 1) both the nvidia and a custom taint together, and 2) either one of the taints alone, in new node pools. All 3 cases were fine; no exceptions were received. I don't understand what you meant below.

Since I can't create the node pool with the nvidia taint, ...

Where do you see nvidia taint is the default by gke? Can you share a document?

From the provider's perspective, any change to taints will trigger pool recreation, because I don't see that the GCP API provides a way to update taints directly. If you run kubectl, you can update the taints, but that is not something Terraform can manage. Does this make sense to you?
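As an illustration of that out-of-band path, taints can be edited directly with kubectl; this sketch assumes a hypothetical node name:

```shell
# Add a taint to a node outside of Terraform's management
# (the node name is hypothetical):
kubectl taint nodes gke-my-cluster-gpu-pool-1234 another_taint=true:NoSchedule

# Remove the same taint again (the trailing dash removes it):
kubectl taint nodes gke-my-cluster-gpu-pool-1234 another_taint:NoSchedule-
```

Changes made this way are invisible to Terraform unless the taint field is reconciled in the config or ignored via lifecycle.ignore_changes.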

@andre-lx
Author

andre-lx commented Dec 7, 2020

@edwardmedia the nvidia taint is created by default on gpu node pools as you can see here:
https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#create

That's why (I think) I can't add the taint to node pools at creation time, as I explained in the other comments, and that's why terraform and gcloud give me the error:

Error: error creating NodePool: googleapi: Error 400: Found more than one taint with key nvidia.com/gpu and effect NO_SCHEDULE., badRequest

Because of this, I don't understand how you managed to create the GPU pool with the nvidia taint specified.

I understand that if you change the taints, either in the Google console or via terraform, terraform will recreate the pool; that makes a lot of sense and I was not expecting another way (since the state file is different from the resource itself). The problem here is that, using custom taints, I can't create the pool with the nvidia taint, and I can't tf apply an unchanged pool without specifying the nvidia taint after creation.

And that's why I need to comment out the nvidia taint on creation (since it is added by GKE itself), and uncomment it in the subsequent tf apply.

I will lay this out as examples; maybe that makes it easier:

1 - No taints in the config file:
1.1 - I create the pool with no taints (taint = [])
1.2 - The pool is created successfully, and the nvidia taint is added to the state file (again, since this is created automatically by GKE)
1.3 - All future tf apply runs will work perfectly, since the taint is in the state file as well as in GKE.

2 - With both the nvidia and a custom taint:
2.1 - I try to create the pool, but the pool can't be created because of the error:

Error: error creating NodePool: googleapi: Error 400: Found more than one taint with key nvidia.com/gpu and effect NO_SCHEDULE., badRequest

2.2 - Solution: create the pool as in example 3 below

3 - Only one custom taint in the config:
3.1 - I create with a taint like this:

        taint             = [
            {
                effect = "NO_SCHEDULE"
                key    = "another_taint"
                value  = "true"
            },
       ]

3.2 - The pool is created successfully, and the custom taint as well as the nvidia taint is added to the state file (again, since the latter is created automatically by GKE)
3.3 - All future tf apply runs will ask for pool replacement. Why? That is the part that doesn't make sense: the state file includes the nvidia taint as well as the custom taint created in step 3.1.
3.4 - Solution: add the nvidia taint to the taint block:

        taint             = [
            {
                effect = "NO_SCHEDULE"
                key    = "another_taint"
                value  = "true"
            },
            {
                effect = "NO_SCHEDULE"
                key    = "nvidia.com/gpu"
                value  = "present"
            },
        ]

3.5 - All future tf apply runs will work perfectly.

@ghost ghost removed waiting-response labels Dec 7, 2020
@edwardmedia
Contributor

edwardmedia commented Dec 7, 2020

@andre-lx I see. Thanks for the link. In my tests, all node pools were added to a new cluster, which is different from adding pools to an existing cluster. That explains why it works for me and not for you:

When you add a GPU node pool to an existing cluster that already runs a non-GPU node pool, GKE automatically taints the GPU nodes with the nvidia.com/gpu taint

All the behaviors you have experienced appear to be controlled by GKE/Kubernetes. I don't think the provider has much room to act here. I am glad you have found a workaround.

@edwardmedia
Contributor

@andre-lx closing this issue then. Feel free to reopen if you see there is something the provider can help. Thank you

@rileykarson rileykarson assigned slevenick and unassigned slevenick Dec 7, 2020
@rileykarson rileykarson added this to the Goals milestone Dec 14, 2020
@andre-lx
Author

Hi. Some update from my side, as a workaround:

From the docs:

taint - (Optional) A list of Kubernetes taints to apply to nodes. GKE's API can only set this field on cluster creation. However, GKE will add taints to your nodes if you enable certain features such as GPUs. If this field is set, any diffs on this field will cause Terraform to recreate the underlying resource. Taint values can be updated safely in Kubernetes (eg. through kubectl), and it's recommended that you do not use this field to manage taints. If you do, lifecycle.ignore_changes is recommended. Structure is documented below.

So you can set only one taint (without the nvidia taint), and ignore the changes with the lifecycle block:

resource "google_container_node_pool" "primary_preemptible_nodes" {
  node_config {
    machine_type = ....

    taint = [
      {
        key    = "another_taint"
        value  = "true"
        effect = "NO_SCHEDULE"
      },
    ]

    guest_accelerator = [
      {
        count = 1
        type  = "nvidia-tesla-k80"
      },
    ]
  }

  lifecycle {
    ignore_changes = [
      node_config[0].taint,
    ]
  }
}

With this, you are able to create and update without losing the nvidia taint. The only problem I found is if you need to update the taints in your Terraform recipe: those changes are also ignored.

@AndreaGiardini

Just adding my voice here as well. This is a problem and it's very annoying, since terraform tries to re-create the nodepool every time because the taints do not match.

The workarounds are:

  • Ignore all the taints altogether (as suggested above)
  • Create the nodepools without the nvidia taint and add them after the first run

More people are discussing the problem here: terraform-google-modules/terraform-google-kubernetes-engine#703

@nader-bitstrapped-com

nader-bitstrapped-com commented Jul 1, 2021

@andre-lx ignoring taint changes with the lifecycle block is a great workaround. Much better than commenting/uncommenting. By the way, I did the same with more than one taint and it works:

resource "google_container_node_pool" "kubeflow_primary_gpu" {
    # ...

  node_config {
    # ...
    taint = [
      {
        key    = "preemptible"
        value  = "true"
        effect = "NO_EXECUTE"
      },
      {
        key    = "cloud.google.com/gke-preemptible"
        value  = "true"
        effect = "NO_SCHEDULE"
      },
    ]
  }

  lifecycle {
    ignore_changes = [
      node_config[0].taint,
    ]
  }
}

@rileykarson
Collaborator

Taints are likely to get fixed in a future major release. The current model for them has proven difficult enough to work with that I don't think we can fix it by adding behaviours in a backwards-compatible way.

@rileykarson
Collaborator

Closed in GoogleCloudPlatform/magic-modules#9011

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 23, 2023
9 participants