
Changing a node_pool's initial_node_count forces recreation of the whole container_cluster #6889

Closed
mltsy opened this issue Jul 29, 2020 · 12 comments

@mltsy

mltsy commented Jul 29, 2020

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.

Terraform Version

Terraform 0.12.24
Google provider version 3.22.0

Affected Resource(s)

  • google_container_node_pool

Terraform Configuration Files

This issue should occur for any node_pool with initial_node_count set to a particular value. Here's an example:

resource "google_container_cluster" "cluster" {
  location = "us-east1"

  name               = "cluster-name"
  min_master_version = "latest"

  remove_default_node_pool = true
  initial_node_count = 1
}

resource "google_container_node_pool" "node_pool" {
  location           = google_container_cluster.cluster.location
  name               = "random-node-pool-name"
  cluster            = google_container_cluster.cluster.name
  initial_node_count = 1

  autoscaling {
    min_node_count = 1
    max_node_count = 3
  }

  node_config {
    image_type = "UBUNTU"

    metadata = {
      disable-legacy-endpoints = "true"
    }
  }
}

Debug Output

I can't gather this without launching a new nodepool to recreate the problem. Let me know if it's really necessary.

Expected Behavior

When I manually change the size of a node_pool in Google's Cloud Console (for whatever reason), it changes the initial_node_count setting for that node_pool based on my manual update of the current node_pool size. Running TF (assuming the config is unchanged) after this node_pool update should have no effect. It should not require re-creation of the entire cluster.

Actual Behavior

Running TF after this node_pool update requires re-creation of the entire cluster, based on the fact that it sees initial_node_count has changed.

Steps to Reproduce

  1. terraform apply (This may require some finagling of google permissions, etc.)
  2. Change the number of nodes to 2 per zone, using the Google Cloud Console web GUI
  3. terraform apply
  4. Thank your God that you noticed it wanted to destroy your entire cluster before you confirmed the plan... 😉

Important Factoids

The most unexpected part here is that changing the current nodepool size from the Cloud Console actually alters the initial_node_count setting - that doesn't make any sense. Maybe we actually need to file a bug report in the GKE API project? But I don't think it would hurt anyone to just have the provider ignore the initial_node_count once the node_pool exists, right? initial_node_count doesn't do anything after that point anyway...
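
(In the meantime, a user-side approximation of "ignore it once the pool exists" is a lifecycle block. This is only a sketch and I haven't exercised it against this exact setup, but ignore_changes should suppress the diff on that one field:)

resource "google_container_node_pool" "node_pool" {
  # ... same configuration as in the example above ...

  lifecycle {
    # Ignore drift in initial_node_count caused by manual resizes via the Cloud Console / setSize API
    ignore_changes = [initial_node_count]
  }
}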

References

The provider docs do point out that "Changing this will force recreation of the resource." - however, they do not point out that resizing your nodepool manually will have the same effect (nor should they, I don't think). If we change the behavior - we may need to update that bit of the documentation to indicate that changing initial_node_count will not make any updates to an existing nodepool.

@ghost ghost added the bug label Jul 29, 2020
@edwardmedia edwardmedia self-assigned this Jul 29, 2020
@edwardmedia
Contributor

edwardmedia commented Jul 29, 2020

@mltsy where do you see initial_node_count in the web GUI? I updated the Number of Nodes in the web GUI and then ran tf plan; it shows the output below, which seems to make sense. Is this not what you saw?

~ node_count          = 2 -> 1
.....
Plan: 0 to add, 1 to change, 0 to destroy.

@mltsy
Author

mltsy commented Jul 29, 2020

Oh - that's strange. That is the same thing I did, but that is not what I saw. The output I got showed:

  # module.cluster.google_container_node_pool.node_pool[0] must be replaced
-/+ resource "google_container_node_pool" "node_pool" {
        cluster             = "staging"
      ~ id                  = "us-east1/staging/k8s-pool-staging" -> (known after apply)
      ~ initial_node_count  = 3 -> 1 # forces replacement
      ...

I did this about 2 weeks ago and just hadn't gotten around to submitting the ticket (because I wanted to make sure the provider in the config was relatively up to date, which it was). The GUI does look slightly different now from what I recall 2 weeks ago, so it's possible the behavior has changed and been fixed... I'll try to verify...

Otherwise, if you happened to change the "location" of the cluster from us-east1 to a specific zone, it may have created a zonal cluster instead, which may not be affected the same way... ?

@ghost ghost removed the waiting-response label Jul 29, 2020
@mltsy
Author

mltsy commented Jul 30, 2020

Okay - I just did it again, and was able to reproduce the same result. This is a screenshot of the field I edited:

[screenshot of the "Number of nodes" field in the Cloud Console node pool editor]

@edwardmedia
Contributor

edwardmedia commented Jul 30, 2020

@mltsy That is the same place where I updated the node_pool size. But I noticed the line below in your plan, which indicates a new id is needed; that is what triggers the recreation plan. Could you share your debug log? Thanks

~ id                  = "us-east1/staging/k8s-pool-staging" -> (known after apply)

@mltsy
Author

mltsy commented Jul 30, 2020

Well, practically everything in the plan is new because of the re-creation, but if you look at the next line you can see that initial_node_count is given as the reason for the replacement (it literally says # forces replacement on that line; I didn't add that).

I verified this by using the gcloud CLI to describe the node_pool, where I can see that the initial_node_count has changed from 1 to 3 according to the API. I'll try to get a debug log, but given the documentation, I think this is expected provider behavior once the API reports a changed initial_node_count.
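
For reference, the check I ran was roughly the following (pool and cluster names taken from the plan output above; adjust them for your own setup):

gcloud container node-pools describe k8s-pool-staging \
  --cluster=staging --region=us-east1 | grep initialNodeCount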

@ghost ghost removed the waiting-response label Jul 30, 2020
@mltsy
Author

mltsy commented Jul 30, 2020

Here's a simpler way to reproduce the root cause, and figure out why you're not seeing the same thing as me:

  1. gcloud container clusters create joes-test --region=us-central1 (wait for it to complete)
  2. gcloud container clusters describe joes-test --region=us-central1 | grep initialNodeCount (should be 3)
  3. Change the "Number of Nodes" in Cloud Console from 3 to 1 and wait for it to finish resizing
  4. gcloud container clusters describe joes-test --region=us-central1 | grep initialNodeCount (should now be 1)

@mltsy
Author

mltsy commented Jul 30, 2020

Let me know if you still need a debug log - I think this should help you reproduce the issue (without needing a debug log?)

@edwardmedia
Contributor

edwardmedia commented Jul 30, 2020

@mltsy I see. In my test, I did not have the line below in google_container_node_pool; it is only in google_container_cluster. Can you try removing it from the google_container_node_pool?

initial_node_count = 1
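
For illustration, the node_pool from the original report with that line removed would look like this (a sketch; the autoscaling block then manages the size):

resource "google_container_node_pool" "node_pool" {
  location = google_container_cluster.cluster.location
  name     = "random-node-pool-name"
  cluster  = google_container_cluster.cluster.name
  # initial_node_count omitted, so the API-managed value can no longer force replacement

  autoscaling {
    min_node_count = 1
    max_node_count = 3
  }

  node_config {
    image_type = "UBUNTU"

    metadata = {
      disable-legacy-endpoints = "true"
    }
  }
}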

@mltsy
Author

mltsy commented Jul 30, 2020

Yes! In fact, that is the work-around we're currently using, and it seems to be working alright 😃

However, I think there are cases where one might want to actually set an initial_node_count (for faster provisioning of a new cluster) - and in that case, it would be nice if the provider wouldn't try to destroy the cluster if that ever changes. I think the reason the GKE API's setSize endpoint changes initial_node_count is that it affects how many nodes come online if you add a new availability zone as well. The fact that there's no way to set this without changing the current auto-scaled node count is an API issue, but given that API issue exists... it would still be nice if the provider would handle a discrepancy in some way that is less drastic than destroying the cluster 😄 (Maybe just resize the cluster, or... ignore that number entirely...?)

@ghost ghost removed the waiting-response label Jul 30, 2020
@edwardmedia
Contributor

@mltsy I am glad the workaround works for you. You have described some good use cases; please file an enhancement request to follow up on them. I am closing this issue, then.

@mltsy
Author

mltsy commented Jul 30, 2020

I'll try to open an enhancement request tomorrow. For now, I think it would make sense to at least document and warn users about this odd behavior (#6896)

@ghost

ghost commented Aug 30, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 hashibot-feedback@hashicorp.com. Thanks!

@ghost ghost locked and limited conversation to collaborators Aug 30, 2020