
Changing a node_pool's initial_node_count forces recreation of the whole container_cluster #6889

Closed
mltsy opened this issue Jul 29, 2020 · 12 comments

@mltsy

mltsy commented Jul 29, 2020

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.

Terraform Version

Terraform 0.12.24
Google provider version 3.22.0

Affected Resource(s)

  • google_container_node_pool

Terraform Configuration Files

This issue should occur for any node_pool with initial_node_count set to a particular value. Here's an example:

resource "google_container_cluster" "cluster" {
  location = "us-east1"

  name               = "cluster-name"
  min_master_version = "latest"

  remove_default_node_pool = true
  initial_node_count = 1
}

resource "google_container_node_pool" "node_pool" {
  location           = google_container_cluster.cluster.location
  name               = "random-node-pool-name"
  cluster            = google_container_cluster.cluster.name
  initial_node_count = 1

  autoscaling {
    min_node_count = 1
    max_node_count = 3
  }

  node_config {
    image_type = "UBUNTU"

    metadata = {
      disable-legacy-endpoints = "true"
    }
  }
}

Debug Output

I can't gather this without launching a new nodepool to recreate the problem. Let me know if it's really necessary.

Expected Behavior

When I manually change the size of a node_pool in Google's Cloud Console (for whatever reason), it changes the initial_node_count setting for that node_pool based on my manual update of the current node_pool size. Running TF (assuming the config is unchanged) after this node_pool update should have no effect. It should not require re-creation of the entire cluster.

Actual Behavior

Running TF after this node_pool update requires re-creation of the entire cluster, based on the fact that it sees initial_node_count has changed.

Steps to Reproduce

  1. terraform apply (This may require some finagling of google permissions, etc.)
  2. Change the number of nodes to 2 per zone, using the Google Cloud Console web GUI
  3. terraform apply
  4. Thank your God that you noticed it wanted to destroy your entire cluster before you confirmed the plan... 😉

Important Factoids

The most unexpected part here is that changing the current nodepool size from the Cloud Console actually alters the initial_node_count setting - that doesn't make any sense. Maybe we actually need to file a bug report in the GKE API project? But I don't think it would hurt anyone to just have the provider ignore the initial_node_count once the node_pool exists, right? initial_node_count doesn't do anything after that point anyway...
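
(In the meantime, a user-side approximation of "ignore it once the pool exists" is a lifecycle block. This is only a sketch and I haven't exercised it against this exact setup, but ignore_changes should suppress the diff on that one field:)

resource "google_container_node_pool" "node_pool" {
  # ... same configuration as in the example above ...

  lifecycle {
    # Ignore drift in initial_node_count caused by manual resizes via the Cloud Console / setSize API
    ignore_changes = [initial_node_count]
  }
}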

References

The provider docs do point out that "Changing this will force recreation of the resource." - however, they do not point out that resizing your nodepool manually will have the same effect (nor should they, I don't think). If we change the behavior - we may need to update that bit of the documentation to indicate that changing initial_node_count will not make any updates to an existing nodepool.

@ghost ghost added the bug label Jul 29, 2020
@edwardmedia edwardmedia self-assigned this Jul 29, 2020
@edwardmedia
Contributor

edwardmedia commented Jul 29, 2020

@mltsy where do you see initial_node_count in the web GUI? I updated the Number of Nodes in the web GUI and then ran tf plan; it shows the output below, which seems to make sense. Is this not what you saw?

~ node_count          = 2 -> 1
.....
Plan: 0 to add, 1 to change, 0 to destroy.

@mltsy
Author

mltsy commented Jul 29, 2020

Oh - that's strange. That is the same thing I did, but that is not what I saw. The output I got showed:

  # module.cluster.google_container_node_pool.node_pool[0] must be replaced
-/+ resource "google_container_node_pool" "node_pool" {
        cluster             = "staging"
      ~ id                  = "us-east1/staging/k8s-pool-staging" -> (known after apply)
      ~ initial_node_count  = 3 -> 1 # forces replacement
      ...

I did this about 2 weeks ago and just hadn't gotten around to submitting the ticket (because I wanted to make sure the provider in the config was relatively up to date, which it was). The GUI does look slightly different now from what I recall 2 weeks ago, so it's possible the behavior has changed and been fixed... I'll try to verify...

Otherwise, if you happened to change the "location" of the cluster from us-east1 to a specific zone, it may have created a zonal cluster instead, which may not be affected the same way... ?

@ghost ghost removed the waiting-response label Jul 29, 2020
@mltsy
Author

mltsy commented Jul 30, 2020

Okay - I just did it again, and was able to reproduce the same result. This is a screenshot of the field I edited:

[screenshot of the "Number of nodes" field in the Cloud Console node pool editor]

@edwardmedia
Contributor

edwardmedia commented Jul 30, 2020

@mltsy That is the same place where I updated the node_pool size. But I noticed the line below in your plan, which indicates a new id is needed; that is what triggers the recreation plan. Could you share your debug log? Thanks

~ id                  = "us-east1/staging/k8s-pool-staging" -> (known after apply)

@mltsy
Author

mltsy commented Jul 30, 2020

Well, practically everything in the plan is new because of the re-creation, but if you look at the next line you can see that initial_node_count is given as the reason for the replacement (it literally says # forces replacement on that line; I didn't add that).

I verified this by using the gcloud CLI to describe the node_pool, where I can see that the initial_node_count has changed from 1 to 3 according to the API. I'll try to get a debug log, but given the documentation, I think this is expected provider behavior once the API reports a changed initial_node_count.
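
For reference, the check I ran was roughly the following (pool and cluster names taken from the plan output above; adjust them for your own setup):

gcloud container node-pools describe k8s-pool-staging \
  --cluster=staging --region=us-east1 | grep initialNodeCount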

@ghost ghost removed the waiting-response label Jul 30, 2020
@mltsy
Author

mltsy commented Jul 30, 2020

Here's a simpler way to reproduce the root cause, and figure out why you're not seeing the same thing as me:

  1. gcloud container clusters create joes-test --region=us-central1 (wait for it to complete)
  2. gcloud container clusters describe joes-test --region=us-central1 | grep initialNodeCount (should be 3)
  3. Change the "Number of Nodes" in Cloud Console from 3 to 1 and wait for it to finish resizing
  4. gcloud container clusters describe joes-test --region=us-central1 | grep initialNodeCount (should now be 1)

@mltsy
Author

mltsy commented Jul 30, 2020

Let me know if you still need a debug log - I think this should help you reproduce the issue (without needing a debug log?)

@edwardmedia
Contributor

edwardmedia commented Jul 30, 2020

@mltsy I see. In my test, I did not have the line below in google_container_node_pool; it is only in google_container_cluster. Can you try removing it from the google_container_node_pool?

initial_node_count = 1
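
For illustration, the node_pool from the original report with that line removed would look like this (a sketch; the autoscaling block then manages the size):

resource "google_container_node_pool" "node_pool" {
  location = google_container_cluster.cluster.location
  name     = "random-node-pool-name"
  cluster  = google_container_cluster.cluster.name
  # initial_node_count omitted, so the API-managed value can no longer force replacement

  autoscaling {
    min_node_count = 1
    max_node_count = 3
  }

  node_config {
    image_type = "UBUNTU"

    metadata = {
      disable-legacy-endpoints = "true"
    }
  }
}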

@mltsy
Author

mltsy commented Jul 30, 2020

Yes! In fact, that is the work-around we're currently using, and it seems to be working alright 😃

However, I think there are cases where one might want to actually set an initial_node_count (for faster provisioning of a new cluster) - and in that case, it would be nice if the provider wouldn't try to destroy the cluster if that ever changes. I think the reason the GKE API's setSize endpoint changes initial_node_count is that it affects how many nodes come online if you add a new availability zone as well. The fact that there's no way to set this without changing the current auto-scaled node count is an API issue, but given that API issue exists... it would still be nice if the provider would handle a discrepancy in some way that is less drastic than destroying the cluster 😄 (Maybe just resize the cluster, or... ignore that number entirely...?)

@ghost ghost removed the waiting-response label Jul 30, 2020
@edwardmedia
Contributor

@mltsy I am glad the workaround works for you. You have described some good use cases; please file an enhancement request to follow up on them. I am closing this issue, then.

@mltsy
Author

mltsy commented Jul 30, 2020

I'll try to open an enhancement request tomorrow. For now, I think it would make sense to at least document and warn users about this odd behavior (#6896)

@ghost

ghost commented Aug 30, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 hashibot-feedback@hashicorp.com. Thanks!

@ghost ghost locked and limited conversation to collaborators Aug 30, 2020