
google_container_cluster.node_pool should be a Set. #780

Closed
danawillow opened this issue Nov 22, 2017 · 12 comments

@danawillow
Contributor

As mentioned in both #665 and #779, adding/removing node pools will be easier to do if node_pool is a Set. Though it's possible to add/remove node pools in a List, node pools can only be added at the end of the list, and the plans are unintuitive, with attributes showing as changed rather than just moved elsewhere in the list.

We can hash on the name of the node pool since those are unique in a cluster, though we'll probably have to add some special handling for node pools that are specified using a name prefix or not specified at all (left to be random).

@danawillow danawillow self-assigned this Nov 22, 2017
@danawillow
Contributor Author

So it turns out that because of the way the code is structured, this isn't actually possible right now.

We allow generating a random name at node pool creation time, which is during apply. This means that at plan time, we don't know what the name will hash to. We can't just use a random value for plan (and then let apply decide what it'll actually be) because Terraform has safeguards to make sure the diffs for plan and apply match.

One fix would be to make name required and get rid of the behavior for autogenerating a name. Since Terraform has a random provider anyway, people could just specify that in order to create random names. Since this is a solidly breaking change, it would have to wait until 2.0.0, which will likely occur at some point in Q1 2018 (no promises though).
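
As a rough illustration of that option, here is a minimal sketch that uses the random provider to build an explicit node pool name (the resource names are made up; field names follow the 0.11-era google and random providers):

resource "random_id" "node_pool_suffix" {
  byte_length = 2
}

resource "google_container_cluster" "example" {
  name = "example-cluster"
  zone = "us-central1-a"

  node_pool {
    # Explicit name built with the random provider instead of relying on the
    # provider's autogenerated node pool names.
    name               = "pool-${random_id.node_pool_suffix.hex}"
    initial_node_count = 1
  }
}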

The alternative is to just go ahead with #779, which would allow having the feature sooner, but with a whole bunch of caveats about using it correctly.

Another point to keep in mind is that adding/removing node pools from a cluster is already possible with the google_container_node_pool resource, though that means manually deleting the default node pool from the cluster if you don't want it around. The two resources share the same code, so the only difference between them is drift handling: if a node pool were added outside of Terraform, it would get deleted by Terraform when you specify all node pools in a cluster resource, but not when you specify them with node pool resources.
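
For reference, a minimal sketch of that separate-resource approach (resource names are illustrative, field names follow the provider versions of the time, and the default node pool created with the cluster still has to be removed by hand, as noted above):

resource "google_container_cluster" "example" {
  name               = "example-cluster"
  zone               = "us-central1-a"
  initial_node_count = 1 # default pool; delete it manually if you don't want it around
}

resource "google_container_node_pool" "extra" {
  name       = "extra-pool"
  zone       = "us-central1-a"
  cluster    = "${google_container_cluster.example.name}"
  node_count = 2
}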

I'd love to get some feedback from anyone who was hoping to be able to add/remove node pools from a google_container_cluster resource about what's important to you:

  • Would you rather have a mostly-but-not-entirely working solution sooner, or a definitely working one a bit later?
  • Are you using randomly generated node pool names (or the name_prefix field), or are you specifying the names directly?
  • How many node pools (roughly) do you tend to specify in a cluster?
  • Is there a specific reason you want to specify node pools in the cluster resource instead of a separate node pools resource, or was it a more-or-less arbitrary decision?

Paging a few people who have been active in issues/PRs relating to node pools: @sl1pm4t @davidquarles @rochdev @drzero42 @glindste @georgespalding @burdiyan @birdayz

@davidquarles
Contributor

Bummer! My 2¢:

  • Would you rather have a mostly-but-not-entirely working solution sooner, or a definitely working one a bit later? Are these options mutually exclusive? I guess this depends on the full list of caveats and what the projected timeline for 2.0 is, if you're able to elaborate.
  • Are you using randomly generated node pool names (or the name_prefix field), or are you specifying the names directly? We're specifying names directly. For our test pipelines I'm already using random_pet.
  • How many node pools (roughly) do you tend to specify in a cluster? Zero, as of now. I didn't realize list membership/order was what triggered cluster destroy/create, so I avoided the cluster pattern altogether. I'm creating a cluster with the default pool, deleting the default pool, and then creating each node pool with separate resources (which can't be parallelized). To be truly predictable, each stage requires waiting on the operation in a null_resource.
  • Is there a specific reason you want to specify node pools in the cluster resource instead of a separate node pools resource, or was it a more-or-less arbitrary decision? The cluster spin-up process above currently takes 11-12 minutes for a cluster with one pool, and each additional pool adds something like 2-3 minutes IIRC. Passing node pools at cluster creation lets each node pool be created in parallel on the backend, so provisioning time drops to around 5 minutes (or less), whether we're using 1 node pool or 100. We provision a test cluster and run the e2e suite on every commit, so there's a very real impact on our team's productivity.

@glindstedt

Thanks for taking a thorough look at this!

  • Would you rather have a mostly-but-not-entirely working solution sooner, or a definitely working one a bit later?
    A definitely working one a bit later. We currently get by using Terraform only to initialize a cluster and then managing it through the UI.
  • Are you using randomly generated node pool names (or the name_prefix field), or are you specifying the names directly?
    We're specifying the names directly.
  • How many node pools (roughly) do you tend to specify in a cluster?
    In the cluster resource we only specify the default node pool, since it's created anyway. Then we add another node pool using the node pool resource for a total of 2.
  • Is there a specific reason you want to specify node pools in the cluster resource instead of a separate node pools resource, or was it a more-or-less arbitrary decision?
    More or less arbitrary since the cluster resource creates a default node pool anyway.

@drzero42
Contributor

Great to see work being done here :)

  • Would you rather have a mostly-but-not-entirely working solution sooner, or a definitely working one a bit later?
    I would much rather have a definitely working one later, as I have already worked around most of the current annoyances. There are things that are quite awkward to do right now, but a properly thought-through, well-done solution later is my preference here.
  • Are you using randomly generated node pool names (or the name_prefix field), or are you specifying the names directly?
    No. For randomly generated node pool names, I would have gone for the random provider. For me at least, neither randomness nor name prefixes need to be built-in features of the container resources.
  • How many node pools (roughly) do you tend to specify in a cluster?
    Currently, mostly just one apart from the default-pool, but more will be needed in the future for our use case.
  • Is there a specific reason you want to specify node pools in the cluster resource instead of a separate node pools resource, or was it a more-or-less arbitrary decision?
    Mostly I am forced to, because of limitations in the current implementation. I can't set labels and autoscaling options on the default-pool unless I specify it through the node_pool sub-block. If I do that, I can't use google_container_node_pool to add node pools, because the added node pools are returned in the node_pool sub-block by the API, which makes Terraform want to delete them and then create them again on consecutive runs. There are probably various ways to fix this, but I guess it mostly stems from the way the API works, which both separates how node pools are handled from the rest of the cluster AND ties them very tightly to the cluster.

I doubt anybody actually wants to remove the default-pool. My guess from my own use-case is that users want to be able to specify the configuration of the default-pool as richly as can be done for any additional node pools that are added.

Just spitballing here, but based on my current level of experience with this subject: instead of having some of the attributes from node_pool/node_config exist directly on google_container_cluster for setting up the default-pool, have a default_pool sub-block which uses the node_pool schema, and tie it not to the full node_pool output from the API but only to the default-pool part of that output. Then only allow additional node pools as google_container_node_pool resources. I believe this approach would bring the best of both worlds. You can do tricks like creating multiple node pools for a cluster which share a name prefix, with a list of name postfixes or just numbers, all using Terraform's built-in interpolation functions.
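
A rough sketch of that interpolation trick, assuming an existing google_container_cluster.example resource and illustrative pool names (0.11-era syntax):

variable "worker_pool_count" {
  default = 3
}

# Creates pools named worker-0, worker-1, worker-2.
resource "google_container_node_pool" "worker" {
  count      = "${var.worker_pool_count}"
  name       = "${format("worker-%d", count.index)}"
  zone       = "us-central1-a"
  cluster    = "${google_container_cluster.example.name}"
  node_count = 1
}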

@davidquarles
Contributor

Totally stoked about this, by the way!

@pdecat
Contributor

pdecat commented Nov 30, 2017

  • Would you rather have a mostly-but-not-entirely working solution sooner, or a definitely working one a bit later?
    I'll take both ;)

  • Are you using randomly generated node pool names (or the name_prefix field), or are you specifying the names directly?
    Specified names, easier for monitoring / alerting purposes.

  • How many node pools (roughly) do you tend to specify in a cluster?
    Mostly one for now, usually two during node instance resizing (by the way, it would be awesome if the new pool creation and old pool draining could be automated in some way).
    Probably more in the future when we will make use of specific node capabilities (e.g. GPUs).

  • Is there a specific reason you want to specify node pools in the cluster resource instead of a separate node pools resource, or was it a more-or-less arbitrary decision?
    More or less arbitrary; using a separate resource would also allow using count on them.

Here is how I do it right now with terraform-provider-google 1.2.0:

resource "google_container_cluster" "gke_cluster0" {
  name = "${var.general["project"]}-${var.general["region"]}-cluster0"

  zone       = "${var.general["region"]}-${var.gke["main_zone"]}"
  network    = "${data.terraform_remote_state.infra.net0_name}"
  subnetwork = "${data.terraform_remote_state.infra.subnet0_name}"

  min_master_version = "${var.gke["min_master_version"]}"
  node_version       = "${var.gke["node_version"]}"

  additional_zones = "${formatlist(format("%s-%%s", var.general["region"]), compact(split(",", trimspace(var.gke["additional_zones"]))))}"

  master_auth {
    username = "${var.gke["username"]}"
    password = "${var.gke["password"]}"
  }

  initial_node_count = "${var.gke["min_node_count"]}"

  node_pool {
    name = "${var.general["project"]}-${var.general["region"]}-0"

    node_count = "${var.gke["min_node_count"]}"

    autoscaling {
      min_node_count = "${var.gke["min_node_count"]}"
      max_node_count = "${var.gke["max_node_count"]}"
    }

    node_config {
      machine_type = "${var.gke["machine_type"]}"

      oauth_scopes = "${split(",", var.gke["oauth_scopes"])}" # assuming a comma-separated string, as with additional_zones above

      disk_size_gb = "${var.gke["disk_size_gb"]}"
    }
  }

  maintenance_policy {
    daily_maintenance_window {
      start_time = "06:00"
    }
  }
}

@sl1pm4t
Contributor

sl1pm4t commented Dec 4, 2017

Sorry for the late response.

  • Would you rather have a mostly-but-not-entirely working solution sooner, or a definitely working one a bit later?

The current functionality is working for us and isn't causing any pain - but this is a greenfield deployment and we haven't attempted any of the upgrade / maintenance scenarios yet.

  • Are you using randomly generated node pool names (or the name_prefix field), or are you specifying the names directly?

We just recently moved to using the name_prefix field, in conjunction with create_before_destroy, so we can modify node pools gracefully (see the sketch at the end of this comment).

  • How many node pools (roughly) do you tend to specify in a cluster?

Currently 2 - 3 pools per cluster.

  • Is there a specific reason you want to specify node pools in the cluster resource instead of a separate node pools resource, or was it a more-or-less arbitrary decision?

We exclusively use the separate node pool resource, for the flexibility.
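
A minimal sketch of that pattern (separate node pool resource with name_prefix plus create_before_destroy; the google_container_cluster.example reference is assumed):

resource "google_container_node_pool" "workers" {
  name_prefix = "workers-"
  zone        = "us-central1-a"
  cluster     = "${google_container_cluster.example.name}"
  node_count  = 3

  # With name_prefix, each replacement pool gets a fresh generated name, so the
  # new pool can be created before the old one is destroyed.
  lifecycle {
    create_before_destroy = true
  }
}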

@rileykarson
Collaborator

@danawillow is this still part of the 2.0.0 milestone or should we move it out?

@danawillow danawillow removed this from the 2.0.0 milestone Nov 19, 2018
@danawillow
Contributor Author

Removed, thanks.

@selslack

selslack commented Dec 6, 2018

Any updates on this one? Since the milestone was removed, it's unclear what the plan for it is.

@danawillow
Contributor Author

We're not planning on doing anything with it right now. Here's the very long technical writeup of why it just can't be done. If you need to be able to add/remove node pools, do it via the separate node pool resource.


Why we can't convert Node Pools to be a set

Background

Node pools can currently be created in two ways: via the google_container_node_pool resource or by setting the node_pool field on the google_container_cluster resource.

Currently, if a user attempts to add or remove a node pool on the cluster resource, it'll recreate the cluster. This document outlines the different approaches attempted to allow adding/removing node pools from the cluster, eventually culminating in a decision not to do so.

The criteria for a successful solution are:

  • node pools can be added to or removed from any point in the list
  • we still maintain the ability to detect drift on all fields in the node pool

Attempt 1: Leave the Node Pools as a list

This work was started in #779. Leaving the node pools as a list allows adding to / removing from the end of the list. However, if you try to modify the middle of the list, Terraform tries to merge whatever was in that spot before with whatever is in that spot now. For example, take the config:

resource "google_container_cluster" "is-780" {
  name = "is-780"
  zone = "us-central1-a"

  node_pool {
    name               = "first"
    initial_node_count = 1
    node_config {
      disk_size_gb = 50
    }
  }
}

This will lead to a state that has the following in it:

  node_pool.# = 1
  node_pool.0.initial_node_count = 1
  node_pool.0.name = first
  node_pool.0.node_config.# = 1
  node_pool.0.node_config.0.disk_size_gb = 50

Now, I update my config to add a second node pool:

resource "google_container_cluster" "is-780" {
  name = "is-780"
  zone = "us-central1-a"

  node_pool {
    name               = "second"
    initial_node_count = 2
  }

  node_pool {
    name               = "first"
    initial_node_count = 1
    node_config {
      disk_size_gb = 50
    }
  }
}

The plan output, though a bit confusing, looks reasonable. It's shifting the first node pool over to the second slot, and putting the second node pool in at the beginning:

  ~ google_container_cluster.is-780
      node_pool.#:                                   "1" => "2"
      node_pool.0.name:                              "first" => "second"
      node_pool.0.node_count:                        "1" => "2"
      node_pool.1.initial_node_count:                "" => "1"
      node_pool.1.instance_group_urls.#:             "" => <computed>
      node_pool.1.management.#:                      "" => <computed>
      node_pool.1.name:                              "" => "first"
      node_pool.1.node_config.#:                     "0" => "1"
      node_pool.1.node_config.0.disk_size_gb:        "" => "50"
      node_pool.1.node_config.0.guest_accelerator.#: "" => <computed>
      node_pool.1.node_config.0.oauth_scopes.#:      "" => <computed>
      node_pool.1.node_config.0.preemptible:         "" => "false"

However, because of the way that Terraform merges config and state, doing a d.GetChange will show that the "second" node pool should have a disk size of 50:

n:([]interface {}) (len=2 cap=2) {
 (map[string]interface {}) (len=11) {
  [...]
  (string) (len=4) "name": (string) (len=6) "second",
  [...]
  (string) (len=11) "node_config": ([]interface {}) (len=1 cap=1) {
   (map[string]interface {}) (len=15) {
    (string) (len=16) "min_cpu_platform": (string) "",
    (string) (len=12) "disk_size_gb": (int) 50,
    [...]
    }),
    [...]
   }
  }
 },
 (map[string]interface {}) (len=11) {
  [...]
  (string) (len=4) "name": (string) (len=5) "first",
  (string) (len=7) "version": (string) "",
  (string) (len=11) "node_config": ([]interface {}) (len=1 cap=1) {
   (map[string]interface {}) (len=15) {
    [...]
    (string) (len=12) "disk_size_gb": (int) 50,
    [...]
    }),
    [...]
   }
  },
  (string) (len=18) "initial_node_count": (int) 1
 }
}

Since we can't tell for sure what the correct value is for all of the fields, we can't leave the Node Pools as a list.

Attempt 2: Make the Node Pools a Set using the default SchemaSetFunc

Let's take that same config from earlier:

resource "google_container_cluster" "is-780" {
  name = "is-780"
  zone = "us-central1-a"

  node_pool {
    name               = "first"
    initial_node_count = 1
    node_config {
      disk_size_gb = 50
    }
  }
}

Our state now has the following in it:

  node_pool.# = 1
  [...]
  node_pool.2451374666.name = first
  node_pool.2451374666.name_prefix =
  node_pool.2451374666.node_config.# = 1
  node_pool.2451374666.node_config.0.disk_size_gb = 50
  [...]
  node_pool.2451374666.node_count = 1
  node_pool.2451374666.version = 1.9.7-gke.6

Unfortunately, if we run terraform plan without changing anything, we get this output:

  ~ google_container_cluster.is-780
      [...]
      node_pool.2451374666.name:                       "first" => ""
      [...]
      node_pool.2489048920.initial_node_count:         "" => "1"
      node_pool.2489048920.instance_group_urls.#:      "" => <computed>
      [...]
      node_pool.2489048920.name:                       "" => "first"
      node_pool.2489048920.name_prefix:                "" => <computed>
      node_pool.2489048920.node_config.#:              "0" => "1"
      node_pool.2489048920.node_config.0.disk_size_gb: "" => "50"
      [...]
      node_pool.2489048920.node_count:                 "" => <computed>
      node_pool.2489048920.version:                    "" => <computed>

This plan output shows that the hash computed at plan time is different from the hash computed when we wrote the value to state.

This happens because of fields that are both Optional and Computed. Take version as an example: when we store the node pool in state, it has a value for version, so it's part of the hash. When we run terraform plan again, TF only has an interface{} to hash on and no access to state, so version is considered unset, changing the hash.

Attempt 3: Use a custom SchemaSetFunc

This has the same problems as attempt 2. If our custom function includes fields that are Optional+Computed, the hashes will change between when we write to state and when we run our next plan. If we don't include these fields, then we can't detect drift in them if they are set by a user.

Attempt 4: Use CustomizeDiff to diff based on the fields instead of a hash function

The issue with using a SchemaSetFunc for diffing is that our resource has a different value when we pass it into the hash function at the time we write state than at the time we do our next plan. If we can diff on the resource only at plan time and not worry about the hash, then we don't have that same problem.

We have the ability to diff on the resource at plan time by using CustomizeDiff. If we go through every field, we can check whether the values in state and the new values are equivalent, ignoring changes to fields that are Optional+Computed and don't have a value set in the new config. If we then introduce a new field whose value we change whenever we find a diff, we can hash on that field alone and remove everything else from the SchemaSetFunc.

However, this approach fails when we try to diff two fields that have a DiffSuppressFunc. These functions take a *schema.ResourceData as a parameter, and we don't have access to that inside of our CustomizeDiff, just a *schema.ResourceDiff. Since ResourceData is a struct type and not an interface, we can't create our own implementation of it based off of the ResourceDiff. We also can't just ignore that parameter, because it gets used in one of the DiffSuppressFuncs for node pools:

func taintDiffSuppress(k, old, new string, d *schema.ResourceData) bool {
        if strings.HasSuffix(k, "#") {
                oldCount, oldErr := strconv.Atoi(old)
                newCount, newErr := strconv.Atoi(new)
                // If either of them isn't a number somehow, or if there's one that we didn't have before.
                return oldErr != nil || newErr != nil || oldCount == newCount+1
        } else {
                lastDot := strings.LastIndex(k, ".")
                taintKey := d.Get(k[:lastDot] + ".key").(string)
                if taintKey == "nvidia.com/gpu" {
                        return true
                } else {
                        return false
                }
        }
}

This DiffSuppress needs to suppress the diff on an entire nested object based on the value of one of its fields, so it needs the ability to read a specific field out of the ResourceData.

This gist contains the code for this attempt as well as some commented-out code from other attempts.

@ghost

ghost commented Jan 6, 2019

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 hashibot-feedback@hashicorp.com. Thanks!

@ghost ghost locked and limited conversation to collaborators Jan 6, 2019