
Impossible to reliably create a GKE cluster using terraform #2022

Closed
vncntvandriessche opened this issue Sep 11, 2018 · 26 comments

Comments

@vncntvandriessche

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
  • If an issue is assigned to the "modular-magician" user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to "hashibot", a community member has claimed the issue already.

Terraform Version

❯❯❯ terraform -v
Terraform v0.11.8
+ provider.google v1.17.1

Affected Resource(s)

As far as I've tested, at least the following resources are affected:

  • google_container_cluster
  • google_container_node_pool

Terraform Configuration Files

provider "google" {
  credentials = "${file(".account.json")}"
  project     = "example-001"
  region      = "europe-west1"
}

resource "google_container_cluster" "production-001" {
  name               = "production-001"
  zone               = "europe-west1-c"
  initial_node_count = 3

  node_config {
    oauth_scopes = [
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
    ]
  }
}

resource "google_container_node_pool" "webpool-001" {
  name    = "webpool-001"
  cluster = "${google_container_cluster.production-001.name}"
  zone    = "europe-west1-c"

  node_config {
    machine_type = "n1-standard-1"
    oauth_scopes = [
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
    ]
  }

  node_count = 3
  autoscaling {
    min_node_count = 3
    max_node_count = 10
  }

  management {
    auto_repair  = false
    auto_upgrade = false
  }
}

Debug Output

https://gist.github.com/vncntvandriessche/84c404a4950eb35abe6b3099ef8cc435

Panic Output

Expected Behavior

I expected Terraform to build the GKE cluster and attach the matching node pool without failures due to API errors.

Actual Behavior

We end up with a broken Terraform state because the API reports an error.

Steps to Reproduce

  1. terraform init
  2. terraform apply

Important Factoids

  • If we run apply again after this failure, Terraform fails because the pool already exists, even though it was never registered in the state (see the import sketch below).
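
One possible recovery sketch for that situation (assuming the pool really does exist in GCP but not in state): import it back into state so the next apply reconciles it instead of trying to recreate it. The import ID format shown here is an assumption and varies by provider version, so check the google_container_node_pool import documentation.

# Sketch: pull the orphaned node pool back into state after a failed apply.
# The ID format ("{zone}/{cluster}/{pool}") is an assumption; some provider
# versions expect a project-prefixed form instead.
terraform import google_container_node_pool.webpool-001 europe-west1-c/production-001/webpool-001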

References

  • #0000
@directionless

I think I'm hitting this, but with a slightly different set of actual behaviors.

  • On the first run, Terraform creates the cluster and the node pool, and then panics as described.
  • Apparently unknown to Terraform, the cluster and node pool are created.
  • The next run destroys the cluster and node pool and attempts to make them anew, triggering the same panic.

If I look at the GKE web UI, sometimes it tells me it's resizing the master, other times that it's creating the node pool. Outside of Terraform, I have found that changing node pools can result in long apiserver unavailability while it resizes.

For me, it pretty consistently fails at 13 minutes, which makes it look like a timeout is being hit. But the underlying code has a 30-minute timeout, so that seems like an interesting discrepancy.

@directionless

Testing some more... the google_container_cluster seems fine; it's the addition of google_container_node_pool that causes errors.

If I comment out google_container_node_pool, it applies fine and I get a GKE cluster. But if I add it back in, the apply bombs out at 13 minutes, even though the node pool is created anyway. Subsequent applies remove the prior node pool, then time out at 13 minutes, and the cycle repeats.

@cepefernando

cepefernando commented Sep 12, 2018

I have faced the same issue. After some troubleshooting I noticed that this error appears when the node pool has the autoscaling parameter. As a temporary fix, if you remove that node pool and add a node pool without autoscaling enabled, it should work.

@nat-henderson
Contributor

Yes, this is an unfortunate error being returned from GKE because the configuration you're pushing is causing it to be unavailable at the 10m mark (which I believe is the current timeout). If you believe that @directionless is correct and that the apiserver will become available again sometime after that, you can increase the timeout for create (or update, if you're hitting this on update) to a sufficiently long window. As a non-k8s expert, I unfortunately can't say for sure, but it certainly feels right. :)

Google's Terraform provider cannot validate your GKE config - there are too many possible configurations for us to be confident we are blocking the ones that will not work while allowing all valid configs. The only change we can really make is to make sure that the node pool does end up in state. I'm happy to add that. I'll try to figure that out and send a PR.
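
For example, raising the create/update timeouts in the resource block might look roughly like this (a sketch only — the values are assumptions, and whether a longer window actually helps depends on the apiserver recovering):

resource "google_container_node_pool" "webpool-001" {
  # ... existing arguments as in the config above ...

  # Assumed values, raising create/update above the provider defaults.
  timeouts {
    create = "60m"
    update = "60m"
  }
}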

@jamielennox
Contributor

So I don't think it's a timeout issue. The create timeout is already 30 minutes (we might want to set update to the same):

		Timeouts: &schema.ResourceTimeout{
			Create: schema.DefaultTimeout(30 * time.Minute),
			Update: schema.DefaultTimeout(10 * time.Minute),
			Delete: schema.DefaultTimeout(10 * time.Minute),
		},

The problem seems to be that the API returns DONE. The logs start overwriting each other, so I got this last message from mitmproxy:

{
    "detail": "All cluster resources were brought up, but the cluster API is reporting that: component \"kube-apiserver\" from endpoint \"gke-a2ef596d3e557814a5cb-2e7e\" is unhealthy\ngoroutine 425382131 [running]:\nruntime/debug.Stack(0xc01b85d51b, 0x3, 0x2dc1b7a)\n\tthird_party/go/gc/src/runtime/debug/stack.go:24 +0xa7\ngoogle3/cloud/kubernetes/engine/common/errdesc.(*GKEErrorDescriptor).createErr(0x55277e0, 0xc0004fa380)\n\tcloud/kubernetes/engine/common/error_desc.go:199 +0x26\ngoogle3/cloud/kubernetes/engine/common/errdesc.(*GKEErrorDescriptor).WithDetail(0x55277e0, 0x312a4a0, 0xc0087d54e0, 0xc0087d54e0, 0x3121ac0)\n\tcloud/kubernetes/engine/common/error_desc.go:166 +0x40\ngoogle3/cloud/kubernetes/engine/common/healthcheck.glob..func1.1(0x0, 0xc00f5b75c0)\n\tcloud/kubernetes/engine/common/healthcheck.go:141 +0x7bb\ngoogle3/cloud/kubernetes/engine/common/call.WithTimeout(0x318d620, 0xc017877770, 0x77359400, 0x8bb2c97000, 0xc024bedd08, 0xc017877770, 0xc012577180)\n\tcloud/kubernetes/engine/common/call.go:36 +0x153\ngoogle3/cloud/kubernetes/engine/common/healthcheck.glob..func1(0x318d620, 0xc017877770, 0xc024cac000, 0xc0021b1500, 0xc005eedc70, 0x8bb2c97000, 0x0, 0x0)\n\tcloud/kubernetes/engine/common/healthcheck.go:137 +0x33b\ngoogle3/cloud/kubernetes/engine/server/deploy/deploy.upgradeMasterAndVerify.func3(0xc002f1e180, 0x318d560, 0xc0173b4040, 0x7fd96c0576f8, 0xc00ea74880, 0xc021551d80, 0x0, 0xc026875ef0, 0xc024cac000, 0xc0021b1500, ...)\n\tcloud/kubernetes/engine/server/deploy/update.go:969 +0x1b3\ngoogle3/cloud/kubernetes/engine/server/deploy/deploy.upgradeMasterAndVerify(0x318d560, 0xc0173b4040, 0xc002f1e180, 0x7fd96c0576f8, 0xc00ea74880, 0xc024cac000, 0xc021551d80, 0x0, 0x1, 0x0, ...)\n\tcloud/kubernetes/engine/server/deploy/update.go:975 +0x13f\ngoogle3/cloud/kubernetes/engine/server/deploy/deploy.(*Deployer).recreateMasterReplicas.func2(0x0, 0x0)\n\tcloud/kubernetes/engine/server/deploy/update.go:546 +0x23c\ngoogle3/cloud/kubernetes/engine/common/errors.CollectFns.func1(0xc00ed534a0, 0xc0087f6c80)\n\tcloud/kubernetes/engine/common/errors.go:162 +0x27\ncreated by google3/cloud/kubernetes/engine/common/errors.CollectFns\n\tcloud/kubernetes/engine/common/errors.go:162 +0x82\n.",
    "endTime": "2018-09-13T01:36:18.837939633Z",
    "name": "operation-1536801745861-34ba47a8",
    "operationType": "CREATE_NODE_POOL",
    "selfLink": "https://container.googleapis.com/v1beta1/projects/1111111/zones/australia-southeast1-a/operations/operation-1536801745861-34ba47a8",
    "startTime": "2018-09-13T01:22:25.861642499Z",
    "status": "DONE",
    "statusMessage": "All cluster resources were brought up, but the cluster API is reporting that: component \"kube-apiserver\" from endpoint \"gke-a2ef596d3e557814a5cb-2e7e\" is unhealthy\ngoroutine 425382131 [running]:\nruntime/debug.Stack(0xc01b85d51b, 0x3, 0x2dc1b7a)\n\tthird_party/go/gc/src/runtime/debug/stack.go:24 +0xa7\ngoogle3/cloud/kubernetes/engine/common/errdesc.(*GKEErrorDescriptor).createErr(0x55277e0, 0xc0004fa380)\n\tcloud/kubernetes/engine/common/error_desc.go:199 +0x26\ngoogle3/cloud/kubernetes/engine/common/errdesc.(*GKEErrorDescriptor).WithDetail(0x55277e0, 0x312a4a0, 0xc0087d54e0, 0xc0087d54e0, 0x3121ac0)\n\tcloud/kubernetes/engine/common/error_desc.go:166 +0x40\ngoogle3/cloud/kubernetes/engine/common/healthcheck.glob..func1.1(0x0, 0xc00f5b75c0)\n\tcloud/kubernetes/engine/common/healthcheck.go:141 +0x7bb\ngoogle3/cloud/kubernetes/engine/common/call.WithTimeout(0x318d620, 0xc017877770, 0x77359400, 0x8bb2c97000, 0xc024bedd08, 0xc017877770, 0xc012577180)\n\tcloud/kubernetes/engine/common/call.go:36 +0x153\ngoogle3/cloud/kubernetes/engine/common/healthcheck.glob..func1(0x318d620, 0xc017877770, 0xc024cac000, 0xc0021b1500, 0xc005eedc70, 0x8bb2c97000, 0x0, 0x0)\n\tcloud/kubernetes/engine/common/healthcheck.go:137 +0x33b\ngoogle3/cloud/kubernetes/engine/server/deploy/deploy.upgradeMasterAndVerify.func3(0xc002f1e180, 0x318d560, 0xc0173b4040, 0x7fd96c0576f8, 0xc00ea74880, 0xc021551d80, 0x0, 0xc026875ef0, 0xc024cac000, 0xc0021b1500, ...)\n\tcloud/kubernetes/engine/server/deploy/update.go:969 +0x1b3\ngoogle3/cloud/kubernetes/engine/server/deploy/deploy.upgradeMasterAndVerify(0x318d560, 0xc0173b4040, 0xc002f1e180, 0x7fd96c0576f8, 0xc00ea74880, 0xc024cac000, 0xc021551d80, 0x0, 0x1, 0x0, ...)\n\tcloud/kubernetes/engine/server/deploy/update.go:975 +0x13f\ngoogle3/cloud/kubernetes/engine/server/deploy/deploy.(*Deployer).recreateMasterReplicas.func2(0x0, 0x0)\n\tcloud/kubernetes/engine/server/deploy/update.go:546 +0x23c\ngoogle3/cloud/kubernetes/engine/common/errors.CollectFns.func1(0xc00ed534a0, 0xc0087f6c80)\n\tcloud/kubernetes/engine/common/errors.go:162 +0x27\ncreated by google3/cloud/kubernetes/engine/common/errors.CollectFns\n\tcloud/kubernetes/engine/common/errors.go:162 +0x82\n.",
    "targetLink": "https://container.googleapis.com/v1beta1/projects/111111/zones/australia-southeast1-a/clusters/jamie-test/nodePools/jamie-test-nodes",
    "zone": "australia-southeast1-a"
}

So DONE with a statusMessage is being passed back as a failure from the command. Our choices would seem to be either to ignore the error and fetch the node pool information again from Google, or to figure out why the Google APIs changed to start returning a failure.
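
For illustration, a minimal sketch (not the provider's actual code) of that condition, using the generated Go client for the Container API — an operation that reports DONE but still carries a statusMessage is what surfaces as a failure:

package main

import (
	"fmt"

	container "google.golang.org/api/container/v1"
)

// operationError treats a "DONE" operation that still carries a statusMessage
// as a failure, which is the behaviour observed in this issue.
func operationError(op *container.Operation) error {
	if op.Status == "DONE" && op.StatusMessage != "" {
		return fmt.Errorf("operation %s finished with error: %s", op.Name, op.StatusMessage)
	}
	return nil
}

func main() {
	// Values abbreviated from the mitmproxy capture above.
	op := &container.Operation{
		Name:          "operation-1536801745861-34ba47a8",
		Status:        "DONE",
		StatusMessage: "component \"kube-apiserver\" ... is unhealthy",
	}
	if err := operationError(op); err != nil {
		fmt.Println(err)
	}
}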

@jamielennox
Contributor

Note that this seems to happen regardless of the remove_default_node_pool setting, which I could see causing the API to not be ready yet.

@directionless

I started looking at this again.

I ran terraform apply, and 12m 30s later got the same error. This time I also noticed it in the web console, and the stack dump pretty clearly shows that the Kubernetes apiserver is failing its health check. (Y'all might have noticed that already.)

I opened a Google support case about it. Between that and the consistent 12m 30s, something seems fishy.

directionless added a commit to directionless/terraform-provider-google that referenced this issue Sep 14, 2018
As discussed in [issue/2022](hashicorp#2022), google is returning some odd data from a node pool create. 

From what I can tell, the underlying request succeeds, but there's an apiserver problem and the health check is failing. So this is a pretty coarse hammer to work around it. Hopefully, Google will fix it.
@directionless

directionless commented Sep 14, 2018

Google support says they can reproduce this, so that's positive. Meanwhile, I made a patch to ignore that error. I'll PR it if you want, but it's a bit ugly.

master...directionless:workaround-2022

Though my apply now succeeds, I think I'm now running into #1712.

@danawillow
Contributor

Cool, I also filed an issue internally against the team, so hopefully between your issue and mine, we'll be able to get to the bottom of this.

Just in case it was lost in the comments, @cepefernando pointed out that this seems to only happen when autoscaling is configured, so one other thing to try would be to create the node pool without autoscaling, and then add autoscaling afterwards.

@JackFazackerley

@danawillow I just created a cluster with one node_pool without autoscaling and it was successful. I then added the autoscaling to the existing cluster and it updated in-place successfully. No errors, and Terraform kept the state of the node_pool.

It's an annoying way around the error but a working one for now.

@wibobm

wibobm commented Sep 16, 2018

This happens using the Google console to create a new cluster as well.

@guillaumeeb

Just got this issue without using terraform too...

finished with error: All cluster resources were brought up, but the cluster API is reporting that: component "kube-apiserver" from endpoint "gke-c......" is unhealthy

@Legogris

Legogris commented Sep 18, 2018

EDIT: PEBKAC here, just keeping this comment for conversation context.

I am having a different issue which is potentially related: Creating a GKE cluster with Terraform creates no default node pool.

Terraform v0.11.7
Google provider v1.16

resource "google_container_cluster" "y" {
  name               = "y"
  project            = "${google_project.project.project_id}"
  zone               = "us-east1-b"

  additional_zones = [
    "us-east1-c",
    "us-east1-d"
  ]

  initial_node_count = 2

  maintenance_policy {
    daily_maintenance_window {
      start_time = "11:00"
    }
  }

  remove_default_node_pool = true
  node_config {

    machine_type = "n1-standard-1"

    oauth_scopes = [
      "https://www.googleapis.com/auth/devstorage.read_only",
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
      "https://www.googleapis.com/auth/service.management.readonly",
      "https://www.googleapis.com/auth/servicecontrol",
      "https://www.googleapis.com/auth/trace.append",
      "https://www.googleapis.com/auth/compute"
    ]

    labels {
      stack = "x"
    }

    tags = [ "x" ]
  }
}

@JackFazackerley

@Legogris you're setting remove_default_node_pool = true. Remove that line if you want the default node pool.

@Legogris

@JackFazackerley: Derp, I somehow managed to gloss over that line every time I looked at my template even as I edited it for pasting. Thanks.

@directionless

directionless commented Sep 19, 2018

@wibobm Happens via the console? That's super interesting. Do you happen to have a screenshot, or the specifics of what you set for that? I found a gcloud reproduction and will write up more tonight.

@danawillow
Contributor

FYI to all- I'm tracking this issue internally and the GKE team is working very hard on it. I'm leaving this issue open since it's not resolved yet, but the issue is not Terraform-specific. I'll update again once I have more I can say.

@directionless

directionless commented Sep 20, 2018

@danawillow Cool. Sounds like y'all have enough of a reproduction. My support ticket has been less productive :)

From the gcloud command line, it definitely seems like the resizing you pointed at.

@JackFazackerley

Google Cloud support have just got back with a solution for the issue:

Description:
We are investigating an issue with Google Kubernetes Engine. Customers may receive an error like: "All cluster resources were brought up, but the cluster API is reporting that: component kube-apiserver from endpoint gke-HASH is unhealthy" when they are creating a NodePool with Autoscaling enabled on 1.9.x clusters. We will provide more information by Thursday, 2018-09-20 10:45 US/Pacific.

Workaround:
Customers can work around this by:

  1. Create the NodePool without Autoscaling, then enable Autoscaling once that's complete.
  2. Upgrade to 1.10.
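
In Terraform terms, that two-step workaround would look roughly like the sketch below (illustrative only, reusing the names from the configuration at the top of this issue): apply the pool without the autoscaling block first, then add the block back and apply again so it is enabled on the existing pool.

# Step 1: first apply — create the pool with a fixed node_count and no autoscaling block.
resource "google_container_node_pool" "webpool-001" {
  name       = "webpool-001"
  cluster    = "${google_container_cluster.production-001.name}"
  zone       = "europe-west1-c"
  node_count = 3
}

# Step 2: second apply — re-add the autoscaling block so it is enabled in place:
#   autoscaling {
#     min_node_count = 3
#     max_node_count = 10
#   }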

@edevil

edevil commented Sep 21, 2018

I'm creating a 1.10 cluster and also have this issue.

@JackFazackerley

@edevil oh... I'll get back to them. Cheers for trying.

@edevil

edevil commented Sep 21, 2018

@JackFazackerley creating the nodepool without autoscaling and enabling it afterwards worked though.

@teh

teh commented Sep 23, 2018

I also see this with 1.10.

In addition, when this error occurs I also see another, potentially related behaviour: pods scheduled on the first node created (the same node as kube-dns) can't resolve any DNS queries, though pinging other pods works fine. It's a bit random, but maybe it helps someone. (similar report)

@JackFazackerley

Google Cloud support got back to me again with the following:

The issue with Google Kubernetes Engine NodePool has been resolved for all affected users as of Saturday, 2018-09-22 09:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

I have tested this myself and it is working fine.

@vncntvandriessche
Author

@JackFazackerley That's great news!

A big thanks to everyone who was involved with this issue! Never expected this to be handled so quickly.

I'll close this issue as I'd say this is no longer an issue.

@ghost

ghost commented Nov 16, 2018

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 hashibot-feedback@hashicorp.com. Thanks!

@ghost ghost locked and limited conversation to collaborators Nov 16, 2018