
Impossible to reliably create a GKE cluster using terraform #2022

Closed
vncntvandriessche opened this issue Sep 11, 2018 · 26 comments

Comments

@vncntvandriessche

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
  • If an issue is assigned to the "modular-magician" user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to "hashibot", a community member has claimed the issue already.

Terraform Version

❯❯❯ terraform -v
Terraform v0.11.8
+ provider.google v1.17.1

Affected Resource(s)

As far as I've tested, at least the following resources are affected:

  • google_container_cluster
  • google_container_node_pool

Terraform Configuration Files

provider "google" {
  credentials = "${file(".account.json")}"
  project     = "example-001"
  region      = "europe-west1"
}

resource "google_container_cluster" "production-001" {
  name               = "production-001"
  zone               = "europe-west1-c"
  initial_node_count = 3

  node_config {
    oauth_scopes = [
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
    ]
  }
}

resource "google_container_node_pool" "webpool-001" {
  name    = "webpool-001"
  cluster = "${google_container_cluster.production-001.name}"
  zone    = "europe-west1-c"

  node_config {
    machine_type = "n1-standard-1"
    oauth_scopes = [
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
    ]
  }

  node_count = 3
  autoscaling {
    min_node_count = 3
    max_node_count = 10
  }

  management {
    auto_repair  = false
    auto_upgrade = false
  }
}

Debug Output

https://gist.github.com/vncntvandriessche/84c404a4950eb35abe6b3099ef8cc435

Panic Output

Expected Behavior

I expected Terraform to build the GKE cluster and attach the matching node pool without failures due to API errors.

Actual Behavior

We end up with a broken Terraform state because the API reports an error.

Steps to Reproduce

  1. terraform init
  2. terraform apply

Important Factoids

  • If we run apply again after this failure, Terraform fails because the pool already exists, even though it was never registered in the state (see the import sketch below).
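
One possible recovery sketch for that situation (assuming the pool really does exist in GCP but not in state): import it back into state so the next apply reconciles it instead of trying to recreate it. The import ID format shown here is an assumption and varies by provider version, so check the google_container_node_pool import documentation.

# Sketch: pull the orphaned node pool back into state after a failed apply.
# The ID format ("{zone}/{cluster}/{pool}") is an assumption; some provider
# versions expect a project-prefixed form instead.
terraform import google_container_node_pool.webpool-001 europe-west1-c/production-001/webpool-001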

References

  • #0000
@directionless

I think I'm hitting this, but with a slightly different set of actual behaviors.

  • On the first run, Terraform creates the cluster and the node pool, and then panics as described.
  • Apparently unknown to Terraform, the cluster and node pool are created.
  • The next run destroys the cluster and node pool and attempts to make them anew, triggering the same panic.

If I look at the GKE web UI, sometimes it tells me it's resizing the master, other times that it's creating the node pool. Outside of Terraform, I have found that changing node pools can result in long apiserver unavailability while it resizes.

For me, it pretty consistently fails at 13 minutes, which makes it look like a timeout is being hit. But the underlying code has a 30-minute timeout, so that seems like an interesting discrepancy.

@directionless

Testing some more... the google_container_cluster seems fine; it's the addition of google_container_node_pool that causes errors.

If I comment out google_container_node_pool, it applies fine and I get a GKE cluster. But if I add it back in, the apply bombs out at 13 minutes, even though the node pool is created anyway. Subsequent applies remove the prior node pool, then time out at 13 minutes, and the cycle repeats.

@cepefernando

cepefernando commented Sep 12, 2018

I have faced the same issue. After some troubleshooting I noticed that this error appears when the node pool has the autoscaling parameter. As a temporary fix, if you remove that node pool and add a node pool without autoscaling enabled, it should work.

@nat-henderson
Contributor

Yes, this is an unfortunate error being returned from GKE because the configuration you're pushing is causing it to be unavailable at the 10m mark (which I believe is the current timeout). If you believe that @directionless is correct and that the apiserver will become available again sometime after that, you can increase the timeout for create (or update, if you're hitting this on update) to a sufficiently long window. As a non-k8s expert, I unfortunately can't say for sure, but it certainly feels right. :)

Google's Terraform provider cannot validate your GKE config - there are too many possible configurations for us to be confident we are blocking the ones that will not work while allowing all valid configs. The only change we can really make is to make sure that the node pool does end up in state. I'm happy to add that. I'll try to figure that out and send a PR.
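
For example, raising the create/update timeouts in the resource block might look roughly like this (a sketch only — the values are assumptions, and whether a longer window actually helps depends on the apiserver recovering):

resource "google_container_node_pool" "webpool-001" {
  # ... existing arguments as in the config above ...

  # Assumed values, raising create/update above the provider defaults.
  timeouts {
    create = "60m"
    update = "60m"
  }
}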

@jamielennox
Contributor

So I don't think it's a timeout issue. The create timeout is already 30 minutes (we might want to set update to the same):

		Timeouts: &schema.ResourceTimeout{
			Create: schema.DefaultTimeout(30 * time.Minute),
			Update: schema.DefaultTimeout(10 * time.Minute),
			Delete: schema.DefaultTimeout(10 * time.Minute),
		},

The problem seems to be that the API returns DONE. The logs start overwriting each other, so I got this last message from mitmproxy:

{
    "detail": "All cluster resources were brought up, but the cluster API is reporting that: component \"kube-apiserver\" from endpoint \"gke-a2ef596d3e557814a5cb-2e7e\" is unhealthy\ngoroutine 425382131 [running]:\nruntime/debug.Stack(0xc01b85d51b, 0x3, 0x2dc1b7a)\n\tthird_party/go/gc/src/runtime/debug/stack.go:24 +0xa7\ngoogle3/cloud/kubernetes/engine/common/errdesc.(*GKEErrorDescriptor).createErr(0x55277e0, 0xc0004fa380)\n\tcloud/kubernetes/engine/common/error_desc.go:199 +0x26\ngoogle3/cloud/kubernetes/engine/common/errdesc.(*GKEErrorDescriptor).WithDetail(0x55277e0, 0x312a4a0, 0xc0087d54e0, 0xc0087d54e0, 0x3121ac0)\n\tcloud/kubernetes/engine/common/error_desc.go:166 +0x40\ngoogle3/cloud/kubernetes/engine/common/healthcheck.glob..func1.1(0x0, 0xc00f5b75c0)\n\tcloud/kubernetes/engine/common/healthcheck.go:141 +0x7bb\ngoogle3/cloud/kubernetes/engine/common/call.WithTimeout(0x318d620, 0xc017877770, 0x77359400, 0x8bb2c97000, 0xc024bedd08, 0xc017877770, 0xc012577180)\n\tcloud/kubernetes/engine/common/call.go:36 +0x153\ngoogle3/cloud/kubernetes/engine/common/healthcheck.glob..func1(0x318d620, 0xc017877770, 0xc024cac000, 0xc0021b1500, 0xc005eedc70, 0x8bb2c97000, 0x0, 0x0)\n\tcloud/kubernetes/engine/common/healthcheck.go:137 +0x33b\ngoogle3/cloud/kubernetes/engine/server/deploy/deploy.upgradeMasterAndVerify.func3(0xc002f1e180, 0x318d560, 0xc0173b4040, 0x7fd96c0576f8, 0xc00ea74880, 0xc021551d80, 0x0, 0xc026875ef0, 0xc024cac000, 0xc0021b1500, ...)\n\tcloud/kubernetes/engine/server/deploy/update.go:969 +0x1b3\ngoogle3/cloud/kubernetes/engine/server/deploy/deploy.upgradeMasterAndVerify(0x318d560, 0xc0173b4040, 0xc002f1e180, 0x7fd96c0576f8, 0xc00ea74880, 0xc024cac000, 0xc021551d80, 0x0, 0x1, 0x0, ...)\n\tcloud/kubernetes/engine/server/deploy/update.go:975 +0x13f\ngoogle3/cloud/kubernetes/engine/server/deploy/deploy.(*Deployer).recreateMasterReplicas.func2(0x0, 0x0)\n\tcloud/kubernetes/engine/server/deploy/update.go:546 +0x23c\ngoogle3/cloud/kubernetes/engine/common/errors.CollectFns.func1(0xc00ed534a0, 0xc0087f6c80)\n\tcloud/kubernetes/engine/common/errors.go:162 +0x27\ncreated by google3/cloud/kubernetes/engine/common/errors.CollectFns\n\tcloud/kubernetes/engine/common/errors.go:162 +0x82\n.",
    "endTime": "2018-09-13T01:36:18.837939633Z",
    "name": "operation-1536801745861-34ba47a8",
    "operationType": "CREATE_NODE_POOL",
    "selfLink": "https://container.googleapis.com/v1beta1/projects/1111111/zones/australia-southeast1-a/operations/operation-1536801745861-34ba47a8",
    "startTime": "2018-09-13T01:22:25.861642499Z",
    "status": "DONE",
    "statusMessage": "All cluster resources were brought up, but the cluster API is reporting that: component \"kube-apiserver\" from endpoint \"gke-a2ef596d3e557814a5cb-2e7e\" is unhealthy\ngoroutine 425382131 [running]:\nruntime/debug.Stack(0xc01b85d51b, 0x3, 0x2dc1b7a)\n\tthird_party/go/gc/src/runtime/debug/stack.go:24 +0xa7\ngoogle3/cloud/kubernetes/engine/common/errdesc.(*GKEErrorDescriptor).createErr(0x55277e0, 0xc0004fa380)\n\tcloud/kubernetes/engine/common/error_desc.go:199 +0x26\ngoogle3/cloud/kubernetes/engine/common/errdesc.(*GKEErrorDescriptor).WithDetail(0x55277e0, 0x312a4a0, 0xc0087d54e0, 0xc0087d54e0, 0x3121ac0)\n\tcloud/kubernetes/engine/common/error_desc.go:166 +0x40\ngoogle3/cloud/kubernetes/engine/common/healthcheck.glob..func1.1(0x0, 0xc00f5b75c0)\n\tcloud/kubernetes/engine/common/healthcheck.go:141 +0x7bb\ngoogle3/cloud/kubernetes/engine/common/call.WithTimeout(0x318d620, 0xc017877770, 0x77359400, 0x8bb2c97000, 0xc024bedd08, 0xc017877770, 0xc012577180)\n\tcloud/kubernetes/engine/common/call.go:36 +0x153\ngoogle3/cloud/kubernetes/engine/common/healthcheck.glob..func1(0x318d620, 0xc017877770, 0xc024cac000, 0xc0021b1500, 0xc005eedc70, 0x8bb2c97000, 0x0, 0x0)\n\tcloud/kubernetes/engine/common/healthcheck.go:137 +0x33b\ngoogle3/cloud/kubernetes/engine/server/deploy/deploy.upgradeMasterAndVerify.func3(0xc002f1e180, 0x318d560, 0xc0173b4040, 0x7fd96c0576f8, 0xc00ea74880, 0xc021551d80, 0x0, 0xc026875ef0, 0xc024cac000, 0xc0021b1500, ...)\n\tcloud/kubernetes/engine/server/deploy/update.go:969 +0x1b3\ngoogle3/cloud/kubernetes/engine/server/deploy/deploy.upgradeMasterAndVerify(0x318d560, 0xc0173b4040, 0xc002f1e180, 0x7fd96c0576f8, 0xc00ea74880, 0xc024cac000, 0xc021551d80, 0x0, 0x1, 0x0, ...)\n\tcloud/kubernetes/engine/server/deploy/update.go:975 +0x13f\ngoogle3/cloud/kubernetes/engine/server/deploy/deploy.(*Deployer).recreateMasterReplicas.func2(0x0, 0x0)\n\tcloud/kubernetes/engine/server/deploy/update.go:546 +0x23c\ngoogle3/cloud/kubernetes/engine/common/errors.CollectFns.func1(0xc00ed534a0, 0xc0087f6c80)\n\tcloud/kubernetes/engine/common/errors.go:162 +0x27\ncreated by google3/cloud/kubernetes/engine/common/errors.CollectFns\n\tcloud/kubernetes/engine/common/errors.go:162 +0x82\n.",
    "targetLink": "https://container.googleapis.com/v1beta1/projects/111111/zones/australia-southeast1-a/clusters/jamie-test/nodePools/jamie-test-nodes",
    "zone": "australia-southeast1-a"
}

So DONE with a statusMessage is being passed back as a failure from the command. Our choices would seem to be either to ignore the error and fetch the node pool information again from Google, or to figure out why the Google APIs changed to start returning a failure.
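
For illustration, a minimal sketch (not the provider's actual code) of that condition, using the generated Go client for the Container API — an operation that reports DONE but still carries a statusMessage is what surfaces as a failure:

package main

import (
	"fmt"

	container "google.golang.org/api/container/v1"
)

// operationError treats a "DONE" operation that still carries a statusMessage
// as a failure, which is the behaviour observed in this issue.
func operationError(op *container.Operation) error {
	if op.Status == "DONE" && op.StatusMessage != "" {
		return fmt.Errorf("operation %s finished with error: %s", op.Name, op.StatusMessage)
	}
	return nil
}

func main() {
	// Values abbreviated from the mitmproxy capture above.
	op := &container.Operation{
		Name:          "operation-1536801745861-34ba47a8",
		Status:        "DONE",
		StatusMessage: "component \"kube-apiserver\" ... is unhealthy",
	}
	if err := operationError(op); err != nil {
		fmt.Println(err)
	}
}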

@jamielennox
Contributor

Note that this seems to happen regardless of the remove_default_node_pool setting, which I could see causing the API to not be ready yet.

@directionless

I started looking at this again.

I ran terraform apply, and 12m 30s later got the same error. This time I also noticed it in the web console, and the stack dump pretty clearly shows that the Kubernetes apiserver is failing its health check. (Y'all might have noticed that already.)

I opened a Google support case about it. Between that and the consistent 12m 30s, something seems fishy.

directionless added a commit to directionless/terraform-provider-google that referenced this issue Sep 14, 2018
As discussed in [issue/2022](hashicorp#2022), google is returning some odd data from a node pool create. 

From what I can tell, the underlying request succeeds, but there's an apiserver problem and the health check is failing. So this is a pretty coarse hammer to work around it. Hopefully, Google will fix it.
@directionless

directionless commented Sep 14, 2018

Google support says they can reproduce this, so that's positive. Meanwhile, I made a patch to ignore that error. I'll PR it if you want, but it's a bit ugly.

master...directionless:workaround-2022

Though my apply now succeeds, I think I'm now running into #1712.

@danawillow
Contributor

Cool, I also filed an issue internally against the team, so hopefully between your issue and mine, we'll be able to get to the bottom of this.

Just in case it was lost in the comments, @cepefernando pointed out that this seems to only happen when autoscaling is configured, so one other thing to try would be to create the node pool without autoscaling, and then add autoscaling afterwards.

@JackFazackerley

@danawillow I just created a cluster with one node_pool without autoscaling and it was successful. I then added the autoscaling to the existing cluster and it updated in-place successfully. No errors, and Terraform kept the state of the node_pool.

It's an annoying way around the error but a working one for now.

@wibobm

wibobm commented Sep 16, 2018

This happens using the Google console to create a new cluster as well.

@guillaumeeb

Just got this issue without using terraform too...

finished with error: All cluster resources were brought up, but the cluster API is reporting that: component "kube-apiserver" from endpoint "gke-c......" is unhealthy

@Legogris

Legogris commented Sep 18, 2018

EDIT: PEBKAC here, just keeping this comment for conversation context.

I am having a different issue which is potentially related: Creating a GKE cluster with Terraform creates no default node pool.

Terraform v0.11.7
Google provider v1.16

resource "google_container_cluster" "y" {
  name               = "y"
  project            = "${google_project.project.project_id}"
  zone               = "us-east1-b"

  additional_zones = [
    "us-east1-c",
    "us-east1-d"
  ]

  initial_node_count = 2

  maintenance_policy {
    daily_maintenance_window {
      start_time = "11:00"
    }
  }

  remove_default_node_pool = true
  node_config {

    machine_type = "n1-standard-1"

    oauth_scopes = [
      "https://www.googleapis.com/auth/devstorage.read_only",
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
      "https://www.googleapis.com/auth/service.management.readonly",
      "https://www.googleapis.com/auth/servicecontrol",
      "https://www.googleapis.com/auth/trace.append",
      "https://www.googleapis.com/auth/compute"
    ]

    labels {
      stack = "x"
    }

    tags = [ "x" ]
  }
}

@JackFazackerley

@Legogris you're setting remove_default_node_pool = true. Remove that line if you want the default node pool.

@Legogris

@JackFazackerley: Derp, I somehow managed to gloss over that line every time I looked at my template even as I edited it for pasting. Thanks.

@directionless

directionless commented Sep 19, 2018

@wibobm Happens via the console? That's super interesting. Do you happen to have a screenshot, or the specifics of what you set for that? I found a gcloud reproduction and will write up more tonight.

@danawillow
Contributor

FYI to all- I'm tracking this issue internally and the GKE team is working very hard on it. I'm leaving this issue open since it's not resolved yet, but the issue is not Terraform-specific. I'll update again once I have more I can say.

@directionless

directionless commented Sep 20, 2018

@danawillow Cool. Sounds like y'all have enough of a reproduction. My support ticket has been less productive :)

From the gcloud command line, it definitely seems like the resizing you pointed at.

@JackFazackerley

Google Cloud support have just got back with a solution for the issue:

Description:
We are investigating an issue with Google Kubernetes Engine. Customers may receive an error like: "All cluster resources were brought up, but the cluster API is reporting that: component kube-apiserver from endpoint gke-HASH is unhealthy" when they are creating a NodePool with Autoscaling enabled on 1.9.x clusters. We will provide more information by Thursday, 2018-09-20 10:45 US/Pacific.

Workaround:
Customers can work around this by:

  1. Create the NodePool without Autoscaling, then enable Autoscaling once that's complete.
  2. Upgrade to 1.10.
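
In Terraform terms, that two-step workaround would look roughly like the sketch below (illustrative only, reusing the names from the configuration at the top of this issue): apply the pool without the autoscaling block first, then add the block back and apply again so it is enabled on the existing pool.

# Step 1: first apply — create the pool with a fixed node_count and no autoscaling block.
resource "google_container_node_pool" "webpool-001" {
  name       = "webpool-001"
  cluster    = "${google_container_cluster.production-001.name}"
  zone       = "europe-west1-c"
  node_count = 3
}

# Step 2: second apply — re-add the autoscaling block so it is enabled in place:
#   autoscaling {
#     min_node_count = 3
#     max_node_count = 10
#   }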

@edevil

edevil commented Sep 21, 2018

I'm creating a 1.10 cluster and also have this issue.

@JackFazackerley

@edevil oh... I'll get back to them. Cheers for trying.

@edevil

edevil commented Sep 21, 2018

@JackFazackerley creating the nodepool without autoscaling and enabling it afterwards worked though.

@teh

teh commented Sep 23, 2018

I also see this with 1.10.

In addition, when this error occurs I also see another, potentially related behaviour: pods scheduled on the first node created (the same node as kube-dns) can't resolve any DNS queries, though pinging other pods works fine. It's a bit random, but maybe it helps someone. (similar report)

@JackFazackerley

Google Cloud support got back to me again with the following:

The issue with Google Kubernetes Engine NodePool has been resolved for all affected users as of Saturday, 2018-09-22 09:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

I have tested this myself and it is working fine.

@vncntvandriessche
Author

@JackFazackerley That's great news!

A big thanks to everyone who was involved with this issue! Never expected this to be handled so quickly.

I'll close this issue as I'd say this is no longer an issue.

@ghost

ghost commented Nov 16, 2018

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 hashibot-feedback@hashicorp.com. Thanks!

@ghost ghost locked and limited conversation to collaborators Nov 16, 2018