Impossible to reliably create a GKE cluster using terraform #2022
Comments
I think I'm hitting this, but with a slightly different set of actual behaviors.
If I look at the GKE web UI, sometimes it tells me it's resizing the master, other times that it's creating the node pool. Outside of Terraform, I have found that changing node pools can result in long apiserver unavailability while the master resizes. For me, it pretty consistently fails at 13 minutes, which makes it act a bit like there's a timeout. But the underlying code appears to have a 30-minute timeout, so that seems like an interesting discrepancy.
Testing some more.... If I comment out
I have faced the same issue. After some troubleshooting I noticed that this error appears when the node pool has the autoscaling parameter. As a temporary fix, if you remove that node pool and add a node pool without autoscaling enabled, it should work.
Yes, this is an unfortunate error being returned from GKE because the configuration you're pushing is causing it to be unavailable at the 10m mark (which I believe is the current timeout). If you believe that @directionless is correct and that the apiserver will become available again sometime after that, you can increase the timeout for create (or update, if you're hitting this on update) to a sufficiently long window. As a non-k8s expert, I unfortunately can't say for sure, but it certainly feels right. :)

Google's Terraform provider cannot validate your GKE config: there are too many possible configurations for us to be confident we are blocking the ones that will not work while allowing all valid configs. The only change we can really make is to make sure that the node pool does end up in state. I'm happy to add that; I'll try to figure it out and send a PR.
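For anyone who wants to try the longer-timeout route mentioned above, here is a minimal sketch of raising the operation timeouts on the cluster resource. The names and durations are illustrative only, not taken from the original report; the same timeouts block can also be set on google_container_node_pool.

```hcl
resource "google_container_cluster" "example" {
  name               = "example-cluster"   # hypothetical name
  zone               = "europe-west1-b"    # hypothetical zone
  initial_node_count = 1

  # Keep polling past the default window so a slow master
  # resize/health-check cycle doesn't fail the apply.
  timeouts {
    create = "60m"
    update = "60m"
  }
}
```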
So I don't think it's a timeout issue. That's already 30 minutes for create (you might want to set the same for update).
The problem seems to be that the API returns done. The logs start overwriting each other, so I got this last message from mitmproxy:
Note that this seems to happen regardless of the
I started looking at this again. I ran terraform apply, and 12m 30s later got the same error. This time I also noticed it in the web console, and the stack dump pretty clearly shows the Kubernetes apiserver failing its health check. (Y'all might have noticed that already.) I opened a Google support case about it. Between that and the consistent 12m 30s, something seems fishy.
As discussed in [issue/2022](hashicorp#2022), Google is returning some odd data from a node pool create. From what I can tell, the underlying request succeeds, but there's an apiserver problem and the health check is failing. So this is a pretty coarse hammer to work around it. Hopefully, Google will fix it.
Google support says they can reproduce this, so that's positive. Meanwhile, I made a patch to ignore that error. I'll PR it if you want, but it's a bit ugly: master...directionless:workaround-2022. Though my apply now succeeds, I think I'm now running into #1712
Cool, I also filed an issue internally against the team, so hopefully between your issue and mine we'll be able to get to the bottom of this. Just in case it was lost in the comments, @cepefernando pointed out that this seems to only happen when autoscaling is configured, so one other thing to try would be to create the node pool without autoscaling and then add autoscaling afterwards.
@danawillow I just created a cluster with one node_pool without autoscaling and it was successful. I then added autoscaling to the existing cluster and it updated in place successfully. No errors, and Terraform kept the node_pool in state. It's an annoying way around the error, but a working one for now.
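For reference, a rough sketch of the two-step workaround described above (names and values are illustrative, not the original configuration): apply the node pool once without autoscaling, then add the block and apply again so it is enabled as an in-place update.

```hcl
resource "google_container_node_pool" "primary" {
  name               = "primary-pool"      # hypothetical name
  zone               = "europe-west1-b"    # hypothetical zone
  cluster            = "example-cluster"
  initial_node_count = 3

  # Step 1: apply with this block commented out so the pool creates cleanly.
  # Step 2: uncomment it and apply again; autoscaling is then enabled
  # as an in-place update on the existing pool.
  # autoscaling {
  #   min_node_count = 1
  #   max_node_count = 5
  # }
}
```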
This happens using the Google console to create a new cluster as well. |
Just got this issue without using terraform too...
EDIT: PEBKAC here, just keeping this comment for conversation context. I am having a different issue which is potentially related: Creating a GKE cluster with Terraform creates no default node pool. Terraform v0.11.7
@Legogris you're setting
@JackFazackerley: Derp, I somehow managed to gloss over that line every time I looked at my template, even as I edited it for pasting. Thanks.
@wibobm Happens via the console? That's super interesting.
FYI to all: I'm tracking this issue internally and the GKE team is working very hard on it. I'm leaving this issue open since it's not resolved yet, but the issue is not Terraform-specific. I'll update again once I have more I can say.
@danawillow Cool. Sounds like y'all have enough of a reproduction. My support ticket has been less productive :) From the
Google Cloud support has just got back with a solution for the issue:

Description:

Workaround:
I'm creating a 1.10 cluster and also have this issue.
@edevil oh... I'll get back to them. Cheers for trying.
@JackFazackerley creating the node pool without autoscaling and enabling it afterwards worked, though.
I also see this with 1.10. In addition, when this error occurs I see another, potentially related, behaviour where pods scheduled on the first node created (the same node as kube-dns) can't resolve any DNS queries; pinging other pods works fine though. It's a bit random, but maybe it helps someone. (similar report)
Google Cloud support got back to me again with the following: "The issue with Google Kubernetes Engine NodePool has been resolved for all affected users as of Saturday, 2018-09-22 09:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence." I have tested this myself and it is working fine.
@JackFazackerley That's great news! A big thanks to everyone who was involved with this issue! Never expected this to be handled so quickly. I'll close this since it's no longer an issue.
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 hashibot-feedback@hashicorp.com. Thanks!
Community Note
Terraform Version
Affected Resource(s)
As far as I've tested, at least the following resources are affected:
google_container_cluster
google_container_node_pool
Terraform Configuration Files
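The reporter's full configuration isn't reproduced in this excerpt (see the debug output gist below). Purely as an illustration, a minimal configuration exercising the two affected resources, including the autoscaling block the comments identify as the trigger, might look like the following; all names and values are hypothetical.

```hcl
provider "google" {
  project = "example-project"   # hypothetical project
  region  = "europe-west1"
}

resource "google_container_cluster" "gke" {
  name               = "example-cluster"
  zone               = "europe-west1-b"
  initial_node_count = 1
}

resource "google_container_node_pool" "workers" {
  name               = "worker-pool"
  zone               = "europe-west1-b"
  cluster            = "${google_container_cluster.gke.name}"
  initial_node_count = 1

  # The comments above point to autoscaling as the trigger for the error.
  autoscaling {
    min_node_count = 1
    max_node_count = 3
  }
}
```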
Debug Output
https://gist.github.com/vncntvandriessche/84c404a4950eb35abe6b3099ef8cc435
Panic Output
Expected Behavior
I expected Terraform to build the GKE cluster and attach the matching node pool, without failures due to API errors.
Actual Behavior
We are getting a broken Terraform state because the API reports an error.
Steps to Reproduce
terraform init
terraform apply
Important Factoids
References