Upgrading cluster master after cluster creation completes #3385
Comments
Are you able to consistently reproduce this with a config you can share? We run a large number of clusters in our CI environment and this has never come up. |
@rileykarson
After cluster creation completes, I always get a master node update. P.S. The same happens with version 1.12.5, and with 1.11 too. |
@yellowmegaman I suspect the issue here is that the master is resizing itself. On cluster creation, the master is generally sized to accommodate only ~5 nodes. Since your cluster has 7, the master is being upsized to accommodate the extra nodes. This autoscaling happens regardless of what the maintenance window is set to, since it is required to maintain the health of the cluster and isn't really routine maintenance. Note that the master currently only scales up, so once it reaches a certain size it isn't scaled back down. If you need an HA control plane, you should probably look into using a regional cluster. Otherwise, I suspect everything here is WAI (working as intended). |
@thefirstofthe300 thanks for shedding light on this! But still, I can't just wait for Terraform to bring up the cluster and then continue with further automation; I need to implement some kind of timeout and check, and that's what troubles me. The next best thing I can come up with: the docs should probably mention the possible resizing, because if you use the recommended example with, for instance, a batch job with a timeout, the batch job will fail during the cluster resize. |
@thefirstofthe300 I actually ran into the same thing here: #3249
This is exactly what I did but the Terraform provider does not handle this gracefully. The master is working fine on the side of kubernetes but Terraform fails because it is hardcoded to not continue if the cluster is in |
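One of the workarounds I found is to set the initial_node_count to match the expected number of nodes in the managed node pools. This way, the master is already right-sized after the default node pool is deleted and the managed pools are created, so an upgrade operation is not triggered after completion. Keep in mind when calculating initial_node_count with regional clusters that this value is per zone, so setting it to a value like 2 in a region with 3 zones creates 6 nodes. |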
I've done what Dan recommends, but then when I grow the initial node count
number and apply, tf comes back happy, while cluster gets into RECONCILING
mode (not ready to be happy) as it tries to grow the master. Hence I get a
sleep problem described in OP.
…On Tue, Apr 30, 2019, 16:22 Dan Isla ***@***.***> wrote:
One of the workarounds I found is to set the initial_node_count to match
the expected number of nodes in the managed node pools. This way, the
master is already right-sized after the default node pool is deleted and
the managed pools are created so an upgrade operation is not triggered
after completion.
Keep in mind when calculating initial_node_count with regional clusters
that this value is per zone, so setting it to a value like 2 in a region
with 3 zones creates 6 nodes.
|
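The per-zone arithmetic in the workaround quoted above can be sketched as follows. This is only an illustration, not provider code: the helper name `per_zone_initial_node_count` and its parameters are hypothetical, and it assumes the total node count of the managed pools is known up front and that nodes are spread evenly across zones.

```python
import math

def per_zone_initial_node_count(total_nodes: int, zones: int) -> int:
    """Return a per-zone initial_node_count that yields at least total_nodes.

    Hypothetical helper: on a regional cluster the value is applied per
    zone, so the cluster starts with (value * zones) nodes.
    """
    if total_nodes < 1 or zones < 1:
        raise ValueError("total_nodes and zones must be positive")
    return math.ceil(total_nodes / zones)

# From the thread: a value of 2 in a region with 3 zones creates 6 nodes.
print(per_zone_initial_node_count(6, 3))
# 7 nodes over 3 zones rounds up per zone, so the cluster starts with 9 nodes.
print(per_zone_initial_node_count(7, 3) * 3)
```

Rounding up means the cluster may start slightly larger than the managed pools need; the assumption here is that an oversized starting point is preferable to a master resize right after creation.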
Thanks @danisla for that workaround. I got it to work by just setting the |
Context here: hashicorp/terraform-provider-google#3385 [#166876339] Signed-off-by: Akshay Mankar <amankar@pivotal.io>
I am also facing this issue. I created a GKE cluster with the command below, and after increasing the load (using JMeter) on my deployed service, the cluster started upgrading the master. |
Hi, I also encountered this issue. In my case I add two additional node pools, and the cluster then upgrades automatically, which blocks the deployment that follows. |
@singh-ajeet it looks like you are using gcloud, which is not in scope for this channel. Please raise the issue with the gcloud team if you still have it. |
@yellowmegaman Please let us know if you still face this issue or you want me to close it.
|
@venkykuberan I just tried the code above again. I only had to edit the metadata field, adding '=' to make it work with 0.12.X. Everything is just the same: after the cluster and node pool are created, wait a few minutes. In my case the cluster started to upgrade after 2 minutes, with kubectl returning this:
So if I don't use a null resource to wait for some time and check connectivity to the cluster endpoint, Terraform will fail to apply resources to the cluster. For anyone interested, I'm currently using this code to bypass the issue. It's not ideal, and it may break on larger node pools because of longer upgrade times:
And so you can add other k8s-related modules here with |
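The wait-and-check idea described in this comment could look roughly like the sketch below. It is not the code from the comment (which is not preserved here) and not part of the Terraform provider. The function name, parameters, and defaults are all hypothetical. `get_status` is assumed to return the cluster status string, for example by wrapping `gcloud container clusters describe NAME --format="value(status)"`. Because the resize reportedly starts a couple of minutes after creation, a single RUNNING reading is not treated as enough: the status has to stay RUNNING for several consecutive checks.

```python
import time
from typing import Callable

def wait_until_stable(
    get_status: Callable[[], str],
    settle_checks: int = 10,
    interval_s: float = 30.0,
    timeout_s: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once get_status() reports RUNNING settle_checks times in a row.

    Any other status (for example RECONCILING during a master resize)
    resets the streak. Returns False if timeout_s elapses first.
    """
    deadline = clock() + timeout_s
    streak = 0
    while True:
        # Count consecutive RUNNING readings; anything else starts over.
        streak = streak + 1 if get_status() == "RUNNING" else 0
        if streak >= settle_checks:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The default values are arbitrary placeholders. As the comment notes, larger node pools may need a longer timeout, since upgrades take longer there.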
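@yellowmegaman could you try initial_node_count = 7 in your cluster config and see whether you can avoid the upgrade? With that config, the state was Running for me the whole time (I tried for about 5 minutes) when I queried the cluster from gcloud after cluster creation completed in Terraform. However, coming to the core issue: once Terraform gets the status that cluster creation is complete, it closes the loop and has no visibility into what GKE is doing on the cluster in the background until the next refresh call finds a difference. If GKE modifies the cluster outside of the maintenance window and that affects your automation flow, an issue can be raised directly against GKE here. Since you already have a workaround in place, shall I go ahead and close this issue? |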
What I'd love to see is the Terraform provider checking whether it needs
to wait for an ongoing operation on the cluster before attempting to apply
anything. I'd envision this being solved in a way where the provider
waits for both the operation and the Terraform application to complete
within the given Terraform application timeout, and otherwise triggers
the timeout.
In the case of these small clusters that are generally used for some kind
of development purpose, operations should finish up relatively quickly so I
see it alleviating more pain in these kinds of cases and not necessarily
introducing any kind of pain for operators of large clusters since they're
going to be prod clusters that are almost certainly not going to be running
operations outside of Terraform except in cases where a master needs
maintenance due to unhealthiness or some other similar reason.
…On Tue, Jan 21, 2020 at 4:20 PM venkykuberan ***@***.***> wrote:
@yellowmegaman <https://github.com/yellowmegaman> could you try initial_node_count
= 7 on your cluster config and see you can avoid the upgrade as with that
config, state was Running for me all the time (tried for about 5 mins)
when i hit cluster from gcloud after the cluster creation is complete from
terraform.
However coming to the core issue, Once Terraform gets the status as
cluster complete it will close the loop and it will not have any visibility
about what GKE is doing on the cluster in background until next refresh
call finds any difference. If GKE modifies the cluster outside of the
maintenance window and if it affects your automation flow an issue can be
raised directly against GKE here
<https://cloud.google.com/support/docs/issue-trackers>.
Since you already have work-around in-place. Shall i go head and close
this issue ?
--
Danny Seymour
dannyseeless@gmail.com
|
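The behaviour proposed in this comment, where waiting for in-flight cluster operations and the apply itself share one timeout, might be sketched like this. It is not existing provider behaviour. The function name and parameters are hypothetical, and `list_statuses` is assumed to return the status strings of the cluster's current operations, with `DONE` marking a finished one.

```python
import time
from typing import Callable, Iterable

def wait_for_operations(
    list_statuses: Callable[[], Iterable[str]],
    timeout_s: float,
    interval_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> float:
    """Block until every listed operation is DONE; return the seconds left.

    Raises TimeoutError when the budget runs out, so the wait and the
    subsequent apply draw from the same timeout.
    """
    deadline = clock() + timeout_s
    while any(status != "DONE" for status in list_statuses()):
        if clock() >= deadline:
            raise TimeoutError("cluster operations still in progress")
        sleep(interval_s)
    # Whatever remains of the budget is available for the apply step.
    return max(0.0, deadline - clock())
```

A caller would pass the returned value on as the timeout for the apply step, so a slow master resize shortens the time left for the apply instead of extending the overall limit.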
@venkykuberan Don't see any reason to try with initial node count = 7, since all users have different scenarios. I totally understand that this is happening outside terraform interaction loop, but was hoping to bring this issue to attention of someone developing GCP provider. We can close it, since it isn't terraform fault, but then we'll have to admit that: And is hugely sad. |
@yellowmegaman i didn't mean to use constant value 7 for the node count. I wanted to try the same node_count for default node pool as well, so that master is sized correctly in the initial creation time (doesn't have to resize it later). Also please let us know |
After node pools are added, the cluster begins to scale-up and can cause inaccessibility to the k8s master URL. Workaround from hashicorp/terraform-provider-google#3385
@yellowmegaman do you still want to keep the issue open ? or shall i close if you aren't looking anything from us here? |
No response. Assuming it is no longer an issue |
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. |
Community Note
Description
Currently, when creating a google_container_cluster the recommended way (with a separate node_pool), Terraform signals that everything is OK, but only a few moments later the cluster starts an upgrade that is not controlled by the GKE maintenance window.
So if I'm using Terraform to create a GKE cluster, I can't be sure that everything is OK the moment Terraform is done, and I can't safely proceed with further automation.
Idea: add an additional (configurable) timeout, after which Terraform checks whether the cluster is available.
Currently even the endpoint is unavailable while upgrades are being performed.
New or Affected Resource(s)
Potential Terraform Configuration
Any recommended configuration from https://www.terraform.io/docs/providers/google/r/container_cluster.html
References