
Make wait-for-cluster logic more robust #676

Merged · 2 commits · Sep 19, 2020

Conversation

@MrBlaise (Contributor)

The status is PROVISIONING at the beginning, so the original logic always exits immediately and reports "Cluster is ready!". If we instead wait for the RUNNING status, we know for sure the cluster is up.

Note: if the initial cluster node pool is removed it will say RUNNING for a second then go back to RECONCILING again while the node pool is removed. This might cause issues as well.
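The change described above can be sketched as a polling loop that keeps waiting until the status is RUNNING. This is a hedged, illustrative sketch, not the module's exact script: the function name is invented, and the status lookup is passed in as a command so the loop itself is self-contained.

```shell
#!/usr/bin/env bash
# Illustrative sketch of the revised wait loop (not the module's exact code).
# Waiting for RUNNING also covers the initial PROVISIONING state, which a
# "not RECONCILING" check would pass through immediately.
wait_for_running() {
  # "$@" is any command that prints the cluster status, e.g. an invocation of:
  #   gcloud container clusters list --filter="name=..." --format="value(status)"
  local current_status
  current_status=$("$@")
  while [[ "${current_status}" != "RUNNING" ]]; do
    echo "Waiting for cluster (status: ${current_status:-unknown})..."
    sleep 5
    current_status=$("$@")
  done
  echo "Cluster is ready!"
}
```

Passing the status command as an argument is only for the sketch's testability; the real script calls gcloud inline.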

comment-bot-dev commented Sep 16, 2020

Thanks for the PR! 🚀
✅ Lint checks have passed.

@bharathkkb (Member) left a comment

Thanks for the PR. This makes sense although I am curious. The provider should ensure that we are always past the PROVISIONING state. Did you encounter this particular case where the cluster was provisioned and Terraform exited successfully but the cluster was still in a PROVISIONING state?

> if the initial cluster node pool is removed it will say RUNNING for a second then go back to RECONCILING again while the node pool is removed. This might cause issues as well.

We wait for the new nodepool to be provisioned before running the wait for cluster. Does that help?

@MrBlaise (Contributor, Author)

> Thanks for the PR. This makes sense although I am curious. The provider should ensure that we are always past the PROVISIONING state. Did you encounter this particular case where the cluster was provisioned and Terraform exited successfully but the cluster was still in a PROVISIONING state?
>
> > if the initial cluster node pool is removed it will say RUNNING for a second then go back to RECONCILING again while the node pool is removed. This might cause issues as well.
>
> We wait for the new nodepool to be provisioned before running the wait for cluster. Does that help?

You are right, my bad. I reused the same wait-for-cluster logic from this module in a different module of mine, mixed the two up, and thought I had hit this error in the gke module. It works well; it only starts checking after the node pool is created.

It might still be worth considering this PR, though, because waiting for the cluster to be RUNNING makes more sense to me than waiting for it to not be RECONCILING.

@bharathkkb (Member) left a comment

LGTM
/cc @morgante

@morgante (Contributor) left a comment

This logic looks more robust, though I see tests are failing?

@bharathkkb (Member)

@morgante ah yes these were not made in autogen.
@MrBlaise could you make these changes here and run make build.

@MrBlaise (Contributor, Author)

@bharathkkb Sure, I’ll do it today :)

@MrBlaise MrBlaise changed the title Fix wait-for-cluster logic Make wait-for-cluster logic more robust Sep 18, 2020
@MrBlaise (Contributor, Author)

> @morgante ah yes these were not made in autogen.
> @MrBlaise could you make these changes here and run make build.

Should be done.

@bharathkkb (Member)

/gcbrun

@bharathkkb bharathkkb merged commit dffb047 into terraform-google-modules:master Sep 19, 2020
@MrBlaise MrBlaise deleted the patch-1 branch September 20, 2020 14:43
greg-bumped commented Oct 20, 2020

I don't know if it's somehow just me, but this change to make the script more robust breaks functionality that worked on Terraform Cloud. I've tried a variety of workarounds, including "skip_provisioners", and nothing worked.
It gets stuck in a loop of "gcloud not found". The script needs to revert to the previous behavior if it can't find gcloud!
This is not robust enough.

Looking at the changes, I think I may be confused: the gcloud requirement already existed at this point. Apologies...
Here's a snippet from the previous version:

    module.gke.google_container_cluster.primary: Still modifying... [id=projects/engineering-13/location...clusters/eng-gke-on-vpc-cluster-c57691, 10s elapsed]
    module.gke.google_container_cluster.primary: Still modifying... [id=projects/engineering-13/location...clusters/eng-gke-on-vpc-cluster-c57691, 20s elapsed]
    module.gke.google_container_cluster.primary: Still modifying... [id=projects/engineering-13/location...clusters/eng-gke-on-vpc-cluster-c57691, 30s elapsed]

And the version that fails:

    ..terraform/modules/gke/modules/private-cluster/scripts/wait-for-cluster.sh: line 29: gcloud: command not found
    ..terraform/modules/gke/modules/private-cluster/scripts/wait-for-cluster.sh: line 29: gcloud: command not found
    ..terraform/modules/gke/modules/private-cluster/scripts/wait-for-cluster.sh: line 29: gcloud: command not found

greg-bumped commented Oct 20, 2020

The specific line that changed and has caused an infinite loop is:

    [[ "${current_status}" == "RECONCILING" ]]

which became

    [[ "${current_status}" != "RUNNING" ]]

And in the case where Terraform does NOT have access to gcloud (such as on Terraform Cloud), this causes an infinite loop, because current_status will never be "RUNNING"...

I did try setting "skip_provisioners = true" in the private cluster module, but it still called /wait-for-cluster.sh: and looped indefinitely.

Additional reference: https://xkcd.com/1172/
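The failure mode described above can be demonstrated without gcloud at all: when the binary is missing, the command substitution produces an empty string, which the old condition treats as "done looping" but the new condition treats as "keep waiting". A minimal sketch (the fake command name is deliberately nonexistent):

```shell
#!/usr/bin/env bash
# Simulate the script's status lookup on a machine without gcloud: the command
# fails, so current_status ends up as the empty string.
current_status=$(definitely-not-gcloud container clusters list 2>/dev/null || true)

# Old exit condition: loop only while RECONCILING -> empty string exits at once.
if [[ "${current_status}" == "RECONCILING" ]]; then old="loops"; else old="exits"; fi

# New exit condition: loop until RUNNING -> empty string never matches, loops forever.
if [[ "${current_status}" != "RUNNING" ]]; then new="loops"; else new="exits"; fi

echo "old check: ${old}, new check: ${new}"
```

So the old check silently masked a missing gcloud, while the new check surfaces it as an endless "command not found" loop.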

@morgante (Contributor)

@greg-bumped gcloud is a dependency, your issue isn't related to this change.

Please review the upgrade notes about how to force the install of gcloud: https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blob/master/docs/upgrading_to_v12.0.md#dropped-support-for-gcloud_skip_download-variable

greg-bumped commented Oct 20, 2020

The other modules seem to interact with gcloud just fine, but this script, for some reason beyond me, doesn't seem to run in a shell that has access to it. And Terraform Cloud isn't something I can remote into for diagnostics. This issue was swallowed previously, but this fix seems to have brought it to the forefront, and I've had to pin to a previous version.

How can it deploy clusters previously, but it's the script that causes an issue? Does that still mean that gcloud installation is the issue? It's getting the service account credentials from an env variable for the other infrastructure that's being deployed.

I apologize for my lack of experience in this regard... When I did give the script access to gcloud, it generated a different set of errors:

    WARNING: Could not open the configuration file: [/home/terraform/.config/gcloud/configurations/config_default].
    ERROR: (gcloud.container.clusters.list) You do not currently have an active account selected.
    Please run:

      $ gcloud auth login

    to obtain new credentials, or if you have already logged in with a
    different account:

      $ gcloud config set account ACCOUNT

    to select an already authenticated account to use.

Is this a bug in terraform cloud? I can't emphasize enough that this script is the only failure point.

Thanks for your time.

@morgante (Contributor)

@greg-bumped This script uses gcloud, but your error isn't related to the change made in this PR. It has always required gcloud.

You need to do two things:

  1. Ensure gcloud is installed (either by adding it to your environment or setting the environment variable GCLOUD_TF_DOWNLOAD="always").
  2. Ensure gcloud is authenticated. This can be done by setting the GOOGLE_APPLICATION_CREDENTIALS environment variable.
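The two steps above might look like this in a CI or Terraform Cloud environment. This is a hedged example: the key path is a placeholder, and GCLOUD_TF_DOWNLOAD is the variable named in the comment above.

```shell
#!/usr/bin/env bash
# 1. Force the module to install its own copy of gcloud.
export GCLOUD_TF_DOWNLOAD="always"

# 2. Point the Google provider at a service-account key (placeholder path).
export GOOGLE_APPLICATION_CREDENTIALS="${HOME}/sa-key.json"
```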

greg-bumped commented Oct 20, 2020

GOOGLE_APPLICATION_CREDENTIALS is set, otherwise none of the infrastructure would be deployed.
When I set this:

  source      = "terraform-google-modules/kubernetes-engine/google//modules/private-cluster"
  version     = "~> 11.1.0"

Everything works. Wasn't gcloud always a requirement? Nothing would deploy at all in that situation if those two suggestions weren't already satisfied, would it?

The authentication issue seems unique to the script. Even when I set GCLOUD_TF_DOWNLOAD, the copy of gcloud the script runs doesn't seem to pick up the GOOGLE_APPLICATION_CREDENTIALS environment variable, whereas the rest of the Terraform run does. I hope I'm making sense; I wasn't clear earlier that I had already done those steps, and that's what produced the auth errors.

I don't think the terraform would get past the first line of configuration without the credentials set, as it uses them to collect state, doesn't it? How could it even get to the script at all if it wasn't configured?

@morgante (Contributor)

@greg-bumped We changed how gcloud is installed/used so that might be why.

Terraform and gcloud actually authenticate slightly differently.

Can you try setting the CLOUDSDK_AUTH_CREDENTIAL_FILE_OVERRIDE environment variable?
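The suggestion above reflects the fact that gcloud does not read GOOGLE_APPLICATION_CREDENTIALS the way the Terraform provider does. A hedged sketch of wiring the two together (the key path is a placeholder):

```shell
#!/usr/bin/env bash
# Terraform's google provider reads this variable directly.
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/sa-key.json"

# gcloud authenticates separately; this override points it at the same
# service-account key file so both tools use identical credentials.
export CLOUDSDK_AUTH_CREDENTIAL_FILE_OVERRIDE="${GOOGLE_APPLICATION_CREDENTIALS}"
```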

CPL-markus pushed a commit to WALTER-GROUP/terraform-google-kubernetes-engine that referenced this pull request Jul 15, 2024

Co-authored-by: Bharath KKB <bharathkrishnakb@gmail.com>