
Make wait-for-cluster logic more robust #676

Merged · 2 commits · Sep 19, 2020

Conversation

@MrBlaise (Contributor)

The status is PROVISIONING at the beginning, so the original logic always exits immediately and reports "Cluster is ready!". If we instead wait for the RUNNING status, we know for sure the cluster is up.

Note: if the initial cluster node pool is removed it will say RUNNING for a second then go back to RECONCILING again while the node pool is removed. This might cause issues as well.
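The change described above can be sketched as a polling loop that keeps waiting until the status is RUNNING. This is a hedged, illustrative sketch, not the module's exact script: the function name is invented, and the status lookup is passed in as a command so the loop itself is self-contained.

```shell
#!/usr/bin/env bash
# Illustrative sketch of the revised wait loop (not the module's exact code).
# Waiting for RUNNING also covers the initial PROVISIONING state, which a
# "not RECONCILING" check would pass through immediately.
wait_for_running() {
  # "$@" is any command that prints the cluster status, e.g. an invocation of:
  #   gcloud container clusters list --filter="name=..." --format="value(status)"
  local current_status
  current_status=$("$@")
  while [[ "${current_status}" != "RUNNING" ]]; do
    echo "Waiting for cluster (status: ${current_status:-unknown})..."
    sleep 5
    current_status=$("$@")
  done
  echo "Cluster is ready!"
}
```

Passing the status command as an argument is only for the sketch's testability; the real script calls gcloud inline.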

comment-bot-dev commented Sep 16, 2020

Thanks for the PR! 🚀
✅ Lint checks have passed.

@bharathkkb (Member) left a comment

Thanks for the PR. This makes sense although I am curious. The provider should ensure that we are always past the PROVISIONING state. Did you encounter this particular case where the cluster was provisioned and Terraform exited successfully but the cluster was still in a PROVISIONING state?

> if the initial cluster node pool is removed it will say RUNNING for a second then go back to RECONCILING again while the node pool is removed. This might cause issues as well.

We wait for the new nodepool to be provisioned before running the wait for cluster. Does that help?

@MrBlaise (Contributor, Author)

> Thanks for the PR. This makes sense although I am curious. The provider should ensure that we are always past the PROVISIONING state. Did you encounter this particular case where the cluster was provisioned and Terraform exited successfully but the cluster was still in a PROVISIONING state?
>
> > if the initial cluster node pool is removed it will say RUNNING for a second then go back to RECONCILING again while the node pool is removed. This might cause issues as well.
>
> We wait for the new nodepool to be provisioned before running the wait for cluster. Does that help?

You are right, my bad. I reused the same wait-for-cluster logic from this module in a different module of mine, mixed the two up, and thought I had hit this error in the gke module. It works well; it only starts checking after the node pool is created.

It might still be worth considering this PR, though, because waiting for the cluster to be RUNNING makes more sense to me than waiting for it to not be RECONCILING.

@bharathkkb (Member) left a comment

LGTM
/cc @morgante

@morgante (Contributor) left a comment

This logic looks more robust, though I see tests are failing?

@bharathkkb (Member)

@morgante ah yes these were not made in autogen.
@MrBlaise could you make these changes here and run make build.

@MrBlaise (Contributor, Author)

@bharathkkb Sure, I’ll do it today :)

@MrBlaise MrBlaise changed the title Fix wait-for-cluster logic Make wait-for-cluster logic more robust Sep 18, 2020
@MrBlaise (Contributor, Author)

> @morgante ah yes these were not made in autogen.
> @MrBlaise could you make these changes here and run make build.

Should be done.

@bharathkkb (Member)

/gcbrun

@bharathkkb bharathkkb merged commit dffb047 into terraform-google-modules:master Sep 19, 2020
@MrBlaise MrBlaise deleted the patch-1 branch September 20, 2020 14:43
greg-bumped commented Oct 20, 2020

I don't know if it's somehow just me, but this change to make the script more robust breaks functionality that worked on Terraform Cloud. I've tried a variety of workarounds, including "skip_provisioners", and nothing worked.
It gets stuck in a loop of "gcloud not found". The script needs to revert to the previous behavior if it can't find gcloud!
This is not robust enough.

Looking at the changes, I think I may be confused: the gcloud requirement already existed at this point. Apologies...
Here's a snippet from the previous version:

    module.gke.google_container_cluster.primary: Still modifying... [id=projects/engineering-13/location...clusters/eng-gke-on-vpc-cluster-c57691, 10s elapsed]
    module.gke.google_container_cluster.primary: Still modifying... [id=projects/engineering-13/location...clusters/eng-gke-on-vpc-cluster-c57691, 20s elapsed]
    module.gke.google_container_cluster.primary: Still modifying... [id=projects/engineering-13/location...clusters/eng-gke-on-vpc-cluster-c57691, 30s elapsed]

And the version that fails:

    ..terraform/modules/gke/modules/private-cluster/scripts/wait-for-cluster.sh: line 29: gcloud: command not found
    ..terraform/modules/gke/modules/private-cluster/scripts/wait-for-cluster.sh: line 29: gcloud: command not found
    ..terraform/modules/gke/modules/private-cluster/scripts/wait-for-cluster.sh: line 29: gcloud: command not found

greg-bumped commented Oct 20, 2020

The specific line that changed and has caused an infinite loop is:

    [[ "${current_status}" == "RECONCILING" ]]

which became

    [[ "${current_status}" != "RUNNING" ]]

And in the case where Terraform does NOT have access to gcloud (such as on Terraform Cloud), this causes an infinite loop, because current_status will never be "RUNNING"...

I did try setting "skip_provisioners = true" in the private cluster module, but it still called /wait-for-cluster.sh: and looped indefinitely.

Additional reference: https://xkcd.com/1172/
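The failure mode described above can be demonstrated without gcloud at all: when the binary is missing, the command substitution produces an empty string, which the old condition treats as "done looping" but the new condition treats as "keep waiting". A minimal sketch (the fake command name is deliberately nonexistent):

```shell
#!/usr/bin/env bash
# Simulate the script's status lookup on a machine without gcloud: the command
# fails, so current_status ends up as the empty string.
current_status=$(definitely-not-gcloud container clusters list 2>/dev/null || true)

# Old exit condition: loop only while RECONCILING -> empty string exits at once.
if [[ "${current_status}" == "RECONCILING" ]]; then old="loops"; else old="exits"; fi

# New exit condition: loop until RUNNING -> empty string never matches, loops forever.
if [[ "${current_status}" != "RUNNING" ]]; then new="loops"; else new="exits"; fi

echo "old check: ${old}, new check: ${new}"
```

So the old check silently masked a missing gcloud, while the new check surfaces it as an endless "command not found" loop.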

@morgante (Contributor)

@greg-bumped gcloud is a dependency, your issue isn't related to this change.

Please review the upgrade notes about how to force the install of gcloud: https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blob/master/docs/upgrading_to_v12.0.md#dropped-support-for-gcloud_skip_download-variable

greg-bumped commented Oct 20, 2020

The other modules seem to interact with gcloud just fine, but this script, for some reason beyond me, doesn't seem to run in a shell that has access to it. And Terraform Cloud isn't something I can remote into for diagnostics. This issue was swallowed previously, but this fix seems to have brought it to the forefront, and I've had to pin to a previous version.

How can it deploy clusters previously, but it's the script that causes an issue? Does that still mean that gcloud installation is the issue? It's getting the service account credentials from an env variable for the other infrastructure that's being deployed.

I apologize for my lack of experience in this regard... When I did give the script access to gcloud, it generated a different set of errors:

    WARNING: Could not open the configuration file: [/home/terraform/.config/gcloud/configurations/config_default].
    ERROR: (gcloud.container.clusters.list) You do not currently have an active account selected.
    Please run:

      $ gcloud auth login

    to obtain new credentials, or if you have already logged in with a
    different account:

      $ gcloud config set account ACCOUNT

    to select an already authenticated account to use.

Is this a bug in terraform cloud? I can't emphasize enough that this script is the only failure point.

Thanks for your time.

@morgante (Contributor)

@greg-bumped This script uses gcloud, but your error isn't related to the change made in this PR. It has always required gcloud.

You need to do two things:

  1. Ensure gcloud is installed (either by adding it to your environment or setting the environment variable GCLOUD_TF_DOWNLOAD="always").
  2. Ensure gcloud is authenticated. This can be done by setting the GOOGLE_APPLICATION_CREDENTIALS environment variable.
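The two steps above might look like this in a CI or Terraform Cloud environment. This is a hedged example: the key path is a placeholder, and GCLOUD_TF_DOWNLOAD is the variable named in the comment above.

```shell
#!/usr/bin/env bash
# 1. Force the module to install its own copy of gcloud.
export GCLOUD_TF_DOWNLOAD="always"

# 2. Point the Google provider at a service-account key (placeholder path).
export GOOGLE_APPLICATION_CREDENTIALS="${HOME}/sa-key.json"
```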

greg-bumped commented Oct 20, 2020

GOOGLE_APPLICATION_CREDENTIALS is set, otherwise none of the infrastructure would be deployed.
When I set this:

  source      = "terraform-google-modules/kubernetes-engine/google//modules/private-cluster"
  version     = "~> 11.1.0"

Everything works. Wasn't gcloud always a requirement? Nothing would deploy at all in that situation if those two suggestions weren't already satisfied, would it?

The authentication issue seems unique to the script. Even when I set GCLOUD_TF_DOWNLOAD, the copy of gcloud the script runs doesn't seem to pick up the GOOGLE_APPLICATION_CREDENTIALS environment variable, whereas the rest of the Terraform run does. I hope I'm making sense; I wasn't clear earlier that I had already done those steps, and that's what produced the auth errors.

I don't think the terraform would get past the first line of configuration without the credentials set, as it uses them to collect state, doesn't it? How could it even get to the script at all if it wasn't configured?

@morgante (Contributor)

@greg-bumped We changed how gcloud is installed/used so that might be why.

Terraform and gcloud actually authenticate slightly differently.

Can you try setting the CLOUDSDK_AUTH_CREDENTIAL_FILE_OVERRIDE environment variable?
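The suggestion above reflects the fact that gcloud does not read GOOGLE_APPLICATION_CREDENTIALS the way the Terraform provider does. A hedged sketch of wiring the two together (the key path is a placeholder):

```shell
#!/usr/bin/env bash
# Terraform's google provider reads this variable directly.
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/sa-key.json"

# gcloud authenticates separately; this override points it at the same
# service-account key file so both tools use identical credentials.
export CLOUDSDK_AUTH_CREDENTIAL_FILE_OVERRIDE="${GOOGLE_APPLICATION_CREDENTIALS}"
```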

CPL-markus pushed a commit to WALTER-GROUP/terraform-google-kubernetes-engine that referenced this pull request Jul 15, 2024

Co-authored-by: Bharath KKB <bharathkrishnakb@gmail.com>