
Add kubectl provider in root module for blueprint with GKE cluster module setup #3406

Conversation

Contributor

@mohitchaurasia91 mohitchaurasia91 commented Dec 16, 2024

Why do we need this change?

When managing Kubernetes manifests in blueprints that use the kubectl-apply module, completely removing a kubectl-apply block from a blueprint results in a "Provider configuration not present" error.

  - id: workload_manager_install
    source: modules/management/kubectl-apply
    use: [gke_cluster]
    settings:
      kueue:
        install: true
      jobset:
        install: true

  - id: workload_manager_config
    source: modules/management/kubectl-apply
    use: [gke_cluster]
    settings:
      apply_manifests:
      - source: $(ghpc_stage("maxtext-gke-a3-files"))/config-map.yaml.tftpl
        template_vars: {name: "a3plus-benchmark-resources-configmap", num_nodes: "1"}
      - source: $(ghpc_stage("maxtext-gke-a3-files"))/kueue-credentials.yaml.tftpl
        template_vars: {num_chips: "8"}

A similar issue is observed for PersistentVolume and PersistentVolumeClaim resources when provisioning them with the gke-persistent-volume module, which internally uses the kubectl-apply module for Kubernetes manifest deployment.

  - id: data-bucket-pv
    source: modules/file-system/gke-persistent-volume
    use: [gke_cluster, data-bucket]
    settings: {capacity_gb: 5000}
    
  - id: shared-filestore-pv
    source: modules/file-system/gke-persistent-volume
    use: [gke_cluster, filestore]

Removing workload_manager_config or a persistent volume module and then recreating and redeploying gives us the error:

Error: Provider configuration not present

To work with
module.workload_manager_config.module.kubectl_apply_manifests["1"].kubectl_manifest.apply_doc["2"]
(orphan) its original provider configuration at
module.workload_manager_config.provider["registry.terraform.io/gavinbunney/kubectl"]
is required, but it has been removed. This occurs when a provider
configuration is removed while objects created by that provider still exist
in the state. Re-add the provider configuration to destroy
module.workload_manager_config.module.kubectl_apply_manifests["1"].kubectl_manifest.apply_doc["2"]
(orphan), after which you can remove the provider configuration again.

RCA

The issue stems from how the provider is defined. Unlike other Terraform providers, which are typically defined at the root level, the kubectl provider used by kubectl-apply is defined within the modules/management/kubectl-apply/providers.tf file inside the module itself.

This becomes problematic when a blueprint configuration that uses this module is deleted. While the module's folder remains on disk, Terraform effectively loses track of the module and the provider configuration defined inside it. Consequently, Terraform cannot determine which provider configuration to use when destroying the resources (the kubectl_manifest objects under kubectl_apply_manifests) that the removed module left in state.

This behavior is due to Terraform's handling of child modules and their providers. You can find a more detailed explanation in this Stack Overflow answer: https://stackoverflow.com/a/58403262
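
To make the failure mode concrete, here is a minimal, hypothetical sketch of a provider configuration that lives only inside a child module; it is not the Toolkit's actual source, and all variable and module names are illustrative:

# modules/management/kubectl-apply/providers.tf (inside the child module)
provider "kubectl" {
  host                   = var.cluster_endpoint            # illustrative variable names
  cluster_ca_certificate = base64decode(var.cluster_ca_certificate)
  token                  = var.access_token
  load_config_file       = false
}

# Root module of the deployment group
module "workload_manager_config" {
  source = "./modules/management/kubectl-apply"
  # ...
}

# Deleting the module block above means Terraform no longer loads the provider
# "kubectl" block that lives inside the module's folder, yet the kubectl_manifest
# resources it created are still recorded in state. Terraform is then left with
# orphaned resources and no provider configuration to destroy them with, which is
# the "Provider configuration not present" error shown above.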

Solution

Move the kubectl provider to the root module (under the TerraformProviders block) of the blueprint deployment group when the group contains a module that depends on the gke-cluster or pre-existing-gke-cluster module.
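
As an illustration of the intended end state, here is a hedged sketch of the kind of root-level provider configuration that could be generated for a deployment group containing a GKE cluster; the output names on module.gke_cluster and the exact wiring are assumptions, not the Toolkit's verbatim generated code:

terraform {
  required_providers {
    kubectl = {
      source  = "gavinbunney/kubectl"
      version = ">= 1.7.0"
    }
  }
}

data "google_client_config" "default" {}

# Root-level provider configuration: it outlives any individual kubectl-apply or
# gke-persistent-volume module, so orphaned kubectl_manifest resources can still
# be destroyed after their module is removed from the blueprint.
provider "kubectl" {
  host                   = "https://${module.gke_cluster.cluster_endpoint}"        # assumed output name
  cluster_ca_certificate = base64decode(module.gke_cluster.cluster_ca_certificate) # assumed output name
  token                  = data.google_client_config.default.access_token
  load_config_file       = false
  apply_retry_count      = 15
}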

Backward Compatibility Test

Tested this out for the example blueprints (storage-gke.yaml, gke-a3-highgpu.yaml); the steps were:

  • On the develop branch, perform gcluster deploy for the example blueprint.
  • Switch over to the feature branch and run make to update the gcluster binary.
  • Perform gcluster deploy; a warning was shown that the past deployment was performed on a different gcluster version, suggesting to use --force at the user's discretion.
Creating deployment folder "gke-a3-highgpu" ...
Error: ghpc_version has changed from "45068640-dirty" to "90aa74bb-dirty", using different versions of GHPC to update a live deployment is not officially supported
Hint: Use `--force` to overwrite the deployment anyway. Proceed at your own risk.
  • Used the --force flag and performed the deployment from the feature branch; it worked as expected.
  • Only recreation of kubectl-apply module related resources was required.
  • The scenario of removing and recreating the workload_manager_config and persistent volume modules worked as expected.

How did we test this out ?

  • Use the same setup as in the issue description. Remove the workload_manager_config module, then recreate and redeploy. Verify that the error does not occur and the infrastructure is removed as expected.
  • Use the same setup as in the issue description. Remove the persistent volume module, then recreate and redeploy. Verify that the error does not occur and the infrastructure is removed as expected.
  • Test for backward compatibility.
  • PR cloud builds for all GKE-related blueprints should have OK status.

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@mohitchaurasia91 mohitchaurasia91 added the release-module-improvements Added to release notes under the "Module Improvements" heading. label Dec 16, 2024
@mohitchaurasia91 mohitchaurasia91 changed the title Add kubectl provider in root module for blueprint with gke cluster module Add kubectl provider in root module for blueprint with GKE cluster module setup Dec 16, 2024
@mohitchaurasia91 mohitchaurasia91 self-assigned this Dec 16, 2024
@mohitchaurasia91 mohitchaurasia91 marked this pull request as ready for review December 17, 2024 06:13
Contributor

@ankitkinra ankitkinra left a comment


Thanks for testing this out thoroughly. Can you please add whether you performed the following tests:

Example blueprint with PV (storage)

Case 1

  • First deploy of the blueprint on the develop branch (without the changes) with PV
  • Switch branch and do make
  • Second deploy of the blueprint without any changes, which would typically not result in any diff

Result: Did it cause recreation of kubectl-created modules? Specifically, if the blueprint had storage, did it cause data loss?

Case 2

  • First deploy of the blueprint on the develop branch (without the changes) with PV
  • Switch branch and do make
  • Second deploy of the blueprint with only a nodepool scale-up, which would typically not result in recreation of any module.

Result: Did it cause recreation of kubectl-created modules? Specifically, if the blueprint had storage, did it cause data loss?

Case 3

  • First deploy of the blueprint on the develop branch (without the changes) with PV
  • Switch branch and do make
  • Second deploy of the blueprint without any changes, which would typically not result in any diff
  • Third deploy of the above blueprint after completion of the second deploy

Result: Did the third deploy of the blueprint cause any updates / loss of data?

Contributor Author

mohitchaurasia91 commented Dec 18, 2024


Thanks for adding a detailed description of test cases to further consider. Following are the results from those tests, using the storage-gke.yaml example blueprint.

Case 1: No diff observed (except for the local file output of the gke-job-template module, where we use the random_id provider). Kubernetes objects (PV and PVC) and data (Filestore and GCS) persisted across the migration; no data loss was observed.

Summary of proposed changes: Plan: 6 to add, 0 to change, 2 to destroy.
Running terraform apply on deployment group storage-gke-03/primary
module.ephemeral-storage-job.random_id.resource_name_suffix: Destroying... [id=pgE]
module.shared-fs-job.random_id.resource_name_suffix: Destroying... [id=Zd4]
module.shared-fs-job.random_id.resource_name_suffix: Destruction complete after 0s
module.ephemeral-storage-job.random_id.resource_name_suffix: Destruction complete after 0s
module.ephemeral-storage-job.random_id.resource_name_suffix: Creating...
module.shared-fs-job.random_id.resource_name_suffix: Creating...
module.ephemeral-storage-job.random_id.resource_name_suffix: Creation complete after 0s [id=VEM]
module.shared-fs-job.random_id.resource_name_suffix: Creation complete after 1s [id=kFA]
module.data-bucket-pv.local_file.debug_file: Creating...
module.data-bucket-pv.local_file.debug_file: Creation complete after 0s [id=d69386e3400928e6f3956408ce8f9a4b69360112]
module.shared-filestore-pv.local_file.debug_file: Creating...
module.shared-filestore-pv.local_file.debug_file: Creation complete after 0s [id=4f05710324270aa5b85fc45a937d278b1552086d]
module.shared-fs-job.local_file.job_template: Creating...
module.shared-fs-job.local_file.job_template: Creation complete after 0s [id=242d0b096cea5345338b0602f1d562148ada3df5]
module.ephemeral-storage-job.local_file.job_template: Creating...
module.ephemeral-storage-job.local_file.job_template: Creation complete after 0s [id=76871c6ea3e38c3d7a5e9b21970665f0d4f23511]

Case 2: Same behavior observed as defined for Case 1

Case 3: Same behavior observed as defined for Case 1

Collaborator

@sharabiani sharabiani left a comment


LGTM
if @tpdownes @ankitkinra agree with the limit of only one gke-cluster per deployment group: #3406 (comment)

@mohitchaurasia91 mohitchaurasia91 merged commit 81e63e6 into GoogleCloudPlatform:develop Dec 27, 2024
18 of 63 checks passed
@mohitchaurasia91 mohitchaurasia91 deleted the fix-gke-module-provider branch December 27, 2024 07:56
@mohitchaurasia91 mohitchaurasia91 restored the fix-gke-module-provider branch December 30, 2024 08:08
This was referenced Jan 15, 2025
ankitkinra added a commit to ankitkinra/hpc-toolkit that referenced this pull request Jan 21, 2025
…ia91/fix-gke-module-provider"

This reverts commit 81e63e6, reversing
changes made to e17bb15.
@ankitkinra ankitkinra mentioned this pull request Jan 21, 2025
mohitchaurasia91 added a commit to mohitchaurasia91/cluster-toolkit that referenced this pull request Jan 28, 2025
…cker default terraform version 1.5.2 -> 1.5.7