
Add kubectl provider in root module for blueprint with GKE cluster module setup #3406

Conversation

Contributor

@mohitchaurasia91 mohitchaurasia91 commented Dec 16, 2024

Why do we need this change?

When managing Kubernetes manifests in blueprints that use the kubectl-apply module, completely removing a kubectl-apply block from a blueprint results in a "Provider configuration not present" error.

  - id: workload_manager_install
    source: modules/management/kubectl-apply
    use: [gke_cluster]
    settings:
      kueue:
        install: true
      jobset:
        install: true

  - id: workload_manager_config
    source: modules/management/kubectl-apply
    use: [gke_cluster]
    settings:
      apply_manifests:
      - source: $(ghpc_stage("maxtext-gke-a3-files"))/config-map.yaml.tftpl
        template_vars: {name: "a3plus-benchmark-resources-configmap", num_nodes: "1"}
      - source: $(ghpc_stage("maxtext-gke-a3-files"))/kueue-credentials.yaml.tftpl
        template_vars: {num_chips: "8"}

A similar issue is observed for PersistentVolume and PersistentVolumeClaim resources when provisioning them with the gke-persistent-volume module, which internally uses the kubectl-apply module for Kubernetes manifest deployment.

  - id: data-bucket-pv
    source: modules/file-system/gke-persistent-volume
    use: [gke_cluster, data-bucket]
    settings: {capacity_gb: 5000}
    
  - id: shared-filestore-pv
    source: modules/file-system/gke-persistent-volume
    use: [gke_cluster, filestore]

Removing workload_manager_config or a persistent volume module and then recreating and redeploying gives us the error:

Error: Provider configuration not present

To work with
module.workload_manager_config.module.kubectl_apply_manifests["1"].kubectl_manifest.apply_doc["2"]
(orphan) its original provider configuration at
module.workload_manager_config.provider["registry.terraform.io/gavinbunney/kubectl"]
is required, but it has been removed. This occurs when a provider
configuration is removed while objects created by that provider still exist
in the state. Re-add the provider configuration to destroy
module.workload_manager_config.module.kubectl_apply_manifests["1"].kubectl_manifest.apply_doc["2"]
(orphan), after which you can remove the provider configuration again.

RCA

The issue stems from how the provider is defined. Unlike other Terraform providers, which are typically defined at the root level, the kubectl provider used by kubectl-apply is defined within the modules/management/kubectl-apply/providers.tf file inside the module itself.

This becomes problematic when a blueprint configuration that uses this module is deleted. While the module's folder remains on disk, Terraform effectively loses track of the module and the provider configuration defined inside it. Consequently, Terraform cannot determine which provider configuration to use when destroying the resources (the kubectl_manifest objects under kubectl_apply_manifests) that the removed module left in state.

This behavior is due to Terraform's handling of child modules and their providers. You can find a more detailed explanation in this Stack Overflow answer: https://stackoverflow.com/a/58403262
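
To make the failure mode concrete, here is a minimal, hypothetical sketch of a provider configuration that lives only inside a child module; it is not the Toolkit's actual source, and all variable and module names are illustrative:

# modules/management/kubectl-apply/providers.tf (inside the child module)
provider "kubectl" {
  host                   = var.cluster_endpoint            # illustrative variable names
  cluster_ca_certificate = base64decode(var.cluster_ca_certificate)
  token                  = var.access_token
  load_config_file       = false
}

# Root module of the deployment group
module "workload_manager_config" {
  source = "./modules/management/kubectl-apply"
  # ...
}

# Deleting the module block above means Terraform no longer loads the provider
# "kubectl" block that lives inside the module's folder, yet the kubectl_manifest
# resources it created are still recorded in state. Terraform is then left with
# orphaned resources and no provider configuration to destroy them with, which is
# the "Provider configuration not present" error shown above.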

Solution

Move the kubectl provider to the root module (under the TerraformProviders block) of the blueprint deployment group when the group contains a module that depends on the gke-cluster or pre-existing-gke-cluster module.
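
As an illustration of the intended end state, here is a hedged sketch of the kind of root-level provider configuration that could be generated for a deployment group containing a GKE cluster; the output names on module.gke_cluster and the exact wiring are assumptions, not the Toolkit's verbatim generated code:

terraform {
  required_providers {
    kubectl = {
      source  = "gavinbunney/kubectl"
      version = ">= 1.7.0"
    }
  }
}

data "google_client_config" "default" {}

# Root-level provider configuration: it outlives any individual kubectl-apply or
# gke-persistent-volume module, so orphaned kubectl_manifest resources can still
# be destroyed after their module is removed from the blueprint.
provider "kubectl" {
  host                   = "https://${module.gke_cluster.cluster_endpoint}"        # assumed output name
  cluster_ca_certificate = base64decode(module.gke_cluster.cluster_ca_certificate) # assumed output name
  token                  = data.google_client_config.default.access_token
  load_config_file       = false
  apply_retry_count      = 15
}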

Backward Compatibility Test

Tested this out for the example blueprints (storage-gke.yaml, gke-a3-highgpu.yaml); the steps were:

  • On the develop branch, perform gcluster deploy for the example blueprint.
  • Switch over to the feature branch and run make to update the gcluster binary.
  • Perform gcluster deploy; a warning was shown that the past deployment was performed on a different gcluster version, suggesting to use --force at the user's discretion.
Creating deployment folder "gke-a3-highgpu" ...
Error: ghpc_version has changed from "45068640-dirty" to "90aa74bb-dirty", using different versions of GHPC to update a live deployment is not officially supported
Hint: Use `--force` to overwrite the deployment anyway. Proceed at your own risk.
  • Used the --force flag and performed the deployment from the feature branch; it worked as expected.
  • Only recreation of kubectl-apply module related resources was required.
  • The scenario of removing and recreating the workload_manager_config and persistent volume modules worked as expected.

How did we test this out ?

  • Use the same setup as in the issue description. Remove the workload_manager_config module, then recreate and redeploy. Verify that the error does not occur and the infrastructure is removed as expected.
  • Use the same setup as in the issue description. Remove the persistent volume module, then recreate and redeploy. Verify that the error does not occur and the infrastructure is removed as expected.
  • Test for backward compatibility.
  • PR cloud builds for all GKE-related blueprints should have OK status.

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@mohitchaurasia91 mohitchaurasia91 added the release-module-improvements Added to release notes under the "Module Improvements" heading. label Dec 16, 2024
@mohitchaurasia91 mohitchaurasia91 changed the title Add kubectl provider in root module for blueprint with gke cluster module Add kubectl provider in root module for blueprint with GKE cluster module setup Dec 16, 2024
@mohitchaurasia91 mohitchaurasia91 self-assigned this Dec 16, 2024
@mohitchaurasia91 mohitchaurasia91 marked this pull request as ready for review December 17, 2024 06:13
Contributor

@ankitkinra ankitkinra left a comment


Thanks for testing this out thoroughly. Can you please add whether you performed the following tests:

Example blueprint with PV (storage)

Case 1

  • First deploy of the blueprint on the develop branch (without the changes) with PV
  • Switch branch and do make
  • Second deploy of the blueprint without any changes, which would typically not result in any diff

Result: Did it cause recreation of kubectl-created modules? Specifically, if the blueprint had storage, did it cause data loss?

Case 2

  • First deploy of the blueprint on the develop branch (without the changes) with PV
  • Switch branch and do make
  • Second deploy of the blueprint with only a nodepool scale-up, which would typically not result in recreation of any module.

Result: Did it cause recreation of kubectl-created modules? Specifically, if the blueprint had storage, did it cause data loss?

Case 3

  • First deploy of the blueprint on the develop branch (without the changes) with PV
  • Switch branch and do make
  • Second deploy of the blueprint without any changes, which would typically not result in any diff
  • Third deploy of the above blueprint after completion of the second deploy

Result: Did the third deploy of the blueprint cause any updates / loss of data?

Contributor Author

mohitchaurasia91 commented Dec 18, 2024


Thanks for adding a detailed description of test cases to further consider. Following are the results from those tests, using the storage-gke.yaml example blueprint.

Case 1: No diff observed (except for the local file output of the gke-job-template module, where we use the random_id provider). Kubernetes objects (PV and PVC) and data (Filestore and GCS) persisted across the migration; no data loss was observed.

Summary of proposed changes: Plan: 6 to add, 0 to change, 2 to destroy.
Running terraform apply on deployment group storage-gke-03/primary
module.ephemeral-storage-job.random_id.resource_name_suffix: Destroying... [id=pgE]
module.shared-fs-job.random_id.resource_name_suffix: Destroying... [id=Zd4]
module.shared-fs-job.random_id.resource_name_suffix: Destruction complete after 0s
module.ephemeral-storage-job.random_id.resource_name_suffix: Destruction complete after 0s
module.ephemeral-storage-job.random_id.resource_name_suffix: Creating...
module.shared-fs-job.random_id.resource_name_suffix: Creating...
module.ephemeral-storage-job.random_id.resource_name_suffix: Creation complete after 0s [id=VEM]
module.shared-fs-job.random_id.resource_name_suffix: Creation complete after 1s [id=kFA]
module.data-bucket-pv.local_file.debug_file: Creating...
module.data-bucket-pv.local_file.debug_file: Creation complete after 0s [id=d69386e3400928e6f3956408ce8f9a4b69360112]
module.shared-filestore-pv.local_file.debug_file: Creating...
module.shared-filestore-pv.local_file.debug_file: Creation complete after 0s [id=4f05710324270aa5b85fc45a937d278b1552086d]
module.shared-fs-job.local_file.job_template: Creating...
module.shared-fs-job.local_file.job_template: Creation complete after 0s [id=242d0b096cea5345338b0602f1d562148ada3df5]
module.ephemeral-storage-job.local_file.job_template: Creating...
module.ephemeral-storage-job.local_file.job_template: Creation complete after 0s [id=76871c6ea3e38c3d7a5e9b21970665f0d4f23511]

Case 2: Same behavior observed as defined for Case 1

Case 3: Same behavior observed as defined for Case 1

Collaborator

@sharabiani sharabiani left a comment


LGTM
if @tpdownes @ankitkinra agree with the limit of only one gke-cluster per deployment group: #3406 (comment)

@mohitchaurasia91 mohitchaurasia91 merged commit 81e63e6 into GoogleCloudPlatform:develop Dec 27, 2024
18 of 63 checks passed
@mohitchaurasia91 mohitchaurasia91 deleted the fix-gke-module-provider branch December 27, 2024 07:56
@mohitchaurasia91 mohitchaurasia91 restored the fix-gke-module-provider branch December 30, 2024 08:08
This was referenced Jan 15, 2025
ankitkinra added a commit to ankitkinra/hpc-toolkit that referenced this pull request Jan 21, 2025
…ia91/fix-gke-module-provider"

This reverts commit 81e63e6, reversing
changes made to e17bb15.
@ankitkinra ankitkinra mentioned this pull request Jan 21, 2025
mohitchaurasia91 added a commit to mohitchaurasia91/cluster-toolkit that referenced this pull request Jan 28, 2025
…cker default terraform version 1.5.2 -> 1.5.7