
[SURE-8182] Rancher rolls back to the previous Launch Template version of an EKS cluster after updating it in AWS #469

Closed
mjura opened this issue Apr 16, 2024 · 8 comments

@mjura
Contributor

mjura commented Apr 16, 2024

Issue description:

If you have an EKS cluster that was created in AWS and then imported into Rancher, once you modify the Launch Template version for the Node Group in Rancher, any change you later make to that version on the AWS side is rolled back by Rancher.

The customer sees this behavior in Rancher 2.8.2, and I was able to reproduce it in 2.7.12. I thought the behavior might be expected per the documentation: https://ranchermanager.docs.rancher.com/reference-guides/cluster-configuration/rancher-server-configuration/sync-clusters

The AKSConfig, EKSConfig or GKEConfig represents the desired state. Nil values are ignored. Fields that are non-nil in the config object can be thought of as managed. When a cluster is created in Rancher, all fields are non-nil and therefore managed. When a pre-existing cluster is registered in Rancher all nillable fields are set to nil and aren’t managed. Those fields become managed once their value has been changed by Rancher.
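To make that paragraph concrete, here is a minimal sketch of the difference (field names follow the eksConfig object stored in clusters.management.cattle.io; values are made up for illustration):

# Registered (imported) cluster: nillable fields start as null, so they are
# unmanaged and simply mirror the upstream (AWS) state.
eksConfig:
    imported: true
    region: eu-west-1
    kubernetesVersion: null   # unmanaged: follows UpstreamSpec
    loggingTypes: null        # unmanaged
    nodeGroups: null          # unmanaged

# Cluster created by Rancher: all fields are populated, so all are managed and
# Rancher reconciles the AWS side back to these values.
eksConfig:
    imported: false
    region: eu-west-1
    kubernetesVersion: "1.26"   # managed: AWS-side edits are reverted
    loggingTypes: []            # managed
    nodeGroups:
        - nodegroupName: ng-1   # managed
          desiredSize: 2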

@kkaempf kkaempf added kind/bug Something isn't working JIRA Must shout labels Apr 17, 2024
@salasberryfin salasberryfin changed the title [SURE-8192] Rancher rolls back to the previous Launch Template version of an EKS cluster after updating it in AWS [SURE-8182] Rancher rolls back to the previous Launch Template version of an EKS cluster after updating it in AWS Apr 17, 2024
@cpinjani

Able to reproduce with these steps:

  1. Provision an EKS cluster from Rancher with a node group that uses a user-managed launch template with a custom AMI.
  2. Once the cluster is Active, edit it from Rancher, upgrade the control-plane Kubernetes version, and wait for the upgrade to complete.
  3. To upgrade the node group Kubernetes version, change the launch template version to a newer one from the AWS console.
  4. The version update completes, but the node group is then rolled back to the previous launch template version (see the sketch below).

  • After step 4, if the version update (or step 3) is performed from Rancher instead, it succeeds
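A rough sketch of why the rollback happens, assuming the node group references the launch template through a launchTemplate block in eksConfig (field names and values here are illustrative, not copied from a real cluster object): once Rancher has written a launch template version into eksConfig, that version is managed, so the controller reconciles the node group back to it when AWS reports a newer one.

eksConfig:
    nodeGroups:
        - nodegroupName: custom-ami-ng
          launchTemplate:
              id: lt-0123456789abcdef0   # user-managed launch template with custom AMI
              version: 3                 # pinned (managed) by Rancher
# UpstreamSpec later reports version 4 after the change in the AWS console,
# but the managed value above wins, so the node group is rolled back to 3.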

@LefterisBanos

LefterisBanos commented Apr 19, 2024

Hi,

This issue only started happening after upgrading to version >= 2.8.0.
The issue is not fully described: it does not happen only for the Launch Template version, it happens to every field under EKSConfig in the clusters.management.cattle.io object that Rancher considers a managed field.

eksConfig:
    amazonCredentialSecret: cattle-global-data:cc-k2qss
    displayName: dev01-dev
    ebsCSIDriver: null
    imported: true
    kmsKey: null
    kubernetesVersion: null
    loggingTypes: null
    nodeGroups: null
    privateAccess: null
    publicAccess: null
    publicAccessSources: null
    region: eu-west-1
    secretsEncryption: null
    securityGroups: null
    serviceRole: ""
    subnets: null
    tags: null

We create our EKS clusters with Terraform and use the Rancher2 Terraform provider to import them into Rancher.
So whenever we update any of the eksConfig fields listed above from Terraform, Rancher reverts the change if that field has become managed, i.e. once it has been set from the Rancher UI, any change made outside the UI is rolled back.

As a result we have seen tags being reset, loggingTypes being reset, and launch templates being rolled back, because the changes were applied with Terraform and Rancher does not pick them up; instead, it forces whatever is currently configured in the UI.

According to Rancher docs:

UpstreamSpec represents the cluster as it is in the hosted Kubernetes provider. It's refreshed every 5 minutes. After the UpstreamSpec is refreshed, Rancher checks if the cluster has an update in progress. If it's currently updating, nothing further is done. If it is not currently updating, any managed fields on AKSConfig, EKSConfig or GKEConfig are overwritten with their corresponding value from the recently updated UpstreamSpec.

From the docs, we understood that Rancher should sync from the AWS provider, and not the other way around. And that is the proper approach: Rancher should not force the configuration set in the UI, it should always sync with the provider state first.

@salasberryfin
Contributor

Hello @LefterisBanos, thanks for your detailed message.

We have been investigating this issue and testing different scenarios, and we've noticed that the paragraph from the Rancher docs that you're referring to does not match the actual behavior of the controller in certain situations. The original description of the issue references Launch Templates because that's where we started the investigation, but the behavior extends to other parameters, as you mentioned.

When a cluster is created in AWS and later imported into Rancher, the EKSConfig specification contains a series of null elements. These null values are then filled in from UpstreamSpec, which represents the actual state of the node group. If a change is applied via the Rancher UI, it is reflected in EKSConfig, and any further change to that parameter in UpstreamSpec will be rolled back: the controller cannot identify which of the two is the source of truth and defaults to using EKSConfig as the desired state.
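As a minimal before/after sketch of that flip, using loggingTypes as an example (values are illustrative):

# Before any edit in the Rancher UI: the field is null and just mirrors UpstreamSpec.
eksConfig:
    loggingTypes: null

# After enabling audit logging in the Rancher UI: the field is now managed, so
# turning it off again from the AWS side will be reverted on the next sync.
eksConfig:
    loggingTypes:
        - audit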

One of the scenarios we've tested involves creating a cluster in AWS, importing it into Rancher, and then applying changes via the AWS Console. In this case, any modifications to UpstreamSpec are persisted and we haven't experienced any rollbacks (including changes to launch templates and node group tags). This leads us to believe that the error appears only when the workflow is the following:

  1. Create cluster via AWS.
  2. Import into Rancher.
  3. Apply changes via Rancher.
  4. Apply changes via AWS.

We're updating the docs to reflect this behavior and prevent the current misunderstanding caused by the paragraph you quoted.

Please let us know if you have any other concerns related to this.

@LefterisBanos

LefterisBanos commented Apr 30, 2024

Hi @salasberryfin,

Thank you for your comment.
I am not sure I understand this part of your answer:

This leads us to believe that the error appears only when the workflow is the following:

1. Create cluster via AWS.
2. Import into Rancher.
3. Apply changes via Rancher.
4. Apply changes via AWS.

So you mean that in your tests, after applying changes via Rancher (step 3) and then applying changes via the AWS console (step 4), the changes from step 4 were rolled back?

If that is the case, I am not sure how this can be considered expected behaviour. It goes against IaC practices.

I mean, it is clear that you can modify a cluster from the Rancher UI, but Rancher cannot be forced to become the source of truth or to have higher priority than the AWS console. That way, once you make any minor change from the Rancher UI, your IaC can no longer be used.

Regarding IaC (Terraform), we consider that the AWS console should always be the actual source of truth. Our use case is the following:

  • Create a cluster with IaC.
  • Import the cluster with the Rancher IaC provider.
  • Any change from the Rancher UI hands cluster management over to Rancher.
  • Any change applied from IaC afterwards gets reverted (IaC is not our source of truth anymore).

One more thing: it seems that EksConfig is not properly updated from UpstreamSpec. Please test changing something from the Rancher UI, for example increasing the number of nodes in a node group: the nodeGroups section of EksConfig may be properly updated from UpstreamSpec, but other fields will not be, and will instead end up set to something like an empty list (in the case of the tags, for example). Eventually Rancher removes any tags set on the cluster, because for some reason the tags field of EksConfig was not properly updated from UpstreamSpec (see the sketch below).
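A rough sketch of the tags case, assuming tags is a map as in the eksConfig shown in my earlier comment (values are made up):

# Tags applied upstream via Terraform: team=platform, env=dev.
# After an unrelated edit in the Rancher UI, eksConfig ends up with:
eksConfig:
    tags: {}   # set to an empty map instead of staying null
# On the next sync the empty map is treated as the desired state,
# so the tags that were applied upstream are removed.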

Thank you.

@mjura mjura self-assigned this May 8, 2024
@mjura
Contributor Author

mjura commented May 20, 2024

Let's continue with the regular communication process, through an official support request.

@kkaempf

kkaempf commented Jul 9, 2024

Closing due to no response.

@kkaempf kkaempf closed this as not planned Jul 9, 2024
@LefterisBanos

This issue has a priority set, so I'm not sure why it has been closed.

@kkaempf

kkaempf commented Jul 12, 2024

@LefterisBanos the "community issue" on GitHub is closed; the (internal) support request is still open.
