
[SURE-8182] Rancher rolls back to the previous Launch Template version of an EKS cluster after updating it in AWS #469

Closed
mjura opened this issue Apr 16, 2024 · 8 comments

@mjura
Contributor

mjura commented Apr 16, 2024

Issue description:

If you have an EKS cluster that was created in AWS and then imported into Rancher, once you modify the Launch Template version for the Node Group in Rancher, any change you later make to that version on the AWS side is rolled back by Rancher.

The customer sees this behavior in Rancher 2.8.2, and I was able to reproduce it in 2.7.12. I thought the behavior might be expected per the documentation: https://ranchermanager.docs.rancher.com/reference-guides/cluster-configuration/rancher-server-configuration/sync-clusters

The AKSConfig, EKSConfig or GKEConfig represents the desired state. Nil values are ignored. Fields that are non-nil in the config object can be thought of as managed. When a cluster is created in Rancher, all fields are non-nil and therefore managed. When a pre-existing cluster is registered in Rancher all nillable fields are set to nil and aren’t managed. Those fields become managed once their value has been changed by Rancher.
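To make that paragraph concrete, here is a minimal sketch of the difference (field names follow the eksConfig object stored in clusters.management.cattle.io; values are made up for illustration):

# Registered (imported) cluster: nillable fields start as null, so they are
# unmanaged and simply mirror the upstream (AWS) state.
eksConfig:
    imported: true
    region: eu-west-1
    kubernetesVersion: null   # unmanaged: follows UpstreamSpec
    loggingTypes: null        # unmanaged
    nodeGroups: null          # unmanaged

# Cluster created by Rancher: all fields are populated, so all are managed and
# Rancher reconciles the AWS side back to these values.
eksConfig:
    imported: false
    region: eu-west-1
    kubernetesVersion: "1.26"   # managed: AWS-side edits are reverted
    loggingTypes: []            # managed
    nodeGroups:
        - nodegroupName: ng-1   # managed
          desiredSize: 2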

@kkaempf kkaempf added kind/bug Something isn't working JIRA Must shout labels Apr 17, 2024
@salasberryfin salasberryfin changed the title [SURE-8192] Rancher rolls back to the previous Launch Template version of an EKS cluster after updating it in AWS [SURE-8182] Rancher rolls back to the previous Launch Template version of an EKS cluster after updating it in AWS Apr 17, 2024
@cpinjani

Able to reproduce with these steps:

  1. Provision an EKS cluster from Rancher with a node group that uses a user-managed launch template with a custom AMI.
  2. Once the cluster is Active, edit it from Rancher, upgrade the control-plane Kubernetes version, and wait for the upgrade to complete.
  3. To upgrade the node group Kubernetes version, change the launch template version to a newer one from the AWS console.
  4. The version update completes, but the node group is then rolled back to the previous launch template version (see the sketch below).

  • After step 4, if the version update (or step 3) is performed from Rancher instead, it succeeds
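A rough sketch of why the rollback happens, assuming the node group references the launch template through a launchTemplate block in eksConfig (field names and values here are illustrative, not copied from a real cluster object): once Rancher has written a launch template version into eksConfig, that version is managed, so the controller reconciles the node group back to it when AWS reports a newer one.

eksConfig:
    nodeGroups:
        - nodegroupName: custom-ami-ng
          launchTemplate:
              id: lt-0123456789abcdef0   # user-managed launch template with custom AMI
              version: 3                 # pinned (managed) by Rancher
# UpstreamSpec later reports version 4 after the change in the AWS console,
# but the managed value above wins, so the node group is rolled back to 3.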

@LefterisBanos

LefterisBanos commented Apr 19, 2024

Hi,

This issue only started happening after upgrading to version >= 2.8.0.
The issue is not fully described: it does not happen only for the Launch Template version, it happens to every field under EKSConfig in the clusters.management.cattle.io object that Rancher considers a managed field.

eksConfig:
    amazonCredentialSecret: cattle-global-data:cc-k2qss
    displayName: dev01-dev
    ebsCSIDriver: null
    imported: true
    kmsKey: null
    kubernetesVersion: null
    loggingTypes: null
    nodeGroups: null
    privateAccess: null
    publicAccess: null
    publicAccessSources: null
    region: eu-west-1
    secretsEncryption: null
    securityGroups: null
    serviceRole: ""
    subnets: null
    tags: null

We create our EKS clusters with Terraform and use the Rancher2 Terraform provider to import them into Rancher.
So whenever we update any of the eksConfig fields listed above from Terraform, Rancher reverts the change if that field has become managed, i.e. once it has been set from the Rancher UI, any change made outside the UI is rolled back.

As a result we have seen tags being reset, loggingTypes being reset, and launch templates being rolled back, because the changes were applied with Terraform and Rancher does not pick them up; instead, it forces whatever is currently configured in the UI.

According to Rancher docs:

UpstreamSpec represents the cluster as it is in the hosted Kubernetes provider. It's refreshed every 5 minutes. After the UpstreamSpec is refreshed, Rancher checks if the cluster has an update in progress. If it's currently updating, nothing further is done. If it is not currently updating, any managed fields on AKSConfig, EKSConfig or GKEConfig are overwritten with their corresponding value from the recently updated UpstreamSpec.

From the docs, we understood that Rancher should sync from the AWS provider, and not the other way around. And that is the proper approach: Rancher should not force the configuration set in the UI, it should always sync with the provider state first.

@salasberryfin
Contributor

Hello @LefterisBanos, thanks for your detailed message.

We have been investigating this issue and testing different scenarios, and we've noticed that the paragraph from the Rancher docs that you're referring to does not match the actual behavior of the controller in certain situations. The original description of the issue references Launch Templates because that's where we started the investigation, but the behavior extends to other parameters, as you mentioned.

When a cluster is created in AWS and later imported into Rancher, the EKSConfig specification contains a series of null elements. These null values are then filled in from UpstreamSpec, which represents the actual state of the node group. If a change is applied via the Rancher UI, it is reflected in EKSConfig, and any further change to that parameter in UpstreamSpec will be rolled back: the controller cannot identify which of the two is the source of truth and defaults to using EKSConfig as the desired state.
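As a minimal before/after sketch of that flip, using loggingTypes as an example (values are illustrative):

# Before any edit in the Rancher UI: the field is null and just mirrors UpstreamSpec.
eksConfig:
    loggingTypes: null

# After enabling audit logging in the Rancher UI: the field is now managed, so
# turning it off again from the AWS side will be reverted on the next sync.
eksConfig:
    loggingTypes:
        - audit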

One of the scenarios we've tested involves creating a cluster in AWS, importing it into Rancher, and then applying changes via the AWS Console. In this case, any modifications to UpstreamSpec are persisted and we haven't experienced any rollbacks (including changes to launch templates and node group tags). This leads us to believe that the error appears only when the workflow is the following:

  1. Create cluster via AWS.
  2. Import into Rancher.
  3. Apply changes via Rancher.
  4. Apply changes via AWS.

We're updating the docs to reflect this behavior and prevent the current misunderstanding caused by the paragraph you quoted.

Please let us know if you have any other concerns related to this.

@LefterisBanos

LefterisBanos commented Apr 30, 2024

Hi @salasberryfin,

Thank you for your comment.
I am not sure I understand this part of your answer:

This leads us to believe that the error appears only when the workflow is the following:

1. Create cluster via AWS.
2. Import into Rancher.
3. Apply changes via Rancher.
4. Apply changes via AWS.

So you mean that in your tests, after applying changes via Rancher (step 3) and then applying changes via the AWS console (step 4), the changes from step 4 were rolled back?

If that is the case, I am not sure how this can be considered expected behaviour. It goes against IaC practices.

I mean, it is clear that you can modify a cluster from the Rancher UI, but Rancher cannot be forced to become the source of truth or to have higher priority than the AWS console. That way, once you make any minor change from the Rancher UI, your IaC can no longer be used.

Regarding IaC (Terraform), we consider that the AWS console should always be the actual source of truth. Our use case is the following:

  • Create a cluster with IaC.
  • Import the cluster with the Rancher IaC provider.
  • Any change from the Rancher UI hands cluster management over to Rancher.
  • Any change applied from IaC afterwards gets reverted (IaC is not our source of truth anymore).

One more thing: it seems that EksConfig is not properly updated from UpstreamSpec. Please test changing something from the Rancher UI, for example increasing the number of nodes in a node group: the nodeGroups section of EksConfig may be properly updated from UpstreamSpec, but other fields will not be, and will instead end up set to something like an empty list (in the case of the tags, for example). Eventually Rancher removes any tags set on the cluster, because for some reason the tags field of EksConfig was not properly updated from UpstreamSpec (see the sketch below).
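A rough sketch of the tags case, assuming tags is a map as in the eksConfig shown in my earlier comment (values are made up):

# Tags applied upstream via Terraform: team=platform, env=dev.
# After an unrelated edit in the Rancher UI, eksConfig ends up with:
eksConfig:
    tags: {}   # set to an empty map instead of staying null
# On the next sync the empty map is treated as the desired state,
# so the tags that were applied upstream are removed.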

Thank you.

@mjura mjura self-assigned this May 8, 2024
@mjura
Contributor Author

mjura commented May 20, 2024

Let's continue with the regular communication process, through an official support request.

@kkaempf

kkaempf commented Jul 9, 2024

Closing due to no response.

@kkaempf kkaempf closed this as not planned Jul 9, 2024
@LefterisBanos

This issue has a priority set, so I'm not sure why it has been closed.

@kkaempf

kkaempf commented Jul 12, 2024

@LefterisBanos the "community issue" on GitHub is closed; the (internal) support request is still open.
