
[EKS] Cloudformation support for cluster upgrades #115

Closed
dcherman opened this issue Jan 16, 2019 · 29 comments
Labels
EKS Amazon Elastic Kubernetes Service Proposed Community submitted issue

Comments

@dcherman

Tell us about your request
What do you want us to build?

Support for upgrading an existing EKS cluster provisioned by CloudFormation rather than requiring replacement

Which service(s) is this request for?
EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
What outcome are you trying to achieve, ultimately, and why is it hard/impossible to do right now? What is the impact of not having this problem solved? The more details you can provide, the better we'll be able to understand and solve the problem.

I'm trying to upgrade an EKS cluster between versions without replacing the cluster, which introduces risk since the behavior of replacement is not well defined (i.e., is the etcd state migrated? Backups? What about requests that might be in-flight when the changeover happens?). The existing behavior would also likely require rolling the worker nodes, since the cluster API endpoint would change, unless you put it behind a CNAME or something.

Instead, CloudFormation should simply upgrade the cluster via the API that is already available for doing so and which both the AWS CLI and Terraform support.
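
For reference, a rough sketch of that existing upgrade path using the AWS CLI (the cluster name here is just a placeholder):

```sh
# Check the current control plane version
aws eks describe-cluster --name my-cluster \
  --query cluster.version --output text

# Start an in-place control plane upgrade. This is the call
# CloudFormation could make internally instead of replacing the cluster.
aws eks update-cluster-version --name my-cluster \
  --kubernetes-version 1.12

# Track the progress of the update
aws eks list-updates --name my-cluster
```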

Are you currently working around this issue?

Yes

How are you currently solving this problem?

Managing the EKS cluster with Terraform

Additional context
Anything else we should know?

Attachments
If you think you might have additional information that you'd like to include via an attachment, please do - we'll take a look. (Remember to remove any personally-identifiable information.)

@dcherman dcherman added the Proposed Community submitted issue label Jan 16, 2019
@tabern tabern added the EKS Amazon Elastic Kubernetes Service label Jan 16, 2019
@tabern
Contributor

tabern commented Jan 16, 2019

@dcherman EKS supports in-place cluster upgrades via the EKS API (https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html) and worker node updates via CloudFormation (https://docs.aws.amazon.com/eks/latest/userguide/update-stack.html) - shipped as part of #21.

Does this resolve your issue or are you thinking of additional/different functionality for cluster upgrades?

@dcherman
Author

@tabern So part of what you can do with CloudFormation is specify the Kubernetes version that you want. If you change that in your template and re-apply the stack, it requires replacement of the resource.

What I'm proposing is that Cloudformation should use the EKS API internally to perform these upgrades rather than replacing the resource.
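
For illustration, a minimal sketch of the relevant template fragment (the logical IDs, role, and subnet IDs are placeholders):

```yaml
Resources:
  EksCluster:
    Type: AWS::EKS::Cluster
    Properties:
      Name: my-cluster
      # Changing this value (e.g. "1.11" -> "1.12") currently causes
      # CloudFormation to replace the cluster; the proposal is for it
      # to call the EKS update API in place instead.
      Version: "1.11"
      RoleArn: !GetAtt EksServiceRole.Arn
      ResourcesVpcConfig:
        SubnetIds:
          - subnet-aaaa1111
          - subnet-bbbb2222
```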

@tabern
Contributor

tabern commented Jan 16, 2019

Got it - so the idea is that you can update an entire cluster, including nodes, with a single CF stack update?

@dcherman
Author

Exactly; I want to avoid creating and updating clusters using different methods, since the CloudFormation template is no longer the source of truth if you're updating the cluster outside of it.

@janerikcarlsen

+1 for this

@tabern
Contributor

tabern commented Jan 16, 2019

Ok - good info. Thanks!

@christopherhein

christopherhein commented Jan 22, 2019

@dcherman if you have used eksctl to provision your clusters, we're actively working on the list of features necessary to make this a little easier; there may be some overlap where you could use the APIs being built (or already built) to reduce some of this work.

Check out - eksctl-io/eksctl#348 for more details.

@cdenneen

@christopherhein is the goal/recommendation from Amazon for people to use eksctl to create and manage EKS clusters instead of CloudFormation?

@dcherman
Author

@christopherhein I'm actively monitoring eksctl and am actually looking for ways to contribute there :)

That said, eksctl builds clusters using CloudFormation internally, and this issue was actually filed as a result of a discussion we were having in the #eks Slack channel about using eksctl with GitOps. So this is actually a prerequisite to implementing cluster upgrades (correctly) in eksctl without having it draw outside the lines of CloudFormation and hit the upgrade API directly.

@christopherhein

> @christopherhein is the goal/recommendation from Amazon for people to use eksctl to create and manage EKS clusters instead of CloudFormation?

It's an option; we have contributed a handful of things to eksctl and have been working closely with Weaveworks and @errordeveloper on the project. There are other ways too: for example, if your organization uses Terraform, there are HashiCorp-supported ways of deploying.

@dcherman check out eksctl-io/eksctl#19 if you haven't seen it; at one point we were discussing using the Cluster API functions to support this style of deployment. There's still a lot to do if you want to help. :)

@cdenneen

@christopherhein understood that it's an option; I just wanted to confirm it isn't a replacement. The OP asked for a better way to handle cluster updates in CloudFormation, and I didn't want that to get lost. I'm intrigued that AWS has invested enough in eksctl that it might be preferred over even using CF directly... something to ponder 🤔 thanks as always!

@tabern tabern added this to We're Working On It in containers-roadmap Jan 30, 2019
@tabern tabern changed the title from "[EKS] [Cloudformation]: Improve cluster upgrades" to "[EKS] Cloudformation support for cluster upgrades" Jan 30, 2019
@tabern tabern moved this from We're Working On It to Researching in containers-roadmap Jan 30, 2019
@tabern tabern moved this from Researching to We're Working On It in containers-roadmap Jan 30, 2019
@tabern tabern moved this from We're Working On It to Coming Soon in containers-roadmap Feb 15, 2019
@tabern tabern removed the Proposed Community submitted issue label Feb 19, 2019
@tabern tabern closed this as completed Mar 29, 2019
@tabern tabern moved this from Coming Soon to Just Shipped in containers-roadmap Mar 29, 2019
@geerlingguy

@tabern - To clarify, do you mean that if I am managing my EKS cluster in CloudFormation and it is on 1.11 right now... then when I update the template to 1.12 and also update my EKS ASG nodegroup AMI to the latest version, it will be Kubernetes-aware, and will upgrade the nodes and add taints like NoExecute automatically, so Pods drop off the terminating instances automatically?

Or will it just be the equivalent of pressing the 'upgrade cluster' button in the console, where it just controls the upgrade/rollout on the masters?

Because this ticket was specifically intended (at least in my reading?) to be the former. The latter is great and all... but there's still a huge pain point in rolling out the AMI update, because we basically have to build our own automation to cleanly and safely upgrade the EKS NodeGroup.

@whereisaaron

@geerlingguy pretty sure the CF upgrade only applies to the control plane. I notice when AWS says 'cluster' they are often only thinking of the bit they manage! 😄 And I think that was what this ticket was about, because before with CF, if you changed the EKS version, CF would delete and recreate the control plane. Not what you want! 😢

For the worker nodes, users can do anything they want, including using custom AMIs, so it wouldn't easily be possible for CF to identify the AMI to use to upgrade nodes in the general case. CF/EKS doesn't actually know which ASGs are relevant to the cluster, just which instances have registered, further complicating any possible upgrade.

One option, not just for EKS, is to bring up a new, upgraded node group ASG. Then once it is stable, drain the old node group nodes, and then delete that node group ASG. If you are using eksctl you can do this with roughly:

```sh
eksctl create nodegroup --cluster foo --name new
eksctl drain nodegroup --cluster foo --name old
eksctl delete nodegroup --cluster foo --name old
```

There is also discussion of adding a --replaces option to create nodegroup so that can all be a one-step process. eksctl-io/eksctl#443

If you just want to update the AMI in your ASG and let it roll and update, then you can run an auto-drain DaemonSet like kube-aws uses. It watches for ASG and Spot Fleet terminations and auto-drains the node before it actually gets terminated. With that in place you can do a regular ASG rolling update of the AMI.
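
At its core, what such a DaemonSet runs when it sees a termination notice is just a drain; a rough sketch (the node name variable and grace period are illustrative):

```sh
# On an ASG or Spot Fleet termination notice for this node, cordon it
# and evict its pods so they get rescheduled elsewhere before shutdown
kubectl drain "$NODE_NAME" \
  --ignore-daemonsets \
  --delete-local-data \
  --force \
  --grace-period=120
```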

@tabern tabern added the Proposed Community submitted issue label Mar 29, 2019
@tabern
Contributor

tabern commented Mar 29, 2019

@geerlingguy The feature that we shipped today is the latter. When you update the version via CloudFormation, it triggers the UpdateClusterVersion API to begin the cluster update process.

> Instead, CloudFormation should simply upgrade the cluster via the API that is already available for doing so and which both the AWS CLI and Terraform support.

What you describe makes a lot of sense:

> then I update the template to 1.12, and also update my EKS ASG nodegroup AMI to the latest version, it will be Kubernetes-aware, and will upgrade the nodes and add taints like NoExecute automatically, so Pods drop off the terminating instances automatically

The functionality you (and @whereisaaron) are describing is a bit more complex and is most similar to #139

@vincentheet

I'm having an issue with the current implementation of this feature.

Scenario 1
The default behaviour of the CloudFormation EKSCluster resource is to create a cluster with the latest available Kubernetes version if no Version property is specified. If a later version of the CloudFormation template then explicitly specifies the version the cluster is already on, the stack update fails with the error "No updates are to be performed" on the EKSCluster resource. The error itself is correct, but it means we are unable to lock down the version in a newer CloudFormation template.
CF Templates: https://gist.github.com/vincentheet/e826e39d0c47cdb79310866cccce2acd

Scenario 2
If you initially create an EKSCluster with the Version property set to 1.11 and then update the cluster to 1.12 with a new CloudFormation template, the stack can end up in an erroneous/deadlocked state if another resource in the new template forces the whole CF stack to roll back. When the EKSCluster resource is successfully upgraded from 1.11 to 1.12 but another resource in the same CF stack fails to update, the EKSCluster tries to roll back. The rollback on the EKSCluster fails with the error "Update failed because of Kubernetes version is required". Since this rollback is not supported by EKS, the CF stack ends up in an error state. When you then try to roll out a fixed/correct CF template, the EKSCluster update fails because it is already updated, with the error: "The following resources failed to update: EksCluster"
CF Templates: https://gist.github.com/vincentheet/f4047c3bb1461d9f05430cea1b74d681

Suggested solution
When an EKSCluster resource is requested by CloudFormation to update its version, please verify whether the cluster is already on the requested version. For example, if the EKSCluster is already on 1.12, ignore the update request and report success to CloudFormation instead of an error. That way, the other resources in the same CloudFormation stack can still be updated.
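
The check I'm suggesting, sketched with the AWS CLI (cluster name and version are placeholders):

```sh
DESIRED_VERSION="1.12"
CURRENT_VERSION=$(aws eks describe-cluster --name my-cluster \
  --query cluster.version --output text)

if [ "$CURRENT_VERSION" = "$DESIRED_VERSION" ]; then
  # Already on the requested version: report success, not an error
  echo "No update needed"
else
  aws eks update-cluster-version --name my-cluster \
    --kubernetes-version "$DESIRED_VERSION"
fi
```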

FYI: I opened a case with support, but they mentioned it would be good if I posted my issue here as well.

It would be great if this issue could be fixed.

@rtkgjacobs

rtkgjacobs commented Apr 9, 2019

> @vincentheet: I'm having an issue with the current implementation of this feature.

This is also impacting our automated rollout of EKS 1.12 (and controlling our version deployment automation). We are experiencing the same issue as shown above.

+1

@qthuy

qthuy commented May 9, 2019

I was able to reproduce @vincentheet's scenario 2, and you cannot update the stack anymore once it is in this state.

Here is the error I see.

Update failed because of Unsupported Kubernetes minor version update from 1.12 to 1.12 (Service: AmazonEKS; Status Code: 400; Error Code: InvalidParameterException)

@qthuy

qthuy commented May 9, 2019

@tabern Do we need to create a new issue since this one is closed for it to be addressed?

@tabern
Contributor

tabern commented Jul 31, 2019

@qthuy sorry for the delay on this - we are taking a look at this

@midN

midN commented Sep 18, 2019

How is this still a thing, wth? It's a major bug, reported almost a year ago now, and still not fixed.

This breaks CF stacks that launch EKS clusters in a really bad way.
Leaving this unfixed for a year is just forcing your customers to look for alternatives to CF such as Terraform.

@jia2

jia2 commented Sep 22, 2019

I was also able to reproduce @vincentheet's scenario 2, and cannot update the stack anymore once it is in this state.

Update failed because of Unsupported Kubernetes minor version update from 1.13 to 1.13 (Service: AmazonEKS; Status Code: 400; Error Code: InvalidParameterException)

This error should not cause the stack update to fail.

@midN

midN commented Sep 22, 2019

I contacted support, and they pretty much told me to use other tools to manage EKS 🤦‍♂ I guess that's how much AWS is going to focus on CFN; time to drop it completely.

@iAnomaly

@tabern Any updates? Can this issue please be reopened while you/AWS are/is investigating?

@tabern
Contributor

tabern commented Sep 26, 2019

@iAnomaly @vincentheet @jia2 can you please open a new issue to track this? My understanding here is that the CFN template may not be looking at the patch version during the update and is thus failing. I want to apologize for any anguish this has caused. We want CFN to be a first-class citizen for EKS, and we have work lined up for end of 2019/early 2020 to address this and other areas where we can improve CFN's capabilities for managing EKS clusters.

@vincentheet

@tabern I opened a new issue as requested: #497. It's good to hear that CFN support is going to be improved.

@tabern
Contributor

tabern commented Sep 26, 2019

Thanks @vincentheet - I'll pull that onto the roadmap and we can track status there.

@vpuria

vpuria commented Apr 6, 2020

> @dcherman EKS supports in-place cluster upgrades via the EKS API (https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html) and worker node updates via CloudFormation (https://docs.aws.amazon.com/eks/latest/userguide/update-stack.html) - shipped as part of #21.
>
> Does this resolve your issue or are you thinking of additional/different functionality for cluster upgrades?

Hello @tabern,

I am planning to deploy an EKS cluster with the Quick Start, but I want to know about future upgrade-related problems and changes in the environment. How do I do further upgrades and migrations?

@amitkatyal

@vpuria, I've deployed an EKS cluster with the Quick Start and am facing issues with the upgrade.
Are you able to do further upgrades using the Quick Start?
