This repository has been archived by the owner on Sep 4, 2021. It is now read-only.

Cluster upgrades #608

Closed
wants to merge 4 commits

Conversation

@colhom (Contributor) commented Aug 9, 2016

Complete and working upgrade path for kube-aws clusters, minus the discrete etcd cluster instances.

As part of this, we now have external CA support for TLS asset generation, along with support for allowing the user to generate all TLS assets themselves.
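For anyone who wants to try the external CA path, a rough sketch of generating a CA locally with openssl (file names here are placeholders, and the exact flags kube-aws expects for an external CA may differ):

$ openssl genrsa -out ca-key.pem 2048
$ openssl req -x509 -new -nodes -key ca-key.pem -days 365 -out ca.pem -subj "/CN=kube-ca"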

Fixes #104 #161
Depends on #544 #596
Follow-up: #465

Unfortunately, this does not support upgrading clusters that have already launched. --edit-- By "already launched", I mean clusters created by kube-aws code prior to this functionality merging.

@mumoshu I'd like to get your work on node draining on shutdown integrated as well.

\cc @plange @whereisaaron @robszumski @sym3tri @bfallik

Ref #340 #230 #161

@pieterlange

This is awesome! I will have to make some time to test this (along with the HA stuff).

@bfallik (Contributor) commented Aug 10, 2016

Looking forward to testing this!

@colhom (Contributor Author) commented Aug 10, 2016

@bfallik once #465 is rebased on this PR, you'll have pods drained off nodes before they shut down and are destroyed.

Cluster upgrades are entirely functional here, but keep in mind that the nodes will be shut down ungracefully, and consequently requests will be routed to the pod IPs for some amount of time after the containers have disappeared.
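Until that lands, a rough manual equivalent before a node is replaced would be something like this (node name is a placeholder):

$ kubectl drain <node-name> --ignore-daemonsets   # evict pods and mark the node unschedulable
$ kubectl uncordon <node-name>                    # only if the node comes back into service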

@robszumski (Member)

At a high level, I'd like to think more about these commands and flags.

Current (this PR)

The current method in this PR:

$ kube-aws render
$ git diff # view changes to rendered assets
$ kube-aws up --update
$ kube-aws render --generate-credentials
$ kube-aws up --update

This retains the two primary commands that we are used to, but makes them much more complicated. `up` really doesn't do the same thing as before, where the user was taught that it brings up a complete stack. Now it either modifies an existing stack or creates a new one, depending on this flag.

Proposal

I propose changing these names. Here are the same scenarios:

$ kube-aws render stack
$ git diff # view changes to rendered assets
$ kube-aws update stack
$ kube-aws render credentials
$ kube-aws update credentials

Note the use of the same subcommands for each. It makes it easier to teach the terms and pieces that are involved.

This PR does a great job of separating out the render vs. update parts; this proposal retains that and makes it even more explicit.

Backwards Compatibility

For backwards compatibility, we can alias (but not document) the render command from the last release:

$ kube-aws render         # v0.8.1
$ kube-aws render stack   # master

@@ -0,0 +1,40 @@
# kube-aws cluster updates
Member

To fit the naming scheme, can we name this doc kubernetes-on-aws-updates.md?

@colhom (Contributor Author) commented Aug 10, 2016

@robszumski great suggestion! I'll think it over more thoroughly while implementing it, but sgtm and I'll move forward with what you have outlined. I was unsure of what to do with the command tree... thanks for figuring it out.

## Types of cluster update
There are two distinct categories of cluster update.

* **Parameter-level update**: Only changes to `cluster.yaml` and/or TLS assets in `credentials/` folder are reflected. To enact this type of update. Modifications to CloudFormation or cloud-config userdata templates will not be reflected. In this case, you do not have to re-render:
Contributor

"To enact this type of update."?

Contributor Author

yeah, should probably add that.
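For reference, with the flags as they currently stand in this PR, a parameter-level update is just the following (no re-render needed; the exact commands may change if the command tree gets reworked as proposed above):

$ $EDITOR cluster.yaml    # and/or swap out TLS assets under credentials/
$ kube-aws up --update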

@colhom (Contributor Author) commented Aug 29, 2016

@pieterlange any news on the calico problem you encountered with this PR?

colhom pushed a commit to colhom/coreos-kubernetes that referenced this pull request Aug 30, 2016
@colhom (Contributor Author) commented Aug 30, 2016

@pieterlange I've cherry-picked in your commit

colhom pushed a commit to colhom/coreos-kubernetes that referenced this pull request Aug 30, 2016
@colhom (Contributor Author) commented Aug 31, 2016

@robszumski check out b4d05dc

@robszumski (Member)

@robszumski check out b4d05dc

Nice, lookin' good!

@iwarp commented Sep 7, 2016

How far away is this from being merged? I'm keen to start using this.

Commits added:

  • Does rolling replacement update on controller ASG, followed by workers
  • Punts on upgrading the etcd cluster; simply makes sure resource definitions don't change after create
  • render command now operates on stack and credentials independently
  • add top-level update command
@colhom (Contributor Author) commented Sep 8, 2016

The last two commits are this PR. The prior two are for #596 and #544.

@colhom (Contributor Author) commented Sep 8, 2016

@iwarp we're working on getting this code reviewed! Sorry for the delay

If you're really keen to start using it, it should all be functional if you pull from colhom:cluster-upgrades.
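Roughly (then build kube-aws from that tree as you normally would):

$ git clone https://github.com/colhom/coreos-kubernetes.git
$ cd coreos-kubernetes
$ git checkout cluster-upgrades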

@colhom (Contributor Author) commented Sep 9, 2016

Note to self: I also need to add UpdatePolicy stanzas to Subnets and VPC prohibiting updates to them. To update a subnet or VPC with CloudFormation, the whole deployment is essentially replicated in a different availability zone. kube-aws in general will not be able to support this in the near future.

@colhom mentioned this pull request Sep 9, 2016
@pieterlange

I've been using the colhom:cluster-upgrades branch (plus some minor patches for an external etcd cluster) for a little over a week now and it works great. Updating the stack works as expected!

Minor note: the update policy for the worker autoscaler might need a small increase from the default 2 minutes, depending on app startup time. The Kubernetes master also has a brief window where it's unavailable, but everything recovers just fine.

@iamsaso commented Sep 29, 2016

Any updates on this? Would love to start using it 🚀

@colhom (Contributor Author) commented Sep 29, 2016

An update for all interested parties:

We'll be merging this functionality (along with some of @mumoshu's work regarding node draining) into an experimental branch in the near future. Work is encouraged in that direction, though the officially released code in master will not receive this functionality. Going forward, the goal is to orchestrate these critical behaviors via the Kubernetes control plane, rather than via CloudFormation.

@mumoshu (Contributor) commented Sep 29, 2016

@colhom Thanks for the update!
Does it mean that you and your colleagues won't be focusing on things in the experimental branch anymore?
(Btw, in the long term, I agree with the goal you've mentioned 👍 )

@camilb commented Sep 30, 2016

@colhom
I have a working version based on #608, #629 and the latest changes from master.
At the moment I'm running the tests using:

  • 3 Controllers in Multi-AZ with LoadBalancer
  • 3 external ETCD nodes configured with SSL in Multi-AZ
  • 3 Workers in Multi-AZ
  • CNI
  • Hyperkube v1.4.0_coreos.2

Tested using the latest stable and alpha OS releases.

For my current setup this works pretty well. Next I will try to put ETCD in an Auto Scaling Group with daily S3 backups.

If there is someone interested in it, I have a working branch:
https://github.com/camilb/coreos-kubernetes/tree/1.4.0-ha

@iwarp commented Oct 3, 2016

Hmmm, interesting change of direction. What's the guidance for a highly available cluster that I should be using right now, then? I was planning on this PR being complete before going live on a new project.

Do I need to create multiple k8s clusters and load balance across them, which is closer to the k8s federation approach?

How have others approached this?

@apenney commented Oct 4, 2016

Echoing the previous response: I put off deploying Kubernetes until this PR was finished, but now I find myself unsure how to proceed. I might just look at kops at this point, until there's a clearer vision for coreos-kubernetes. We looked at enterprise support from CoreOS, but this project was a blocker for us being able to proceed with that (in case it helps justify anyone spending time on laying out a clear roadmap).

@iamsaso commented Oct 4, 2016

I was pushing back the date to deploy CoreOS Kubernetes to production while waiting for this PR to land in master. We would like to have a procedure for doing future updates, and this seemed like a good solution. Any plans on providing some guidance on how updates will be done with future releases?
