This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

Production Quality Deployment #9

Closed · 16 of 22 tasks
pieterlange opened this issue Oct 29, 2016 · 23 comments

Comments
@pieterlange
Contributor

pieterlange commented Oct 29, 2016

This is a copy of the old issue, with a lot of boxes ticked thanks to contributions by @colhom, @mumoshu, @cgag, and many others. The old list follows; I will update it where necessary.

The goal is to offer a "production ready" solution for provisioning a CoreOS Kubernetes cluster on AWS. These are the major functionality blockers identified so far:

@pieterlange
Contributor Author

Initial comments on this:

The etcd setup still needs some love. I have a monkeypatch at https://github.com/pieterlange/kube-aws/tree/feature/external-etcd for pointing the cluster to an external etcd cluster (I'm using https://crewjam.com/etcd-aws/). This is not a clean solution. I think we need to:

Some work is being done to have an entirely self-hosted kubernetes cluster (with etcd running as petset in kubernetes itself?) but from an ops PoV this feels like way too many moving parts at the moment.

As for elasticsearch/heapster I tend to move in the exact opposite direction: I'd rather host elasticsearch inside of the cluster. I'm also not sure if this should be part of a default installation.

@camilb
Contributor

camilb commented Oct 29, 2016

@pieterlange A couple of weeks ago I tried to integrate your work on coreos/coreos-kubernetes#629 with coreos/coreos-kubernetes#608 and https://github.com/crewjam/etcd-aws.

Current setup:

  • Controllers in ASG with External ELB
  • Workers in ASG.
  • ETCD nodes in ASG with Internal ELB.
  • Moved all the cloudformation definitions from etcd-aws into stack-template.json in kube-aws, so you can configure everything from kube-aws.

Without TLS it works great and the cluster recovers fine. With TLS, etcd works but doesn't recover and also doesn't remove terminated nodes. These still need to be fixed in etcd-aws, especially in the backup part.

Apart from this, creating the DNS record for the etcd internal ELB is still a manual process at the moment (I set an alias record after the ELB is created), but this can be fixed quickly later.
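For reference, here's a minimal boto3 sketch of what automating that alias record could look like. The hosted zone ID, ELB name, and record name are hypothetical placeholders, and this is not part of kube-aws itself:

```python
import boto3

# Hypothetical identifiers -- substitute your own hosted zone, ELB name and record name.
HOSTED_ZONE_ID = "Z1EXAMPLE"            # private hosted zone for the cluster
ETCD_ELB_NAME = "etcd-internal"         # internal ELB fronting the etcd nodes
RECORD_NAME = "etcd.internal.example.com."

elb = boto3.client("elb")
route53 = boto3.client("route53")

# Look up the ELB's DNS name and its canonical hosted zone.
lb = elb.describe_load_balancers(LoadBalancerNames=[ETCD_ELB_NAME])["LoadBalancerDescriptions"][0]

# Upsert an alias A record pointing at the internal ELB.
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": RECORD_NAME,
                "Type": "A",
                "AliasTarget": {
                    "HostedZoneId": lb["CanonicalHostedZoneNameID"],
                    "DNSName": lb["DNSName"],
                    "EvaluateTargetHealth": False,
                },
            },
        }]
    },
)
```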

If anyone is interested in working on this, they can pick up some of the changes I already made to etcd-aws and kube-aws:

https://github.com/camilb/etcd-aws/tree/ssl

https://github.com/camilb/coreos-kubernetes/tree/etcd-asg

@pieterlange
Contributor Author

This is great! Thanks @camilb, this will definitely save time if the project goes in that direction.

For reference, there are some notes on self-hosted etcd in the self-hosted design docs. Maybe @aaronlevy can chip in on whether external etcd is a good deployment strategy.

@mumoshu
Contributor

mumoshu commented Nov 8, 2016

I've posted my thoughts on why I might want to have the "Dedicated controller subnets and routetables" thing at #35 (comment)

@aholbreich

aholbreich commented Nov 15, 2016

Support deploying to existing VPC (and maybe existing subnet as well?) -- DONE #346

Where is this referenced? Where can I read more about it?

@pieterlange
Contributor Author

This was not properly linked @aholbreich, but it's in coreos/coreos-kubernetes#346

Deploying to existing subnets was skipped, but if you think you need this please add use cases to #52.

@mumoshu
Contributor

mumoshu commented Nov 17, 2016

Just curious but does everyone want auto-scaling of workers based on cluster-autoscaler to be in the list?
Or are you going with AWS native autoscaling?

Currently, cluster-autoscaler wouldn't work as one might expect in kube-aws-created clusters.

The autoscaling group should span 1 availability zone for the cluster autoscaler to work. If you want to distribute workloads evenly across zones, set up multiple ASGs, with a cluster autoscaler for each ASG. At the time of writing this, cluster autoscaler is unaware of availability zones and although autoscaling groups can contain instances in multiple availability zones when configured so, the cluster autoscaler can't reliably add nodes to desired zones. That's because AWS AutoScaling determines which zone to add nodes which is out of the control of the cluster autoscaler. For more information, see kubernetes-retired/contrib#1552 (comment).

https://github.com/kubernetes/contrib/blob/master/cluster-autoscaler/cloudprovider/aws/README.md#deployment-specification

Actually, that's why I originally started the work for #46.
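To make that zone constraint concrete, here's a rough boto3 sketch that checks whether each worker ASG spans exactly one availability zone, which is the precondition the README above describes. The ASG name prefix is a hypothetical placeholder:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Hypothetical prefix used to pick out the kube-aws worker ASGs.
WORKER_ASG_PREFIX = "kube-aws-worker-"

paginator = autoscaling.get_paginator("describe_auto_scaling_groups")
for page in paginator.paginate():
    for asg in page["AutoScalingGroups"]:
        name = asg["AutoScalingGroupName"]
        if not name.startswith(WORKER_ASG_PREFIX):
            continue
        zones = asg["AvailabilityZones"]
        if len(zones) == 1:
            print(f"{name}: OK, single zone {zones[0]}")
        else:
            print(f"{name}: spans {zones}; cluster-autoscaler may add nodes to unintended zones")
```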

@AlmogBaku
Contributor

Regarding heapster/elasticsearch:
Currently, in order to use heapster+elasticsearch you can use the managed ES offered by AWS; however, managed ES doesn't support scripted fields, which are very useful (and almost required) for understanding the metrics and setting alarms.

@mumoshu
Contributor

mumoshu commented Mar 22, 2017

Set up controller and worker AutoscalingGroups to recover from ec2 instance failures

I believe this is already supported in kube-aws as of today

@mumoshu
Contributor

mumoshu commented Mar 22, 2017

Dedicated controller subnets and routetables TBD

This is supported since v0.9.4

@mumoshu
Contributor

mumoshu commented Mar 22, 2017

Kubelet TLS bootstrapping

This is WIP in #414

@mumoshu
Contributor

mumoshu commented Mar 22, 2017

Node pools #46
Cloudformation update/create success signalling #49

Already supported

@mumoshu
Contributor

mumoshu commented Mar 22, 2017

@AlmogBaku Thanks for the information!
That's a huge restriction 😢

@mumoshu
Contributor

mumoshu commented Mar 22, 2017

Provision AWS ElasticSearch cluster

Btw, I'm using GCP Stackdriver Logging to aggregate log messages from my production kube-aws clusters. When there are much nicer alternatives like Stackdriver, do we really need to support ES out of the box in kube-aws?

@mumoshu
Contributor

mumoshu commented Mar 22, 2017

Automatically remove nodes when instances are rotated out of ASG

@pieterlange Is the above item meant to cover rolling updates of worker/controller/etcd nodes?

@pieterlange
Contributor Author

I don't think we need to support ES in kube-aws, but we could have some recommendations.

Automatically remove nodes when instances are rotated out of ASG

I think this referred to removing nodes from the cluster state where that's still required (e.g. etcd member lists). Removing/draining kubelets is already supported 👍.
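As a rough illustration of what cleaning up the etcd member list could look like when instances are rotated out of the ASG, here's a sketch that compares etcd members against running EC2 instances and drops the stale ones. The endpoint, the instance tag, and the use of etcdctl's v2-style `member` commands are assumptions, not kube-aws's actual mechanism:

```python
import re
import subprocess

import boto3

# Assumed etcd client endpoint (e.g. the internal ELB discussed above).
ETCD_ENDPOINT = "https://etcd.internal.example.com:2379"

def live_private_ips():
    """Private IPs of running instances carrying a hypothetical Role=etcd tag."""
    ec2 = boto3.client("ec2")
    pages = ec2.get_paginator("describe_instances").paginate(
        Filters=[
            {"Name": "tag:Role", "Values": ["etcd"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    return {
        instance["PrivateIpAddress"]
        for page in pages
        for reservation in page["Reservations"]
        for instance in reservation["Instances"]
        if "PrivateIpAddress" in instance
    }

def etcd_members():
    """Parse `etcdctl member list` (v2-style output) into (member_id, peer_ip) pairs."""
    out = subprocess.check_output(
        ["etcdctl", "--endpoints", ETCD_ENDPOINT, "member", "list"], text=True
    )
    return re.findall(r"^([0-9a-f]+):.*peerURLs=https?://([\d.]+):\d+", out, re.MULTILINE)

ips = live_private_ips()
for member_id, peer_ip in etcd_members():
    if peer_ip not in ips:
        # The backing instance is gone, so drop the member from the cluster state.
        subprocess.check_call(
            ["etcdctl", "--endpoints", ETCD_ENDPOINT, "member", "remove", member_id]
        )
```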

@danielfm
Contributor

Figure out what we're going to do about automated CSR signing in kube-aws (necessary for self-healing and autoscaling)

I think we should take the same approach as kubeadm, which is to automatically approve all requests sent via a specific bootstrap token, making sure via RBAC (already supported by kube-aws) that this token can only be used for CSRs.
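A minimal sketch of that policy, with `list_pending_csrs` and `approve_csr` as hypothetical wrappers around the Kubernetes certificates API rather than real kube-aws code. The only decision it encodes is: auto-approve CSRs whose requester authenticated with a bootstrap-token identity, and let RBAC confine that token to CSR creation:

```python
# Hypothetical helpers: list_pending_csrs() yields CSR objects with .spec.groups
# and .metadata.name; approve_csr(name) marks the request approved.

BOOTSTRAP_GROUP = "system:bootstrappers"  # group bound to bootstrap-token identities

def should_auto_approve(csr):
    """Approve only requests made with a bootstrap-token identity."""
    return BOOTSTRAP_GROUP in csr.spec.groups

def reconcile(list_pending_csrs, approve_csr):
    for csr in list_pending_csrs():
        if should_auto_approve(csr):
            approve_csr(csr.metadata.name)
```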

@danielfm
Contributor

danielfm commented Mar 22, 2017

Provision AWS ElasticSearch cluster

I have a working solution for ingesting cluster-wide logs to Sumologic. When I have some time, I could add this to kube-aws as an experimental feature.

The same could be done for GCP Stackdriver Logging, and other vendors.

@cknowles
Contributor

One potential problem with logging is the number of solutions out there. I can recommend fluentd-kubernetes-daemonset and will likely be helping to add GCP Stackdriver support to that soon. However, I know some have strong opinions on using other logging tools/frameworks. It might be good to provide some recommendations in the docs.

@mumoshu
Contributor

mumoshu commented Jul 13, 2017

Enable decommissioning of kubelets when instances are rotated out of the ASG (experimental support for node draining is included now)
Automatically remove nodes when instances are rotated out of ASG

I believe these two items in the description are now resolved thanks to @danielfm: a node now pends the rolling update of an ASG while the node drainer drains its pods.
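For illustration, here's a rough sketch of that flow, assuming an ASG lifecycle hook on instance termination. The hook and ASG names are hypothetical, and this is not the exact code kube-aws ships:

```python
import subprocess

import boto3

autoscaling = boto3.client("autoscaling")

# Hypothetical names; in kube-aws these would be wired up via CloudFormation.
ASG_NAME = "kube-aws-worker"
LIFECYCLE_HOOK = "node-drainer"

def drain_and_release(instance_id, node_name):
    """Drain the node's pods, then let the ASG continue terminating the instance."""
    subprocess.check_call([
        "kubectl", "drain", node_name,
        "--ignore-daemonsets", "--delete-local-data", "--force",
    ])
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=LIFECYCLE_HOOK,
        AutoScalingGroupName=ASG_NAME,
        InstanceId=instance_id,
        LifecycleActionResult="CONTINUE",
    )
```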

@amitkumarj441

@pieterlange I'm willing to take on the "Provision AWS Elasticsearch cluster" task. Can you give an update on it, so that I can start working on it and submit PRs soon?

@pieterlange
Contributor Author

@amitkumarj441 I'm not very active in kube-aws anymore and will close the issue, as most of the items have been fixed by now.

I am personally running my Elasticsearch clusters inside of Kubernetes and I also think that's the best way forward, but knock yourself out ;-).

@amitkumarj441

Thanks @pieterlange for letting me know about this.
