This repository has been archived by the owner on Sep 4, 2021. It is now read-only.

Discrete etcd cluster #544

Closed

Conversation

colhom
Contributor

@colhom colhom commented Jun 13, 2016

Work in progress- no review necessary yet (unless you're bored ;)

@aaronlevy aaronlevy changed the title Discrete etcd cluster WIP: Discrete etcd cluster Jun 27, 2016
@wolfeidau

Would love to test this once you're happy with it, if you're looking for volunteers.

@colhom
Contributor Author

colhom commented Jul 5, 2016

@wolfeidau the TLS stuff still isn't done yet, so it's not usable. I've been slacking on this one but will circle back very soon- if you have some time, I'm accepting sub-PRs on the colhom:discrete-etcd-cluster fork branch ;)

@mumoshu
Contributor

mumoshu commented Jul 22, 2016

@colhom What a PR! Great job. Btw, would you mind sharing which part of the TLS stuff isn't done? (I've read through your code but couldn't figure it out.)

@@ -60,6 +60,8 @@ kmsKeyArn: "{{.KMSKeyARN}}"
# Price (Dollars) to bid for spot instances. Omit for on-demand instances.
# workerSpotPrice: "0.05"

# Instance type for etcd node

Contributor

nit: Missing descriptions and examples for newly added parameters

#etcdInstanceType: "m3.medium"

# Number of etcd nodes (should be 1, 3, 5, and so on to prevent split-brain)
# etcdCount: 1

# Disk size (GiB) for the etcd node root volume
#etcdRootVolumeSize: 30

# Disk size (GiB) for etcd node. This one is used by etcd to store its key-value data and write-ahead logs
#etcdDataVolumeSize: 30

dnsSuffix = "ec2.internal"
} else {
dnsSuffix = fmt.Sprintf("%s.compute.internal", config.Region)
}
Contributor

Encountered this yesterday and patched it exactly the same way you did 😄 Great.

This was referenced Aug 2, 2016
@colhom
Contributor Author

colhom commented Aug 2, 2016

@wolfeidau go ahead and give it a whirl! I recommend jumping ahead to #596 and grabbing the HA control plane stuff as well on top of this PR- should all be functional.

@colhom
Contributor Author

colhom commented Aug 4, 2016

Hi everyone- update on things.

I've been experimenting with full cluster upgrades and, frankly, statically defined etcd clusters (i.e. this PR) do not work well with CloudFormation upgrades.

The only strategy that really works well "in general" with CloudFormation updates is doing a rolling replacement on an autoscaling group- and currently we don't have a good solution for running etcd robustly in an ASG.

We're currently investigating various possibilities for implementing a dynamic/managed etcd cluster- maybe in an ASG, or more optimally hosting etcd on kubernetes and greatly simplifying the CloudFormation stack as a result. The eventual goal is to entirely decouple Kubernetes cluster level upgrades (control plane, kubelet, etcd) from updating the underlying AWS virtual hardware. Etcd on Kubernetes would be a step in the right direction.

Input?

@colhom
Contributor Author

colhom commented Aug 4, 2016

The band-aid for now would be to manually ensure that the etcd instances are never updated by cloudformation after they are created. This is crappy for a lot of reasons, but it's the only workable solution until we can have etcd nodes that aren't pets. At least CoreOS can take care of doing the OS upgrades 👍

On kube-aws up --upgrade, we could pull the existing resource definitions for the etcd instances and splice them into our outgoing template body in the update-stack API call, ensuring no doomed-to-fail update action is taken on those instances.

It's ugly, but it's a grokable hack that would allow us to limp along and take care of our named etcd pets until we have etcd on Kubernetes ready for prime time.

\cc @mumoshu @pieterlange

@pieterlange

@colhom i'm personally looking into implementing https://crewjam.com/etcd-aws/ as my backing cluster. I actually had a bad experience today losing etcd state across an autoscaling group managed by Monsanto's solution.

This is after 3+ months of etcd-aws-cluster running perfectly, btw. There's a lot of work involved in getting software production ready and @crewjam's solution seems the most fleshed out, so after tryouts chances are i'll be going with that. The integrated backup (recovery from complete cluster failure) is the selling point for me after today.

@colhom
Contributor Author

colhom commented Aug 4, 2016

@pieterlange sorry to hear about your data loss. Out of morbid curiosity- did your etcd instances fail health checks due to load, causing the ASG to replace them one-by-one until your data was gone?

@pieterlange

Nothing like that, but i did run them (3) on t2.nanos (with 20 workers and 1 master). I spun up the etcd cluster separately and never really "managed" it.. it's a small miracle i got away with it as long as i did :)

@crewjam

crewjam commented Aug 7, 2016

@pieterlange and everyone -- I'd be happy to work with you on this. We did a bit of research and planning on production use of etcd on AWS which resulted in the repo & blog describing it. One caveat though -- for reasons not at all related to the work, our focus shifted a bit and so we haven't actually put it into production. We got close, and we are still planning to, but it hasn't happened just yet.

@colhom colhom mentioned this pull request Aug 9, 2016
@colhom colhom changed the title WIP: Discrete etcd cluster Discrete etcd cluster Sep 8, 2016
@colhom
Contributor Author

colhom commented Sep 8, 2016

@dghubble @aaronlevy @pbx0 PTAL

Resources map[string]map[string]interface{} `json:"Resources"`
}

func (c *Cluster) lockEtcdResources(cfSvc *cloudformation.CloudFormation, stackBody string) (string, error) {
Contributor

Can you add a bit more documentation explaining why this is necessary / how the functionality will be replaced (for those of us unfamiliar with cloudformation).

Contributor Author

@colhom colhom Sep 9, 2016

Problem: We must ensure that a cloudformation update never affects the etcd cluster instances.

Reason: When updating ec2 instances, cloudformation may choose a destroy and replace strategy, which has the potential to corrupt the etcd cluster.

Solution: lockEtcdResources is a 50-line function which only runs on cluster update- it queries the current state of the etcd cluster and ensures the definition does not change across CloudFormation updates.

Concerns over fragility: This is a simple, mechanical operation- but even those can go wrong.

Assuming:

  • lockEtcdResources() errors out: the update will fail
  • lockEtcdResources() munges the etcd cluster definition: in that case, CloudFormation will attempt to update the etcd cluster instances and fail, because they are marked as "non-updatable" when first created. The CloudFormation update will then roll back, and the user will be left with a non-updatable cluster.

In either case, the failure mode is "cluster non-updatable"- which is exactly where every user is today anyway.
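
As a rough illustration of the splicing idea, here is a minimal Go sketch. It is not the actual implementation in this PR: the helper name spliceEtcdResources, the "Etcd" logical-ID prefix, and the fact that it takes an already-fetched deployed template body (instead of querying CloudFormation like lockEtcdResources does) are all assumptions made for this example.

package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// stackTemplate mirrors the Resources map shown in the diff above.
type stackTemplate struct {
	Resources map[string]map[string]interface{} `json:"Resources"`
}

// spliceEtcdResources copies every etcd-related resource definition from the
// currently deployed template into the outgoing template, so a CloudFormation
// update can never modify (and therefore never replace) those instances.
func spliceEtcdResources(deployedBody, outgoingBody string) (string, error) {
	var deployed, outgoing stackTemplate
	if err := json.Unmarshal([]byte(deployedBody), &deployed); err != nil {
		return "", fmt.Errorf("parsing deployed template: %v", err)
	}
	if err := json.Unmarshal([]byte(outgoingBody), &outgoing); err != nil {
		return "", fmt.Errorf("parsing outgoing template: %v", err)
	}

	for logicalID, definition := range deployed.Resources {
		// "Etcd" prefix is a hypothetical naming convention for the etcd
		// instances and their EBS volumes.
		if strings.HasPrefix(logicalID, "Etcd") {
			outgoing.Resources[logicalID] = definition
		}
	}

	// This sketch only round-trips the Resources section; a real
	// implementation would preserve the rest of the template untouched.
	out, err := json.Marshal(outgoing)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	deployed := `{"Resources":{"EtcdInstance0":{"Type":"AWS::EC2::Instance","Properties":{"ImageId":"ami-old"}}}}`
	outgoing := `{"Resources":{"EtcdInstance0":{"Type":"AWS::EC2::Instance","Properties":{"ImageId":"ami-new"}}}}`

	body, err := spliceEtcdResources(deployed, outgoing)
	if err != nil {
		panic(err)
	}
	// EtcdInstance0 keeps ami-old, so the update leaves that instance alone.
	fmt.Println(body)
}

Either way, if the splice errors out the update simply fails, which matches the failure mode described above.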

Contributor Author

^^ Something to that effect?

Contributor

sgtm

@dghubble
Member

dghubble commented Sep 9, 2016

Is the separate etcd optional? Are you planning to remove/merge this feature with self-hosted Kubernetes? They seem at odds. Also, the addition of etcd TLS here seems orthogonal.

@pieterlange

Self-hosted would be really nice, but speaking as a lowly sysadmin: there are enough moving parts in this system already. I need to be able to rely on/trust a stable etcd. I'm even proposing to leave this outside of the management of kube-aws altogether in #629!


for encKey in $(find /etc/etcd2/ssl/*.pem.enc);do
tmpPath="/tmp/$(basename $encKey).tmp"
docker run --net host --rm -v /etc/etcd2/ssl:/etc/etcd2/ssl quay.io/coreos/awscli aws --region {{.Region}} kms decrypt --ciphertext-blob fileb://$encKey --output text --query Plaintext | base64 --decode > $tmpPath
Contributor

would prefer that we use rkt here

@colhom
Contributor Author

colhom commented Sep 9, 2016

Is the separate etcd optional?

No, it's very much mandatory. If etcd is not on separate machines, Kubernetes EC2 instances cannot be updated, which is a big deal.

It's a pre-req for dynamically scaling compute resources for both control plane and worker pool- something AWS users (including me!) are very excited about.

Until self-hosted etcd is stable and production ready, this is a requirement.

Are you planning to remove/merge this feature with self-hosted Kubernetes, they seem at odds?

Bootkube currently provisions a static single-node etcd instance and points the control plane at it- and that's really it. A single point of non-fault-tolerant failure for your otherwise fault-tolerant control plane.

If anything, this would complement self-hosted by making etcd more robust (>1 node, multi-zone, etc.) and really not change anything about how self-hosted works currently, as neither bootkube nor this PR deals at all with self-hosting etcd.

Also, the addition of etcd TLS here seems orthogonal

If an etcd interface is exposed to any sort of network, TLS encryption is a must. We have just exposed etcd to the subnet, hence encryption becomes mandatory. If it's only listening on localhost, it's not as big of a deal.
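
For illustration, this is roughly the sort of etcd2 TLS configuration that implies once clients and peers talk over the subnet. The paths, hostname, and node name below are made up for this sketch; the PR's actual cloud-config may wire the flags differently.

# Sketch only: serve client and peer traffic over TLS with client-cert auth.
/usr/bin/etcd2 \
  --name=etcd0 \
  --listen-client-urls=https://0.0.0.0:2379 \
  --advertise-client-urls=https://etcd0.internal.example:2379 \
  --cert-file=/etc/etcd2/ssl/etcd.pem \
  --key-file=/etc/etcd2/ssl/etcd-key.pem \
  --client-cert-auth \
  --trusted-ca-file=/etc/etcd2/ssl/ca.pem \
  --peer-cert-file=/etc/etcd2/ssl/etcd.pem \
  --peer-key-file=/etc/etcd2/ssl/etcd-key.pem \
  --peer-client-cert-auth \
  --peer-trusted-ca-file=/etc/etcd2/ssl/ca.pem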

@dghubble

@colhom
Contributor Author

colhom commented Sep 9, 2016

Update- just did some fishing around in bootkube- it appears the multi-node bootkube example actually uses a discrete/external multi-node etcd cluster- just like this PR! So, in reality, this would make bootkube on AWS look almost exactly like the multi-node vagrant setup, just plus TLS!

@aaronlevy
Contributor

@colhom that review pass is done. Mostly questions at this stage.

@dghubble @pieterlange I'm personally on the fence about this functionality as well - but @colhom is championing for this to be included, and I won't explicitly block on "I feel iffy about this" if @colhom is willing to support long-term.

My general concerns:

This feels somewhat like a big workaround. If cloud-formation does not support this type of deployment, why are we fighting it? Parsing out the etcd sections to keep them from being changed really does seem fragile to me - but to be fair, I'm not very familiar with cloud-formation.

We have near-term plans to move toward self-hosted installations, and this diverges from that work. This may be fine, it just means we will likely end up with a separate tool (which, to be fair, may have happened anyway).

kube-aws could generate multiple cloud-formations in the case that you want a larger etcd cluster. I'm still unclear what we gain from trying to force this all into a single deployment (when we have to work around the fact that it doesn't actually work natively).

The single master likely works well for a lot of people (kube-up, gke, kops, kube-aws (until now)). I also still question that in-place updates cannot occur - can you remind me what the issue is here? Something about EBS volumes not being reattached after a cloud-formation update?

@aaronlevy
Contributor

aaronlevy commented Sep 9, 2016

Also, fwiw etcd access was always available to the worker nodes (for flannel) -- so the TLS stuff isn't just now necessary due to running multiple etcd nodes -- it's always been a need. But along those same lines - as of v1.4 (and self-hosted) we're moving toward no worker etcd access at all - so another temporary divergence here.

owner: root:root
content: |
#!/bin/bash -e
if [[ "$(wipefs -n -p $1 | grep ext4)" == "" ]];then
Contributor

If you actually want to do this (it scares me), use full paths to binaries and long flag names.

Contributor Author

I'm trying to get a less sketchy solution in place.

@@ -61,7 +61,7 @@ coreos:
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/opt/bin/ext4-format-volume-once /dev/xvdf
ExecStart=/usr/sbin/mkfs.ext4 -p /dev/xvdf
Contributor Author

@aaronlevy final answer- wdyt?

Contributor Author

disregard my previous comment- after talking to crawford I've reverted to first checking with blkid, and then leaving the -p arg to mkfs for extra safety.
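
For reference, a minimal sketch of what a blkid-guarded format-once script could look like. The device path, binary paths, and exact mkfs flags here are illustrative assumptions, not the PR's final script.

#!/bin/bash -e
# Sketch: only create a filesystem if blkid reports no existing signature on
# the device, so a reboot or re-run can never reformat an etcd data volume.
device="${1:-/dev/xvdf}"

if /usr/sbin/blkid "${device}" >/dev/null 2>&1; then
    echo "${device} already carries a filesystem signature, leaving it untouched"
else
    echo "no filesystem signature found on ${device}, formatting as ext4"
    /usr/sbin/mkfs.ext4 "${device}"
fi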

@dghubble
Member

@colhom Sure thing. It sounds perfectly reasonable for kube-aws to focus on the needs of this particular cluster design, platform, and users. My notes just highlight conflicts between different cluster designs, but it sounds like that's a non-goal, please disregard.

@colhom
Contributor Author

colhom commented Sep 13, 2016

@aaronlevy any more review items? I've addressed the conditional ext4 formatting logic in a significantly more robust way in the latest two commits, and am running a conformance test on this now.

@pieterlange

I do not wish to delay this feature because i think the work is just fine. But maybe it would be better to spin up etcd as a separate cloudformation template - this is exactly what i've been doing and what's reflected in my PR (#629). I guess it kind of depends on whether you can live with having to spin up multiple cloudformation templates. I can :-)

@colhom
Contributor Author

colhom commented Sep 15, 2016

@pieterlange I have experimented with the separate cf stack approach for etcd- there is too much added complexity in cross-correlating the resources and really not much to be gained (in my opinion).

I've discussed with @aaronlevy and this will not merge into master until after the v1.4 hyperkube and associated changes are in (weeks).

To unblock further development in this area, we will be merging this PR, HA control plane, cluster upgrades and @mumoshu 's kubectl drain PR on a separate experimental branch. After master is updated to v1.4, we will look at merging the experimental stuff back in.

I'll update the community once that branch is ready and passing conformance/e2e tests.

@aaronlevy
Contributor

I am still wondering about this:

I also still question that in-place updates cannot occur - can you remind me what the issue is here? Something about ebs volumes will not be reattached after a cloud-formation update?

@aaronlevy
Contributor

Also, can you expand on:

there is too much added complexity in cross-correlating the resources

Is it not just etcd network addressability (hostnames/ips) that needs to be shared?

@pieterlange

Is it not just etcd network addressability (hostnames/ips) that needs to be shared?

that's basically all i'm doing in the PoC PR. Obviously the situation is a little more hairy than that and my PR is just a complete hack. But it worked fine for me.

@colhom
Contributor Author

colhom commented Sep 15, 2016

I also still question that in-place updates cannot occur - can you remind me what the issue is here? Something about ebs volumes will not be reattached after a cloud-formation update?

If a CloudFormation update decides it needs to replace a machine, there is no way to ensure the new machine receives the same EBS volume as the old machine. So our single controller/EBS-backed etcd combo is non-updatable via CloudFormation. More generally, any machine running etcd is non-updatable, whether or not it's EBS backed. That's why this PR separates etcd out and quarantines the instances off from updates, until we can host on Kubernetes.

Also, can you expand on:

there is too much added complexity in cross-correlating the resources

Beyond what @pieterlange mentioned, additional points of difficulty:

  • security group references for ingress rules between the two stacks would have to be done manually by making read calls to the CloudFormation API.
  • having the stacks share subnets is quite tricky, so in reality we'd need separate subnets for etcd instances and k8s instances, each managed by their respective stack. Then there would be ACLs in the middle, along with a shared VPC. It basically means you have to create the stacks in a specific order and tear them down in a specific order, with some manual steps ahead of time to "detach" them from each other. Right now we have a single API call to create, and a single API call to destroy.

In general, breaking the "everything is in this single cloudformation stack" invariant greatly changes the nature of kube-aws, and is out of scope for a problem that will be solved in a better way within the next 6 months (or so ™️).

The hackery necessary to quarantine etcd instances from cloudformation updates boils down to masking updates to a few select subtrees of a cloudformation json document. It fits neatly within the confines of a single function which is referenced in one place. You can literally comment out the function and its one reference, and it's gone.

\cc @aaronlevy

@pieterlange

Against my best intentions i feel like my comment has started to derail/undermine the good work being done here. I trust @colhom's judgment on this item.

I suggest discussing the pros and cons of managing etcd in a separate cloudformation stack under my PR or a new issue so we don't further pollute this thread. I like the idea of hosting etcd separately and robustly but that's only because of 1 nasty prior experience.

@mumoshu
Contributor

mumoshu commented Sep 29, 2016

@colhom Thanks for your great work here and there. I'm really looking forward to seeing this get merged 👍
Sorry for bothering you but did you have any chance to set the experimental branch up?
I'd like to experiment #667 on top of it.

@dzavalkinolx

We created our fork of kube-aws and merged this request. After ~3 months of experience with it, I have to admit that this is not the way to go in production. Because the 3 etcd nodes have static IPs, there is no option to update them without some manual interaction. Since there is no backup of etcd state to somewhere outside the cluster (for example S3), manual interactions are quite risky. I decided not to upgrade the etcd nodes (by manually editing the generated stack template JSON file). As a result, the controller and worker nodes in one of the clusters I manage are now running the latest CoreOS AMI with the Dirty COW bug fixed, but the etcd nodes run an old AMI.

I believe the right approach is described in this article: https://crewjam.com/etcd-aws/, i.e. a separate ASG for etcd nodes and etcd state mirroring/backup to S3. In this case etcd nodes can also be upgraded by running a cloudformation stack update without any manual interaction.

@camilb

camilb commented Nov 2, 2016

@dzavalkinolx I also forked kube-aws and merged #608 and #629 plus some changes. Also started to integrate crewjam's etcd-aws. Got it working, but without TLS for now. I see more people are looking at this solution and I think the best place to track the progress is here: kubernetes-retired/kube-aws#9

@pieterlange

@dzavalkinolx @camilb i made a new issue for this at kubernetes-retired/kube-aws#27. I think we're not far off from a solution but it's nice to hear everyone's input on this.
