This repository has been archived by the owner on Sep 4, 2021. It is now read-only.

Discrete etcd cluster #544

Closed

Conversation

colhom
Contributor

@colhom colhom commented Jun 13, 2016

Work in progress- no review necessary yet (unless you're bored ;)

@aaronlevy aaronlevy changed the title Discrete etcd cluster WIP: Discrete etcd cluster Jun 27, 2016
@wolfeidau

Would love to test this once you're happy with it, if you're looking for volunteers.

@colhom
Contributor Author

colhom commented Jul 5, 2016

@wolfeidau the TLS stuff still isn't done yet, so it's not usable. I've been slacking on this one but will circle back very soon- if you have some time, I'm accepting sub-PRs on the colhom:discrete-etcd-cluster fork branch ;)

@mumoshu
Contributor

mumoshu commented Jul 22, 2016

@colhom What a PR! Great job. Btw, would you mind sharing which part of the TLS stuff isn't done? (I've read through your code but couldn't figure it out.)

@@ -60,6 +60,8 @@ kmsKeyArn: "{{.KMSKeyARN}}"
# Price (Dollars) to bid for spot instances. Omit for on-demand instances.
# workerSpotPrice: "0.05"

# Instance type for etcd node

Contributor

nit: Missing descriptions and examples for newly added parameters

#etcdInstanceType: "m3.medium"

# Number of etcd nodes (should be 1, 3, 5, and so on to prevent split-brain)
# etcdCount: 1

# Disk size (GiB) for the etcd node root volume
#etcdRootVolumeSize: 30

# Disk size (GiB) for etcd node. This one is used by etcd to store its key-value data and write-ahead logs
#etcdDataVolumeSize: 30

dnsSuffix = "ec2.internal"
} else {
dnsSuffix = fmt.Sprintf("%s.compute.internal", config.Region)
}
Contributor

Encountered this yesterday and patched it exactly the same way you did 😄 Great.

This was referenced Aug 2, 2016
@colhom
Contributor Author

colhom commented Aug 2, 2016

@wolfeidau go ahead and give it a whirl! I recommend jumping ahead to #596 and grabbing the HA control plane stuff as well on top of this PR- should all be functional.

@colhom
Contributor Author

colhom commented Aug 4, 2016

Hi everyone- update on things.

I've been experimenting with full cluster upgrades and, frankly, statically defined etcd clusters (i.e. this PR) do not work well with CloudFormation upgrades.

The only strategy that really works well "in general" with CloudFormation updates is doing a rolling replacement on an autoscaling group- and currently we don't have a good solution for running etcd robustly in an ASG.

We're currently investigating various possibilities for implementing a dynamic/managed etcd cluster- maybe in an ASG, or more optimally hosting etcd on kubernetes and greatly simplifying the CloudFormation stack as a result. The eventual goal is to entirely decouple Kubernetes cluster level upgrades (control plane, kubelet, etcd) from updating the underlying AWS virtual hardware. Etcd on Kubernetes would be a step in the right direction.

Input?

@colhom
Contributor Author

colhom commented Aug 4, 2016

The band-aid for now would be to manually ensure that the etcd instances are never updated by cloudformation after they are created. This is crappy for a lot of reasons, but it's the only workable solution until we can have etcd nodes that aren't pets. At least CoreOS can take care of doing the OS upgrades 👍

On kube-aws up --upgrade, we could pull the existing resource definitions for the etcd instances and splice them into our outgoing template body in the update-stack API call, ensuring no doomed-to-fail update action is taken on those instances.

It's ugly, but it's a grokable hack that would allow us to limp along and take care of our named etcd pets until we have etcd on Kubernetes ready for prime time.

\cc @mumoshu @pieterlange

@pieterlange

@colhom i'm personally looking into implementing https://crewjam.com/etcd-aws/ as my backing cluster. I actually had a bad experience today losing etcd state across an autoscaling group managed by Monsanto's solution.

This is after 3+ months of etcd-aws-cluster running perfectly, btw. There's a lot of work involved in getting software production ready and @crewjam's solution seems the most fleshed out, so after tryouts chances are i'll be going with that. The integrated backup (recovery from complete cluster failure) is the selling point for me after today.

@colhom
Contributor Author

colhom commented Aug 4, 2016

@pieterlange sorry to hear about your data loss. Out of morbid curiosity- did your etcd instances fail health checks due to load, causing the ASG to replace them one-by-one until your data was gone?

@pieterlange

Nothing like that, but i did run them (3) on t2.nanos (with 20 workers and 1 master). I spun up the etcd cluster separately and never really "managed" it.. it's a small miracle i got away with it as long as i did :)

@crewjam

crewjam commented Aug 7, 2016

@pieterlange and everyone -- I'd be happy to work with you on this. We did a bit of research and planning on production use of etcd on AWS which resulted in the repo & blog describing it. One caveat though -- for reasons not at all related to the work, our focus shifted a bit and so we haven't actually put it into production. We got close, and we are still planning to, but it hasn't happened just yet.

@colhom colhom mentioned this pull request Aug 9, 2016
@colhom colhom changed the title WIP: Discrete etcd cluster Discrete etcd cluster Sep 8, 2016
@colhom
Contributor Author

colhom commented Sep 8, 2016

@dghubble @aaronlevy @pbx0 PTAL

Resources map[string]map[string]interface{} `json:"Resources"`
}

func (c *Cluster) lockEtcdResources(cfSvc *cloudformation.CloudFormation, stackBody string) (string, error) {
Contributor

Can you add a bit more documentation explaining why this is necessary / how the functionality will be replaced (for those of us unfamiliar with cloudformation).

Contributor Author

@colhom colhom Sep 9, 2016

Problem: We must ensure that a cloudformation update never affects the etcd cluster instances.

Reason: When updating ec2 instances, cloudformation may choose a destroy and replace strategy, which has the potential to corrupt the etcd cluster.

Solution: lockEtcdResources is a 50-line function which only runs on cluster update- it queries the current state of the etcd cluster and ensures the definition does not change across CloudFormation updates.

Concerns over fragility: This is a simple, mechanical operation- but even those can go wrong.

Assuming:

  • lockEtcdResources() errors out: the update will fail
  • lockEtcdResources() munges the etcd cluster definition: in that case, CloudFormation will attempt to update the etcd cluster instances and fail, because they are marked as "non-updatable" when first created. The CloudFormation update will then roll back, and the user will be left with a non-updatable cluster.

In either case, the failure mode is "cluster non-updatable"- which is exactly where every user is today anyway.
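
As a rough illustration of the splicing idea, here is a minimal Go sketch. It is not the actual implementation in this PR: the helper name spliceEtcdResources, the "Etcd" logical-ID prefix, and the fact that it takes an already-fetched deployed template body (instead of querying CloudFormation like lockEtcdResources does) are all assumptions made for this example.

package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// stackTemplate mirrors the Resources map shown in the diff above.
type stackTemplate struct {
	Resources map[string]map[string]interface{} `json:"Resources"`
}

// spliceEtcdResources copies every etcd-related resource definition from the
// currently deployed template into the outgoing template, so a CloudFormation
// update can never modify (and therefore never replace) those instances.
func spliceEtcdResources(deployedBody, outgoingBody string) (string, error) {
	var deployed, outgoing stackTemplate
	if err := json.Unmarshal([]byte(deployedBody), &deployed); err != nil {
		return "", fmt.Errorf("parsing deployed template: %v", err)
	}
	if err := json.Unmarshal([]byte(outgoingBody), &outgoing); err != nil {
		return "", fmt.Errorf("parsing outgoing template: %v", err)
	}

	for logicalID, definition := range deployed.Resources {
		// "Etcd" prefix is a hypothetical naming convention for the etcd
		// instances and their EBS volumes.
		if strings.HasPrefix(logicalID, "Etcd") {
			outgoing.Resources[logicalID] = definition
		}
	}

	// This sketch only round-trips the Resources section; a real
	// implementation would preserve the rest of the template untouched.
	out, err := json.Marshal(outgoing)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	deployed := `{"Resources":{"EtcdInstance0":{"Type":"AWS::EC2::Instance","Properties":{"ImageId":"ami-old"}}}}`
	outgoing := `{"Resources":{"EtcdInstance0":{"Type":"AWS::EC2::Instance","Properties":{"ImageId":"ami-new"}}}}`

	body, err := spliceEtcdResources(deployed, outgoing)
	if err != nil {
		panic(err)
	}
	// EtcdInstance0 keeps ami-old, so the update leaves that instance alone.
	fmt.Println(body)
}

Either way, if the splice errors out the update simply fails, which matches the failure mode described above.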

Contributor Author

^^ Something to that effect?

Contributor

sgtm

@dghubble
Member

dghubble commented Sep 9, 2016

Is the separate etcd optional? Are you planning to remove/merge this feature with self-hosted Kubernetes? They seem at odds. Also, the addition of etcd TLS here seems orthogonal.

@pieterlange

Self-hosted would be really nice, but speaking as a lowly sysadmin: there are enough moving parts in this system already. I need to be able to rely on/trust a stable etcd. I'm even proposing to leave this outside of the management of kube-aws altogether in #629!


for encKey in $(find /etc/etcd2/ssl/*.pem.enc);do
tmpPath="/tmp/$(basename $encKey).tmp"
docker run --net host --rm -v /etc/etcd2/ssl:/etc/etcd2/ssl quay.io/coreos/awscli aws --region {{.Region}} kms decrypt --ciphertext-blob fileb://$encKey --output text --query Plaintext | base64 --decode > $tmpPath
Contributor

would prefer that we use rkt here

@colhom
Contributor Author

colhom commented Sep 9, 2016

Is the separate etcd optional?

No, it's very much mandatory. If etcd is not on separate machines, Kubernetes EC2 instances cannot be updated, which is a big deal.

It's a pre-req for dynamically scaling compute resources for both control plane and worker pool- something AWS users (including me!) are very excited about.

Until self-hosted etcd is stable and production ready, this is a requirement.

Are you planning to remove/merge this feature with self-hosted Kubernetes, they seem at odds?

Bootkube currently provisions a static single-node etcd instance and points the control plane at it- and that's really it. A single point of non-fault-tolerant failure for your otherwise fault-tolerant control plane.

If anything, this would complement self-hosted by making etcd more robust (>1 node, multi-zone, etc.) and really not change anything about how self-hosted works currently, as neither bootkube nor this PR deals at all with self-hosting etcd.

Also, the addition of etcd TLS here seems orthogonal

If an etcd interface is exposed to any sort of network, TLS encryption is a must. We have just exposed etcd to the subnet, hence encryption becomes mandatory. If it's only listening on localhost, it's not as big of a deal.
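
For illustration, this is roughly the sort of etcd2 TLS configuration that implies once clients and peers talk over the subnet. The paths, hostname, and node name below are made up for this sketch; the PR's actual cloud-config may wire the flags differently.

# Sketch only: serve client and peer traffic over TLS with client-cert auth.
/usr/bin/etcd2 \
  --name=etcd0 \
  --listen-client-urls=https://0.0.0.0:2379 \
  --advertise-client-urls=https://etcd0.internal.example:2379 \
  --cert-file=/etc/etcd2/ssl/etcd.pem \
  --key-file=/etc/etcd2/ssl/etcd-key.pem \
  --client-cert-auth \
  --trusted-ca-file=/etc/etcd2/ssl/ca.pem \
  --peer-cert-file=/etc/etcd2/ssl/etcd.pem \
  --peer-key-file=/etc/etcd2/ssl/etcd-key.pem \
  --peer-client-cert-auth \
  --peer-trusted-ca-file=/etc/etcd2/ssl/ca.pem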

@dghubble

@colhom
Contributor Author

colhom commented Sep 9, 2016

Update- just did some fishing around in bootkube- it appears the multi-node bootkube example actually uses a discrete/external multi-node etcd cluster- just like this PR! So, in reality, this would make bootkube on AWS look almost exactly like the multi-node vagrant setup, just plus TLS!

@aaronlevy
Contributor

@colhom that review pass is done. Mostly questions at this stage.

@dghubble @pieterlange I'm personally on the fence about this functionality as well - but @colhom is championing for this to be included, and I won't explicitly block on "I feel iffy about this" if @colhom is willing to support long-term.

My general concerns:

This feels somewhat like a big workaround. If cloud-formation does not support this type of deployment, why are we fighting it? Parsing out the etcd sections to keep them from being changed really does seem fragile to me - but to be fair, I'm not very familiar with cloud-formation.

We have near-term plans to move toward self-hosted installations, and this diverges from that work. This may be fine, it just means we will likely end up with a separate tool (which, to be fair, may have happened anyway).

kube-aws could generate multiple cloud-formations in the case that you want a larger etcd cluster. I'm still unclear what we gain from trying to force this all into a single deployment (when we have to work around the fact that it doesn't actually work natively).

The single master likely works well for a lot of people (kube-up, gke, kops, kube-aws (until now)). I also still question that in-place updates cannot occur - can you remind me what the issue is here? Something about EBS volumes not being reattached after a cloud-formation update?

@aaronlevy
Contributor

aaronlevy commented Sep 9, 2016

Also, fwiw etcd access was always available to the worker nodes (for flannel) -- so the TLS stuff isn't just now necessary due to running multiple etcd nodes -- it's always been a need. But along those same lines - as of v1.4 (and self-hosted) we're moving toward no worker etcd access at all - so another temporary divergence here.

owner: root:root
content: |
#!/bin/bash -e
if [[ "$(wipefs -n -p $1 | grep ext4)" == "" ]];then
Contributor

If you actually want to do this (it scares me), use full paths to binaries and long flag names.

Contributor Author

I'm trying to get a less sketchy solution in place.

@@ -61,7 +61,7 @@ coreos:
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/opt/bin/ext4-format-volume-once /dev/xvdf
ExecStart=/usr/sbin/mkfs.ext4 -p /dev/xvdf
Contributor Author

@aaronlevy final answer- wdyt?

Contributor Author

disregard my previous comment- after talking to crawford I've reverted to first checking with blkid, and then leaving the -p arg to mkfs for extra safety.
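
For reference, a minimal sketch of what a blkid-guarded format-once script could look like. The device path, binary paths, and exact mkfs flags here are illustrative assumptions, not the PR's final script.

#!/bin/bash -e
# Sketch: only create a filesystem if blkid reports no existing signature on
# the device, so a reboot or re-run can never reformat an etcd data volume.
device="${1:-/dev/xvdf}"

if /usr/sbin/blkid "${device}" >/dev/null 2>&1; then
    echo "${device} already carries a filesystem signature, leaving it untouched"
else
    echo "no filesystem signature found on ${device}, formatting as ext4"
    /usr/sbin/mkfs.ext4 "${device}"
fi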

@dghubble
Member

@colhom Sure thing. It sounds perfectly reasonable for kube-aws to focus on the needs of this particular cluster design, platform, and users. My notes just highlight conflicts between different cluster designs, but it sounds like that's a non-goal, please disregard.

@colhom
Contributor Author

colhom commented Sep 13, 2016

@aaronlevy any more review items? I've addressed the conditional ext4 formatting logic in a significantly more robust way in the latest two commits, and am running a conformance test on this now.

@pieterlange

I do not wish to delay this feature because i think the work is just fine. But maybe it would be better to spin up etcd as a separate cloudformation template - this is exactly what i've been doing and what's reflected in my PR (#629). I guess it kind of depends on whether you can live with having to spin up multiple cloudformation templates. I can :-)

@colhom
Contributor Author

colhom commented Sep 15, 2016

@pieterlange I have experimented with the separate cf stack approach for etcd- there is too much added complexity in cross-correlating the resources and really not much to be gained (in my opinion).

I've discussed with @aaronlevy and this will not merge into master until after the v1.4 hyperkube and associated changes are in (weeks).

To unblock further development in this area, we will be merging this PR, HA control plane, cluster upgrades and @mumoshu 's kubectl drain PR on a separate experimental branch. After master is updated to v1.4, we will look at merging the experimental stuff back in.

I'll update the community once that branch is ready and passing conformance/e2e tests.

@aaronlevy
Contributor

I am still wondering about this:

I also still question that in-place updates cannot occur - can you remind me what the issue is here? Something about ebs volumes will not be reattached after a cloud-formation update?

@aaronlevy
Contributor

Also, can you expand on:

there is too much added complexity in cross-correlating the resources

Is it not just etcd network addressability (hostnames/ips) that needs to be shared?

@pieterlange

Is it not just etcd network addressability (hostnames/ips) that needs to be shared?

that's basically all i'm doing in the PoC PR. Obviously the situation is a little more hairy than that and my PR is just a complete hack. But it worked fine for me.

@colhom
Contributor Author

colhom commented Sep 15, 2016

I also still question that in-place updates cannot occur - can you remind me what the issue is here? Something about ebs volumes will not be reattached after a cloud-formation update?

If a CloudFormation update decides it needs to replace a machine, there is no way to ensure the new machine receives the same EBS volume as the old machine. So our single controller/EBS-backed etcd combo is non-updatable via CloudFormation. More generally, any machine running etcd is non-updatable, whether or not it's EBS backed. That's why this PR separates etcd out and quarantines the instances off from updates, until we can host on Kubernetes.

Also, can you expand on:

there is too much added complexity in cross-correlating the resources

Beyond what @pieterlange mentioned, additional points of difficulty:

  • security group references for ingress rules between the two stacks would have to be done manually by making read calls to the CloudFormation API.
  • having the stacks share subnets is quite tricky, so in reality we'd need separate subnets for etcd instances and k8s instances, each managed by their respective stack. Then there would be ACLs in the middle, along with a shared VPC. It basically means you have to create the stacks in a specific order and tear them down in a specific order, with some manual steps ahead of time to "detach" them from each other. Right now we have a single API call to create, and a single API call to destroy.

In general, breaking the "everything is in this single cloudformation stack" invariant greatly changes the nature of kube-aws, and is out of scope for a problem that will be solved in a better way within the next 6 months (or so ™️).

The hackery necessary to quarantine etcd instances from cloudformation updates boils down to masking updates to a few select subtrees of a cloudformation json document. It fits neatly within the confines of a single function which is referenced in one place. You can literally comment out the function and its one reference, and it's gone.

\cc @aaronlevy

@pieterlange

Against my best intentions i feel like my comment has started to derail/undermine the good work being done here. I trust @colhom's judgment on this item.

I suggest discussing the pros and cons of managing etcd in a separate cloudformation stack under my PR or a new issue so we don't further pollute this thread. I like the idea of hosting etcd separately and robustly but that's only because of 1 nasty prior experience.

@mumoshu
Contributor

mumoshu commented Sep 29, 2016

@colhom Thanks for your great work here and there. I'm really looking forward to seeing this get merged 👍
Sorry for bothering you but did you have any chance to set the experimental branch up?
I'd like to experiment #667 on top of it.

@dzavalkinolx

We created our fork of kube-aws and merged this request. After ~3 months of experience with it, I have to admit that this is not the way to go in production. Because the 3 etcd nodes have static IPs, there is no option to update them without some manual interaction. Since there is no backup of etcd state to somewhere outside the cluster (for example S3), manual interactions are quite risky. I decided not to upgrade the etcd nodes (by manually editing the generated stack template JSON file). As a result, the controller and worker nodes in one of the clusters I manage are now running the latest CoreOS AMI with the Dirty COW bug fixed, but the etcd nodes run an old AMI.

I believe the right approach is described in this article: https://crewjam.com/etcd-aws/, i.e. a separate ASG for etcd nodes and etcd state mirroring/backup to S3. In this case etcd nodes can also be upgraded by running a cloudformation stack update without any manual interaction.

@camilb

camilb commented Nov 2, 2016

@dzavalkinolx I also forked kube-aws and merged #608 and #629 plus some changes. Also started to integrate crewjam's etcd-aws. Got it working, but without TLS for now. I see more people are looking at this solution and I think the best place to track the progress is here: kubernetes-retired/kube-aws#9

@pieterlange

@dzavalkinolx @camilb i made a new issue for this at kubernetes-retired/kube-aws#27. I think we're not far off from a solution but it's nice to hear everyone's input on this.
