This repository has been archived by the owner on Jul 30, 2021. It is now read-only.

Clarify status of bootkube experimental self-hosted etcd #738

Closed
dghubble opened this issue Oct 18, 2017 · 16 comments

Comments

@dghubble
Contributor

dghubble commented Oct 18, 2017

I'd like to track the status of --experimental-self-hosted-etcd to address some recurring questions.

bootkube self-hosted etcd (as in, self-hosting the cluster's etcd on the cluster itself) is a challenging balancing act to get right. It's not been done before. There are complex interaction edge cases for keeping self-hosted etcd working.

A while back, bootkube experimental self-hosted etcd efforts were paused, while the etcd team's general etcd-operator work continues (i.e. self-hosting etcd, but not for the cluster's etcd itself). The bootkube --experimental-self-hosted-etcd feature is not currently progressing toward stability or production readiness, but neither is it disappearing (don't panic). Whether efforts will be resumed in the future as etcd-operator and Kubernetes mature is unknown at this time.

At this time there are no changes:

  • Production clusters should continue to use an on-host (traditional) etcd cluster
  • bootkube will still run self-hosted etcd e2e tests and try to keep them working. Best effort.
  • If you are using self-hosted etcd for the Kubernetes etcd, it would be prudent to consider gradually getting back to on-host etcd. We cannot make promises about its stability if you run into dragons.
@aaronlevy
Contributor

cc @xiang90 @hongchaodeng

@xiang90
Contributor

xiang90 commented Oct 18, 2017

feature is not currently progressing toward stability or production readiness

This is not true. The major problem is not self-hosted etcd itself, but etcd HA setup for Kubernetes in general. It is challenging for self-hosted etcd, since users would expect etcd/apiserver to recover from failures automatically (which requires manual restarts of the API server/etcd right now in non-self-hosted mode).

The failing test simply shows that the API server fails to reconnect to etcd after failure injection. We are working with Kubernetes upstream to get etcd HA setup stable, so that we can continue the work.

Simply speaking, it is still progressing.

@xiang90
Contributor

xiang90 commented Oct 18, 2017

I agree with the rest of the suggestions.

@dghubble
Contributor Author

I've tried to draw the distinction between general self-hosted etcd work (which is progressing towards stability) and bootkube's self-hosted etcd feature. The latter is the harder challenge: etcd runs on Kubernetes while also serving as that same cluster's etcd, and the two have to interact well (the complexities are related to Kubernetes itself; I'm not blaming etcd). That's what's being discussed here.

I think we're all on the same page. Up until this point we've used the term "self-hosted etcd" for both. Didn't mean to imply your operator work was in question.

@xiang90
Contributor

xiang90 commented Oct 18, 2017

What I am saying is that the work being done on Kubernetes helps both self-hosted and non-self-hosted. And that work currently blocks the progress of self-hosted more, for the reason I just said. Once things get unblocked, most of the issues we see here (in bootkube tests) will resolve themselves and require almost no action. But, yes, we are not actively changing anything in bootkube for the self-hosted etcd case.

@dghubble
Contributor Author

Any interest in looking into the flakes we're seeing on the self-hosted etcd setup?

@janwillies
Contributor

@xiang90 thanks for the background info, much appreciated. Do you have a link to the upstream discussion?

@jamiehannaford
Contributor

jamiehannaford commented Oct 24, 2017

@dghubble @xiang90 What are some of the edge cases that prevent the etcd-operator being used to manage the cluster etcd? Does this mean the operator is not recommended for cluster etcd, or that you're cautious about the effect it has on SLAs etc?

If it's a lack of HA functionality, my preference would be to work on that upstream rather than pivoting towards other form factors. I think there's a lot of value in using the operator (declarative upgrades, rollbacks, backups, etc.).

@aaronlevy
Contributor

aaronlevy commented Oct 25, 2017

We recently disabled the self-hosted etcd tests, but opened an issue tracking that we should consider re-enabling them in the future when we can spend more time on the feature: #748

@klausenbusk
Contributor

I just experienced something weird with our k8s cluster yesterday (bootkube v0.7.0) and self-hosted etcd.

The kube-etcd and kube-etcd-client services had somehow vanished, so the cluster kind of broke down. kubectl get pod worked with certificate, but everything requiring etcd (write?) seemed to be broken.

After adding kube-etcd-000{1...3}.kube-etcd.kube-system.svc.cluster.local to every etcd pod and etcd-operator, the k8s cluster was working again. I then copied the kube-etcd/kube-etcd-client service spec from a test cluster and everything was now working as expected again.

But this does not inspire much confidence in self-hosted etcd. :/
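
For readers following along, the brace pattern in the comment above stands for three per-member DNS names. A minimal sketch that spells them out, assuming the range is 1 through 3 inclusive and that the zero-padded kube-etcd-000N naming continues as written; the function name and its parameters are hypothetical and only illustrate the pattern:

```python
def etcd_member_hosts(first=1, last=3, service="kube-etcd",
                      namespace="kube-system", cluster_domain="cluster.local"):
    """Expand the member pattern into fully qualified in-cluster DNS names.

    Hypothetical helper: the defaults mirror the names quoted in the thread,
    and the 4-digit zero padding is an assumption based on "kube-etcd-000N".
    """
    return [
        f"{service}-{i:04d}.{service}.{namespace}.svc.{cluster_domain}"
        for i in range(first, last + 1)
    ]


if __name__ == "__main__":
    # Print one name per line, e.g. to compare against what each pod resolves.
    for host in etcd_member_hosts():
        print(host)
```

Each name has the form member.service.namespace.svc.cluster-domain, which is why the names stop resolving once the kube-etcd service is gone.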

@dghubble
Contributor Author

dghubble commented Nov 7, 2017

I've dropped the self-hosted etcd option from projects matchbox#655 and typhoon#13 if that clarifies things for anyone. Updated the issue to reflect that self-hosted etcd is no longer tested as well.

@ericchiang
Contributor

Self-hosted etcd has been removed from bootkube for the foreseeable future (#828).

@carlos-licea

Naïve question, why was this removed? I've been requested to have a similar setup. I have to wonder, what's the case for on-host etcd rather than self-contained?

@redbaron
Contributor

@carlos-licea this very thread is wholly dedicated to your question

@carlos-licea

@redbaron not trying to be dismissive, but this thread is a bit vague and I cannot use it as-is to argue we should have on-host etcd, other than to say, "weird things might happen".
