Swap protokube's gossip implementation from weaveworks/mesh to memberlist #7436
Comments
I have a WIP implementation on my branch (which is based on the fork we use): https://github.com/wish/kops/compare/release-1.12_fork...jacksontj:gossip_dns?expand=1

As of now you can create a cluster with the new gossip setup, and I have tested up to 800 peers in the cluster with <3% CPU utilization (compared to 60-100%) and ~60 MB of RAM (compared to ~4 GB). At this point the main piece missing is the config plumbing. Since this is an entirely different gossip protocol, it needs to run on different ports etc. So I'm thinking the easiest mechanism would be to add a flag for which gossip to use (we'd probably have to keep the default on mesh for now; I'm not sure how we'd change a default like that).

As for migrating a cluster, we have 2 options: (1) we make protokube spawn N gossips with config, so you could add the second and then remove the first, or (2) we just document a somewhat manual procedure where you start protokube on the box a second time with different flags to do the migration. I imagine the first is preferable, but it is significantly more work. Any feedback would be greatly appreciated :)
@justinsb what is your opinion on this? Do you see any possible problems?
After thinking about it some more, I think I'll have to make protokube support 2 gossip implementations at a time. I'm thinking basically to add the following flags (names aren't set, just conveying the idea):
This way the switch would be: (1) add the second gossip to the masters, (2) switch the primary gossip on the masters, (3) swap the nodes' primary, (4) remove the secondary from the masters. The alternatives all seem to end up requiring a lot of manual hand-holding, which would make the upgrade process more painful. Unfortunately this approach adds a bunch more options, but it is probably more likely to get people to upgrade.
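The four-step switch above can be sketched as a small state model. This is only an illustration of the ordering, not protokube code: the type names (`GossipConfig`, `ClusterState`), the helper `MigrationSteps`, and the reading that step (2) demotes the old gossip to secondary on the masters are all assumptions made for the example. It checks one property at each step: the masters still run whatever gossip the nodes use as their primary.

```go
package main

import "fmt"

// GossipProtocol names a gossip implementation (illustrative values only).
type GossipProtocol string

const (
	Mesh       GossipProtocol = "mesh"
	Memberlist GossipProtocol = "memberlist"
)

// GossipConfig is a hypothetical per-role setting: one primary gossip
// plus an optional secondary. An empty Secondary means "none".
type GossipConfig struct {
	Primary   GossipProtocol
	Secondary GossipProtocol
}

// Speaks reports whether this config runs protocol p as primary or secondary.
func (c GossipConfig) Speaks(p GossipProtocol) bool {
	return p != "" && (c.Primary == p || c.Secondary == p)
}

// ClusterState holds the (hypothetical) gossip config of masters and nodes.
type ClusterState struct {
	Masters GossipConfig
	Nodes   GossipConfig
}

// NodesReachMasters is true when the masters run the nodes' primary gossip.
func (s ClusterState) NodesReachMasters() bool {
	return s.Masters.Speaks(s.Nodes.Primary)
}

// MigrationSteps returns the cluster state after each of the four steps.
// Assumption: switching the masters' primary keeps the old gossip as secondary.
func MigrationSteps(from, to GossipProtocol) []ClusterState {
	s := ClusterState{
		Masters: GossipConfig{Primary: from},
		Nodes:   GossipConfig{Primary: from},
	}
	var steps []ClusterState

	// (1) add the second gossip to the masters
	s.Masters.Secondary = to
	steps = append(steps, s)

	// (2) switch the primary gossip on the masters
	s.Masters.Primary, s.Masters.Secondary = to, from
	steps = append(steps, s)

	// (3) swap the nodes' primary
	s.Nodes.Primary = to
	steps = append(steps, s)

	// (4) remove the secondary from the masters
	s.Masters.Secondary = ""
	steps = append(steps, s)

	return steps
}

func main() {
	for i, s := range MigrationSteps(Mesh, Memberlist) {
		fmt.Printf("step %d: masters=%+v nodes=%+v reachable=%t\n",
			i+1, s.Masters, s.Nodes, s.NodesReachMasters())
	}
}
```

Under these assumptions the ordering matters: dropping the masters' secondary before the nodes have swapped would leave the nodes on a gossip the masters no longer run.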
@jacksontj, I see that the implementation of this is checked in. The protokube config options are Can you list the exact set of steps that could be used to migrate an existing cluster from mesh to memberlist? I can test this for you and report the results here. |
Definitely, I was planning on writing up some docs, but as you have noticed I haven't had the time to write it up nicely yet :) So here are the raw notes I used when upgrading:
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/reopen @jacksontj are you still around and want to get this into shape? :) |
@olemarkus: Reopened this issue. In response to this:
We discussed this in office hours. The suggestion is to investigate whether we can simplify the gossip stack, by using an approach inspired by the no-DNS work: #14711 |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close not-planned |
@k8s-triage-robot: Closing this issue, marking it as "Not Planned". In response to this:
1. Describe IN DETAIL the feature/behavior/change you would like to see.
I just finished spending a few days looking into some weird scale issues with weaveworks/mesh (#7427), and after doing so I see a LOT of problems that make me question its viability as a kops component. Specifically, the issue I've hit is that it runs into trouble once you reach ~200 nodes in the cluster, which isn't all that large. Other projects have actually moved off of mesh (e.g. alertmanager).
I imagine that we'd need to add support for both and have flags to swap between gossip implementations, but that is all doable. So if this is something people are open to, I would be up for spending some time to make it happen.