AddPeer method to add additional peers after initial startup #2019
This way you can add more peers without having to stop/start the cluster. Signed-off-by: Thomas Jackson <jacksontj.89@gmail.com>
f0141de to b1cce6b
This looks unused. Where do you plan on adding this to the Alertmanager?
I don't have plans to use this in AM directly (although this would enable refreshing peers without a restart). I'm trying to use the This
In general we're okay with making private things public to facilitate reuse elsewhere, but not with adding code that's dead for us.
The underlying memberlist library handles adding new peers when a new member connects. For example, if you start up 3 instances of AM that know only about each other, then starting a 4th member later that knows of any of the other 3 instances will result in all of them being in the same cluster. I would suggest building on top of
So is your suggestion to fork that library out? If so, would you guys want to maintain that in prom or should I just move it to my personal space? cc @fabxc since you did the initial implementation -- might have some opinions.
@stuartnelson3 I'm working on migrating another project from weaveworks/mesh -> memberlist (kubernetes/kops#7436) which was the same transition AM went through (#1232). This
So as mentioned before I'm happy to fork out the library (as it seems very generally useful) -- I just figured this would be something AM would be interested in keeping (or maybe just as another repo in the prometheus project?).
If you were to use this library, this would be correct as it's code from the AM. Metric names don't change due to being in a different binary.
Fair enough; so really the piece I'm missing to use the library is this single method -- so the question is would you be okay to merge this? Or should I split the cluster stuff out?
I'm not sure what you mean by this. You can already add more peers without having to stop and restart the cluster. Every new instance you start, as long as it knows the address of one other running instance, will join the cluster that running instance is in.
The issue comes when there is some sort of split-brain. If the cluster splits, there is no re-discovery period that will get the halves to eventually un-split :) So in my usage we have a background goroutine that does that rediscovery every so often to re-add missing nodes that we know about (from our seed list). Old implementation for reference: https://github.com/kubernetes/kops/blob/master/protokube/pkg/gossip/mesh/gossip.go#L114
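The background rediscovery described above could be sketched roughly as follows. This is a minimal illustration, not the code from the PR or from kops: the `Joiner` interface, `missingSeeds`, `rejoinOnce`, and `rejoinLoop` are hypothetical names introduced here. `Join` is modeled on the shape of memberlist's `Join(existing []string) (int, error)`, while `Members` is simplified to return plain addresses.

```go
package main

import (
	"context"
	"time"
)

// Joiner is a hypothetical interface for this sketch. Join follows the shape
// of memberlist's Join (addresses in, number contacted out); Members is
// simplified to return plain address strings.
type Joiner interface {
	Join(addrs []string) (int, error)
	Members() []string
}

// missingSeeds returns the seeds that are absent from the current member
// list, preserving the order of the seed list.
func missingSeeds(seeds, members []string) []string {
	present := make(map[string]struct{}, len(members))
	for _, m := range members {
		present[m] = struct{}{}
	}
	var missing []string
	for _, s := range seeds {
		if _, ok := present[s]; !ok {
			missing = append(missing, s)
		}
	}
	return missing
}

// rejoinOnce attempts to join every seed that is currently missing and
// reports how many seeds it attempted.
func rejoinOnce(j Joiner, seeds []string) (int, error) {
	missing := missingSeeds(seeds, j.Members())
	if len(missing) == 0 {
		return 0, nil
	}
	_, err := j.Join(missing)
	return len(missing), err
}

// rejoinLoop runs rejoinOnce every interval until ctx is cancelled.
func rejoinLoop(ctx context.Context, j Joiner, seeds []string, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Errors are ignored in this sketch; the next tick retries.
			_, _ = rejoinOnce(j, seeds)
		}
	}
}
```

A caller would start it with something like `go rejoinLoop(ctx, cluster, seeds, time.Minute)`, where `cluster` is whatever wraps the gossip library and the interval is chosen by the caller. After a split heals at the network level, the next tick re-joins any seed that dropped out of the member list.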
And to clarify more
whenever a peer joins, be it from the initial list or joining later, it is added to the peers slice via the callback executed by if connection to a peer is lost, if this is not working, please file a bug report so we can fix it |
This reconnect is done until a timeout, after which the peer is removed (https://github.com/prometheus/alertmanager/blob/master/cluster/cluster.go#L245) -- which would mean that if the split brain lasted longer than the timeout, the cluster would require manual intervention to restart some peers to get them re-joining. In my particular case removing that timeout is not viable either, as the nodes come and go pretty regularly -- so I only want some nodes (order 10 -- as opposed to cluster size of 1k+) to always attempt reconnection.
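The policy asked for here (drop ordinary failed peers after a timeout, but keep retrying a small fixed set forever) can be illustrated with a short sketch. Everything in it is hypothetical and is not how Alertmanager's cluster package is written: `failedPeer`, `retryPolicy`, and the `AlwaysRetry` set are names made up for this example.

```go
package main

import "time"

// failedPeer is a hypothetical record of a peer we lost contact with and
// how long it has been unreachable.
type failedPeer struct {
	Addr    string
	DownFor time.Duration
}

// retryPolicy is a hypothetical policy: ordinary peers are dropped once they
// have been down for Timeout or longer, while peers listed in AlwaysRetry
// (for example a small seed list) are never dropped.
type retryPolicy struct {
	Timeout     time.Duration
	AlwaysRetry map[string]bool
}

// keep reports whether reconnection to fp should still be attempted.
func (p retryPolicy) keep(fp failedPeer) bool {
	if p.AlwaysRetry[fp.Addr] {
		return true
	}
	return fp.DownFor < p.Timeout
}

// prune returns the failed peers that should remain on the retry list,
// in their original order.
func (p retryPolicy) prune(failed []failedPeer) []failedPeer {
	kept := make([]failedPeer, 0, len(failed))
	for _, fp := range failed {
		if p.keep(fp) {
			kept = append(kept, fp)
		}
	}
	return kept
}
```

With roughly 10 addresses in `AlwaysRetry` and 1k+ short-lived nodes, the retry list stays bounded by the timeout for the churning nodes while the seeds are retried indefinitely, so a long split can still heal without a manual restart.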
This doesn't really fit in with the Alertmanager's code, then. I would recommend either forking what we have, or looking at the code in serf, since the code here is just a lightly adapted version of that.
I did take a look at serf, but it seems to be too opinionated for my use case. Since this doesn't seem to be of interest, I'll just fork the cluster lib into a separate repo.
For anyone else who has a similar need, my fork is at https://github.com/jacksontj/memberlistmesh