
AddPeer method to add additional peers after initial startup #2019

Closed
wants to merge 1 commit into from

Conversation

@jacksontj commented Sep 4, 2019

This way you can add more peers without having to stop/start the cluster.

Signed-off-by: Thomas Jackson <jacksontj.89@gmail.com>
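For context, here is a minimal sketch of the shape of method this PR proposes (hypothetical code, not the actual diff; the real cluster.Peer wraps hashicorp/memberlist, so the join call is injected here to keep the sketch self-contained):

```go
package main

import (
	"fmt"
	"sync"
)

// Peer is a simplified stand-in for cluster.Peer. The real type wraps
// *memberlist.Memberlist; here the join call is injected for illustration.
type Peer struct {
	mtx   sync.Mutex
	known map[string]struct{}        // addresses we have been told about
	join  func(addrs []string) error // stands in for memberlist's join
}

func NewPeer(join func([]string) error) *Peer {
	return &Peer{known: map[string]struct{}{}, join: join}
}

// AddPeer records a new peer address after startup and attempts to join
// it, so the cluster can grow without a stop/start.
func (p *Peer) AddPeer(addr string) error {
	p.mtx.Lock()
	p.known[addr] = struct{}{}
	p.mtx.Unlock()
	return p.join([]string{addr})
}

func main() {
	var joined []string
	p := NewPeer(func(addrs []string) error {
		joined = append(joined, addrs...)
		return nil
	})
	if err := p.AddPeer("10.0.0.4:9094"); err != nil {
		panic(err)
	}
	fmt.Println(joined) // prints [10.0.0.4:9094]
}
```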
@brian-brazil (Contributor)

This looks unused, where do you plan on adding this to the alertmanager?

@jacksontj (Author) commented Sep 4, 2019

I don't have plans to use this in AM directly (although this would enable refreshing peers without a restart). I'm trying to use the cluster library in another project. If you'd prefer I can split the dep out to another repo (or fork it) -- but I figured this was a very useful small addition.

This cluster library really is perfect for any other project that is moving from weaveworks/mesh -> memberlist -- which is exactly what I'm doing :)

@brian-brazil (Contributor)

In general we're okay with making private things public to facilitate reuse elsewhere, but not adding code that's dead for us.

@stuartnelson3 (Contributor)

The underlying memberlist library handles adding new peers when a new member connects. For example, if you start up 3 instances of AM that know only about each other, starting a 4th member later that knows any of the other 3 instances will result in them all being in the same cluster.

I would suggest building on top of memberlist, like what was done in AM.
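The transitive membership described above can be shown with a toy model (this is not the memberlist API, just a simulation of the behavior: every member shares the same member set, so joining via any one member joins the whole cluster):

```go
package main

import (
	"fmt"
	"sort"
)

// Cluster is a toy model of gossip membership. In memberlist the member
// set is propagated by gossip; here we model it as one shared set.
type Cluster struct{ members map[string]bool }

func NewCluster(seed string) *Cluster {
	return &Cluster{members: map[string]bool{seed: true}}
}

// Join adds a new node via the address of any existing member.
func (c *Cluster) Join(newNode, via string) bool {
	if !c.members[via] {
		return false // the contact address must be a live member
	}
	c.members[newNode] = true
	return true
}

func (c *Cluster) Members() []string {
	out := make([]string, 0, len(c.members))
	for m := range c.members {
		out = append(out, m)
	}
	sort.Strings(out)
	return out
}

func main() {
	c := NewCluster("am-1")
	c.Join("am-2", "am-1")
	c.Join("am-3", "am-2")
	c.Join("am-4", "am-3") // the 4th member only knows am-3, yet joins all
	fmt.Println(c.Members()) // prints [am-1 am-2 am-3 am-4]
}
```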

@jacksontj (Author)

So is your suggestion to fork that library out? If so, would you guys want to maintain that in prom or should I just move it to my personal space?

cc @fabxc since you did the initial implementation -- might have some opinions.

@jacksontj (Author)

@stuartnelson3 I'm working on migrating another project from weaveworks/mesh -> memberlist (kubernetes/kops#7436) which was the same transition AM went through (#1232). This cluster library already does what I want with a few small exceptions:

  • not able to add peers after start (this PR)
  • all metrics are hard-coded alertmanager_ (i can live with this)

So as mentioned before I'm happy to fork out the library (as it seems very generally useful) -- I just figured this would be something AM would be interested in keeping (or maybe just as another repo in the prometheus project?).

@brian-brazil (Contributor)

> all metrics are hard-coded alertmanager_ (i can live with this)

If you were to use this library, this would be correct as it's code from the AM. Metric names don't change due to being in a different binary.

@jacksontj (Author)

> all metrics are hard-coded alertmanager_ (i can live with this)
>
> If you were to use this library, this would be correct as it's code from the AM. Metric names don't change due to being in a different binary.

Fair enough; so really the piece I'm missing to use the library is this single method -- so the question is would you be okay to merge this? Or should I split the cluster stuff out?

@stuartnelson3 (Contributor)

> not able to add peers after start (this PR)

I'm not sure what you mean by this. You can already add more peers without having to stop and restart the cluster. Every new instance you start, as long as it knows the address of one other running instance, will join the cluster that running instance is in.

@jacksontj (Author) commented Sep 4, 2019

> not able to add peers after start (this PR)
>
> I'm not sure what you mean by this. You can already add more peers without having to stop and restart the cluster. Every new instance you start, as long as it knows the address of one other running instance, will join the cluster that running instance is in.

The issue arises when there is some sort of split brain. If the cluster splits, there is no re-discovery mechanism that will eventually un-split it :) So in my usage we have a background goroutine that runs that rediscovery every so often to re-add missing nodes we know about (from our seed list).

Old implementation for reference: https://github.com/kubernetes/kops/blob/master/protokube/pkg/gossip/mesh/gossip.go#L114

@jacksontj (Author) commented Sep 4, 2019

To clarify further: Peer.refresh() sort of does this (https://github.com/prometheus/alertmanager/blob/master/cluster/cluster.go#L416), but you can never change the initial known-peers list -- which is a problem for me, since my seed list comes from an external source and updates over time.

@stuartnelson3 (Contributor)

Whenever a peer joins, be it from the initial list or later, it is added to the peers slice via the callback executed by memberlist: https://github.com/prometheus/alertmanager/blob/master/cluster/cluster.go#L447

If the connection to a peer is lost, peerLeave() is called and the peer is added to the failedPeers slice (https://github.com/prometheus/alertmanager/blob/master/cluster/cluster.go#L488); attempts to reconnect to failed peers are then made periodically: https://github.com/prometheus/alertmanager/blob/master/cluster/cluster.go#L396. This is intended to handle connection loss (whether from a network partition or otherwise).
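In outline, the bookkeeping those callbacks perform might look like this (a simplified sketch with hypothetical names, not the actual cluster.go code):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// tracker mirrors the outline of the membership bookkeeping: joins land
// in peers, losses move the address to failedPeers with a timestamp so
// a periodic reconnect loop can retry it.
type tracker struct {
	mtx         sync.Mutex
	peers       map[string]bool
	failedPeers map[string]time.Time
}

func newTracker() *tracker {
	return &tracker{peers: map[string]bool{}, failedPeers: map[string]time.Time{}}
}

// peerJoin is the callback run when memberlist reports a new member.
func (t *tracker) peerJoin(addr string) {
	t.mtx.Lock()
	defer t.mtx.Unlock()
	t.peers[addr] = true
	delete(t.failedPeers, addr) // a rejoin clears the failed record
}

// peerLeave is the callback run when a member is lost.
func (t *tracker) peerLeave(addr string) {
	t.mtx.Lock()
	defer t.mtx.Unlock()
	delete(t.peers, addr)
	t.failedPeers[addr] = time.Now() // remember it for periodic reconnects
}

func main() {
	tr := newTracker()
	tr.peerJoin("b:9094")
	tr.peerLeave("b:9094")
	_, failed := tr.failedPeers["b:9094"]
	fmt.Println(failed) // prints true
}
```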

If this is not working, please file a bug report so we can fix it.

@jacksontj (Author)

Reconnection is only attempted until a timeout, after which the failed peer is removed (https://github.com/prometheus/alertmanager/blob/master/cluster/cluster.go#L245) -- which means that if the split brain lasted longer than the timeout, the cluster would require manual intervention (restarting some peers) to get them re-joining. In my particular case removing that timeout isn't viable either, since nodes come and go pretty regularly -- I only want some nodes (on the order of 10, versus a cluster size of 1k+) to always attempt reconnection.

@stuartnelson3 (Contributor)

This doesn't really fit in with the Alertmanager's code, then. I would recommend either forking what we have, or looking at the code in serf, since the code here is just a lightly adapted version of that.

@jacksontj (Author)

I did take a look at serf, but it seems too opinionated for my use case. Since this doesn't seem to be of interest, I'll just fork the cluster lib into a separate repo.

@jacksontj (Author)

For anyone else that has a similar need my fork is at https://github.com/jacksontj/memberlistmesh
