
Adding a recovery mechanism for a split gossip cluster #2134

Merged: 1 commit, Apr 23, 2018

Conversation

@dani-docker (Contributor) commented Apr 6, 2018

Steps To Reproduce:

  • A cluster with 3 managers and 3 workers
  • Stop all managers in the cluster (leave the workers running)
  • Wait a minute and start all managers

Expected Results:

  • All nodes converge into a single gossip cluster

Actual Results:

  • The workers and the managers end up in two different gossip clusters. This causes overlay and unpredictable networking issues between containers on the two clusters.

Workaround:

  • Restart all the workers to force them to re-join the manager cluster,
    or
  • to avoid a restart, use the diagnostic tool with the /join endpoint.

PR fix:
This PR adds a goroutine that runs every minute and checks that at least one node from the bootstrap list (i.e., the manager nodes) is part of the cluster; if none is found, it attempts a /join for up to 10 seconds.
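
For reference, a rough sketch of what this periodic check looks like, assembled from the snippets quoted later in this thread (the exact structure and naming in the merged code may differ slightly):

	// Sketch: runs on a periodic timer (every minute). If none of the bootstrap
	// (manager) nodes is currently part of the gossip cluster, retry the join
	// for a bounded amount of time so a split cluster can heal itself.
	func (nDB *NetworkDB) rejoinClusterBootStrap() {
		nDB.RLock()
		if len(nDB.bootStrapIP) == 0 {
			// This node was not given a bootstrap list; nothing to check.
			nDB.RUnlock()
			return
		}

		bootStrapIPs := make([]string, 0, len(nDB.bootStrapIP))
		for _, bootIP := range nDB.bootStrapIP {
			for _, node := range nDB.nodes {
				if node.Addr.Equal(bootIP) {
					// At least one bootstrap node is already in the cluster.
					nDB.RUnlock()
					return
				}
			}
			bootStrapIPs = append(bootStrapIPs, bootIP.String())
		}
		nDB.RUnlock()

		// None of the bootstrap nodes are in the cluster: attempt a re-join,
		// but give up after 10 seconds so the next periodic run can retry.
		logrus.Debugf("rejoinClusterBootStrap, calling cluster join with bootStrap %v", bootStrapIPs)
		ctx, cancel := context.WithTimeout(nDB.ctx, 10*time.Second)
		defer cancel()
		nDB.retryJoin(ctx, bootStrapIPs)
	}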

Note:
While testing, I ran into an issue that I couldn't reliably reproduce: a leave event followed by a false-positive join event was received by a worker while the manager node was down. This caused an inconsistency in nDB.nodes, which in turn broke the logic in this PR.

@fcrisciani recommended the change below and will be looking into the problem in more detail.


	// If the node is not known to memberlist we cannot save any state for it,
	// because if it actually dies we won't receive any notification and we
	// will remain stuck with it.
	if _, ok := nDB.nodes[nEvent.NodeName]; !ok {
		logrus.Errorf("node: %s is unknown to memberlist", nEvent.NodeName)
		return false
	}

Signed-off-by: Dani Louca <dani.louca@docker.com>

@codecov-io commented Apr 6, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@c15b372).
The diff coverage is 34.48%.

@@            Coverage Diff            @@
##             master    #2134   +/-   ##
=========================================
  Coverage          ?   40.49%           
=========================================
  Files             ?      139           
  Lines             ?    22467           
  Branches          ?        0           
=========================================
  Hits              ?     9097           
  Misses            ?    12031           
  Partials          ?     1339
Impacted Files Coverage Δ
networkdb/networkdb.go 67.8% <ø> (ø)
networkdb/delegate.go 75% <100%> (ø)
networkdb/cluster.go 60.75% <26.92%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c15b372...744334d.

@fcrisciani requested a review from @ddebroy on April 6, 2018 22:04
@@ -397,7 +392,7 @@ func (d *delegate) GetBroadcasts(overhead, limit int) [][]byte {

func (d *delegate) LocalState(join bool) []byte {
	if join {
		// Update all the local node/network state to a new time to
		// Update all the local node/network state to a new time to


extra space :D

@dani-docker (Contributor, Author):

How this happened... Fat finger :)
I'll fix it

@@ -203,7 +206,7 @@ func (nDB *NetworkDB) clusterJoin(members []string) error {

	if _, err := mlist.Join(members); err != nil {
		// In case of failure, keep retrying join until it succeeds or the cluster is shutdown.
-		go nDB.retryJoin(members, nDB.stopCh)
+		go nDB.retryJoin(nDB.ctx, members)


with this change I don't think launching this routine makes sense anymore. If there is a failure here, the next attempt will happen in 60s or less via the other code path

@dani-docker (Contributor, Author):

Good point; clusterJoin is only called when the agent is initialized to join the bootstrap nodes. Unless it's time critical and we can't wait 60 seconds, it's now redundant.
Happy to take it out.
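
For context, with the context plumbing in this PR, retryJoin is bounded by whatever deadline the caller puts on the context. A rough sketch of its shape (the retry interval, the memberlist field name, and the bookkeeping inside the loop are assumptions of this sketch):

	// Sketch: keep retrying the memberlist join until it succeeds or the
	// passed-in context is cancelled or times out.
	func (nDB *NetworkDB) retryJoin(ctx context.Context, members []string) {
		t := time.NewTicker(time.Second) // retry interval is an assumption of this sketch
		defer t.Stop()

		for {
			select {
			case <-t.C:
				// nDB.memberlist is assumed to be the memberlist instance created at cluster init.
				if _, err := nDB.memberlist.Join(members); err != nil {
					logrus.Errorf("Failed to join memberlist %s on retry: %v", members, err)
					continue
				}
				return
			case <-ctx.Done():
				return
			}
		}
	}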

ctx, cancel := context.WithTimeout(nDB.ctx, 10*time.Second)
defer cancel()
nDB.retryJoin(ctx, bootStrapIPs)


extra new line

nDB.RUnlock()
logrus.Debugf("rejoinClusterBootStrap, calling cluster join with bootStrap %v", bootStrapIPs)
// All bootStrap nodes are not in the cluster, call memberlist join
ctx, cancel := context.WithTimeout(nDB.ctx, 10*time.Second)


we should put this time of 10s at the top

@dani-docker (Contributor, Author):

roger that
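
For illustration, the suggestion amounts to hoisting the 10-second value into a named constant at the top of the file; the constant name below is hypothetical:

	// Hypothetical name: define the rejoin timeout once at the top of the file
	// instead of hard-coding 10*time.Second at the call site.
	const rejoinClusterDuration = 10 * time.Second

	ctx, cancel := context.WithTimeout(nDB.ctx, rejoinClusterDuration)
	defer cancel()
	nDB.retryJoin(ctx, bootStrapIPs)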

	}
	// If the node is not known to memberlist we cannot save any state for it,
	// because if it actually dies we won't receive any notification and we
	// will remain stuck with it.
	if _, ok := nDB.nodes[nEvent.NodeName]; !ok {
Contributor:

Is the nDB.findNode check necessary above [L28] given this check?


Only to filter based on the Lamport clock.
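
For readers following along, a filter of that kind typically looks something like the sketch below; the field names (n.ltime, nEvent.LTime) are assumptions of this sketch, not quoted from the PR:

	// Sketch: drop events whose Lamport time is not newer than the state we
	// already hold for that node, so stale gossip doesn't overwrite fresh state.
	n := nDB.findNode(nEvent.NodeName)
	if n != nil && n.ltime >= nEvent.LTime {
		// Stale or duplicate event according to the Lamport clock; ignore it.
		return false
	}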

// rejoinClusterBootStrap is called periodically to check if all bootStrap nodes are active in the cluster,
// if not, call the cluster join to merge 2 separate clusters that are formed when all managers
// stopped/started at the same time
func (nDB *NetworkDB) rejoinClusterBootStrap() {
Contributor:

Will it make sense to also attempt to refresh nDB.bootStrapIP here through a call to something like GetRemoteAddressList in case the list has changed?

@dani-docker (Contributor, Author):

@fcrisciani and I discussed other use cases that can lead to the same "split cluster" and the possibility of re-checking/updating the bootstrap IPs (through GetRemoteAddressList):
1. Re-IP the managers.
2. Demote/promote managers or workers.

The first one is not an issue, since re-IPing all managers forces every node in the cluster to restart.
As for 2), customers will only hit it if they demote all managers in the cluster and restart those managers without restarting the workers... We felt this is a very rare edge case.

That said, if you think we should still refresh the bootStrapIP, we can add the logic to the newly introduced rejoinClusterBootStrap.


I think the two problems can be handled separately. With this PR, memberlist will honor the bootstrap IPs that were passed in at the beginning.
The second issue would need a periodic check of GetRemoteAddressList; if the list changes, the routine has to call networkDB.Join with the new bootstrap IPs.
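
A follow-up along those lines could look roughly like the sketch below; the watchBootStrapIPs helper and the getRemoteAddressList callback are hypothetical names introduced here for illustration, with NetworkDB.Join as the entry point the comment above refers to:

	// Hypothetical follow-up (not in this PR): periodically refresh the bootstrap
	// list from the caller and re-join if it changed, covering the re-IP and
	// promote/demote scenarios discussed above.
	func watchBootStrapIPs(ctx context.Context, nDB *networkdb.NetworkDB,
		getRemoteAddressList func() []string, interval time.Duration) {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()

		var last []string
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				current := getRemoteAddressList()
				// Assumes the list comes back in a stable order; a real
				// implementation would compare the sets instead.
				if !reflect.DeepEqual(current, last) {
					if err := nDB.Join(current); err != nil {
						logrus.Warnf("re-join with refreshed bootstrap list failed: %v", err)
					}
					last = current
				}
			}
		}
	}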

@dani-docker force-pushed the esc-532 branch 2 times, most recently from 0a10a03 to e0dd4ed on April 11, 2018 14:59
@ddebroy (Contributor) left a comment:

LGTM

		bootStrapIPs = append(bootStrapIPs, bootIP.String())
	}
	nDB.RUnlock()
	// Not all bootStrap nodes are in the cluster, call memberlist join
@ddebroy (Contributor):

I think you mean "None of the bootStrap nodes are in the cluster" rather than "Not all ..." in the comment, right?

@dani-docker (Contributor, Author):

yep. I will reword it and push the update.
Thx

@fcrisciani: LGTM; if you fix the comment that @ddebroy mentioned, we can merge.

Signed-off-by: Dani Louca <dani.louca@docker.com>
@dani-docker (Contributor, Author): Thx guys, the PR is updated with the latest change request.

@fcrisciani left a comment:

LGTM

	bootStrapIPs := make([]string, 0, len(nDB.bootStrapIP))
	for _, bootIP := range nDB.bootStrapIP {
		for _, node := range nDB.nodes {
			if node.Addr.Equal(bootIP) {


@dani-docker thinking more about this, I guess it's missing a check that the IP is different from the current node's IP, otherwise this fix won't work for the managers: every manager will see itself in the list and won't try to reconnect.
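
In code terms, the suggested guard would look roughly like the following (a sketch only; how the local node's address is obtained, here via nDB.config.AdvertiseAddr, is an assumption):

	// Sketch of the suggested guard. Assumption: the local node's address is
	// available via the NetworkDB config; exact field names may differ.
	myAddr := net.ParseIP(nDB.config.AdvertiseAddr)
	for _, bootIP := range nDB.bootStrapIP {
		if bootIP.Equal(myAddr) {
			// A manager always finds its own IP in the list; that proves
			// nothing about the split, so don't let it short-circuit the check.
			continue
		}
		for _, node := range nDB.nodes {
			if node.Addr.Equal(bootIP) {
				// A different bootstrap node is part of the cluster; no rejoin needed.
				nDB.RUnlock()
				return
			}
		}
		bootStrapIPs = append(bootStrapIPs, bootIP.String())
	}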
