Adding a recovery mechanism for a split gossip cluster #2134
Conversation
Codecov Report
@@            Coverage Diff            @@
##             master    #2134   +/-   ##
=========================================
  Coverage          ?   40.49%
=========================================
  Files             ?      139
  Lines             ?    22467
  Branches          ?        0
=========================================
  Hits              ?     9097
  Misses            ?    12031
  Partials          ?     1339
=========================================
Continue to review full report at Codecov.
networkdb/delegate.go
Outdated
@@ -397,7 +392,7 @@ func (d *delegate) GetBroadcasts(overhead, limit int) [][]byte {

func (d *delegate) LocalState(join bool) []byte {
	if join {
		// Update all the local node/network state to a new time to
		// Update all the local node/network state to a new time to
extra space :D
How did this happen... Fat finger :) I'll fix it.
networkdb/cluster.go
Outdated
@@ -203,7 +206,7 @@ func (nDB *NetworkDB) clusterJoin(members []string) error {

	if _, err := mlist.Join(members); err != nil {
		// In case of failure, keep retrying join until it succeeds or the cluster is shutdown.
-		go nDB.retryJoin(members, nDB.stopCh)
+		go nDB.retryJoin(nDB.ctx, members)
With this change I don't think launching this routine makes sense anymore. If there is a failure here, the next attempt will happen in 60s or less through the other codepath.
Good point, clusterJoin is only called when the agent is initialized to join the bootstrap nodes; unless it's time critical and we can't wait 60 seconds, it's now redundant. Happy to take it out.
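For reference, a minimal sketch of what a context-driven retryJoin along the lines of the diff above could look like; the retry interval, the logging, and the nDB.memberlist access are assumptions, not a quote of this PR:

// Sketch only: assumes this sits in the networkdb package alongside the excerpts above,
// and that the NetworkDB struct exposes its memberlist instance as nDB.memberlist.
func (nDB *NetworkDB) retryJoin(ctx context.Context, members []string) {
	t := time.NewTicker(10 * time.Second) // retry interval is an assumed value
	defer t.Stop()

	for {
		select {
		case <-ctx.Done():
			// Shutdown, or the caller's timeout expired: stop retrying.
			return
		case <-t.C:
			if _, err := nDB.memberlist.Join(members); err != nil {
				logrus.Errorf("failed to join memberlist %s on retry: %v", members, err)
				continue
			}
			return
		}
	}
}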
networkdb/cluster.go
Outdated
	ctx, cancel := context.WithTimeout(nDB.ctx, 10*time.Second)
	defer cancel()
	nDB.retryJoin(ctx, bootStrapIPs)
extra new line
networkdb/cluster.go
Outdated
	nDB.RUnlock()
	logrus.Debugf("rejoinClusterBootStrap, calling cluster join with bootStrap %v", bootStrapIPs)
	// All bootStrap nodes are not in the cluster, call memberlist join
	ctx, cancel := context.WithTimeout(nDB.ctx, 10*time.Second)
We should put this 10s timeout at the top, as a constant.
roger that
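A small sketch of what hoisting the 10s value to the top of cluster.go might look like; the constant name is hypothetical:

// Hypothetical constant name; the point is only to hoist the 10s literal to the top
// of the file instead of hard-coding it inline.
const rejoinClusterDuration = 10 * time.Second

// ...and at the call site inside rejoinClusterBootStrap:
ctx, cancel := context.WithTimeout(nDB.ctx, rejoinClusterDuration)
defer cancel()
nDB.retryJoin(ctx, bootStrapIPs)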
	}
	// If the node is not known from memberlist we cannot process save any state of it else if it actually
	// dies we won't receive any notification and we will remain stuck with it
	if _, ok := nDB.nodes[nEvent.NodeName]; !ok {
Is the nDB.findNode check necessary above [L28] given this check?
Only to filter based on the Lamport clock.
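In other words, roughly the following; a paraphrased sketch, not a quote of the diff, with field names taken from the surrounding excerpts:

// findNode is kept so the stored Lamport time can be compared against the
// incoming event's clock and stale events can be dropped.
n := nDB.findNode(nEvent.NodeName)
if n == nil {
	return false
}
if n.ltime >= nEvent.LTime {
	// Stale or duplicate event; our state is already at least as recent.
	return false
}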
// rejoinClusterBootStrap is called periodically to check if all bootStrap nodes are active in the cluster,
// if not, call the cluster join to merge 2 separate clusters that are formed when all managers
// stopped/started at the same time
func (nDB *NetworkDB) rejoinClusterBootStrap() {
Would it make sense to also attempt to refresh nDB.bootStrapIP here, through a call to something like GetRemoteAddressList, in case the list has changed?
@fcrisciani and I had a discussion about other use cases that can lead us to the same "split cluster" and the possibility of re-checking/updating the bootStrap IPs (through GetRemoteAddressList):
1- Re-IP the managers.
2- Demote/promote managers/workers.
The first one is not an issue, as re-IPing all managers will force all nodes in the cluster to restart.
As for 2), customers will only hit it if they demote all managers in the cluster and restart those managers without restarting the workers. We felt this is a very rare edge case.
That being said, if you guys think we should still refresh the bootStrapIP, we can add the logic to the newly introduced rejoinClusterBootStrap.
I think the 2 problems can be handled separately. Memberlist with this PR will honor the bootstrapIP that got passed at the beginning. The second issue will need a periodic check of GetRemoteAddressList, and if the list changes, the routine has to call networkDB.Join with the new bootstrapIPs.
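A rough sketch of what that second, separate periodic check could look like on the caller's side; GetRemoteAddressList is referenced as in the discussion, but the function, names, and interval below are assumptions (assumes imports of context, reflect, time, and the networkdb package):

// refreshBootstrapLoop is a hypothetical caller-side routine, not part of this PR.
// remoteAddrs would be backed by something like GetRemoteAddressList on the agent side.
func refreshBootstrapLoop(ctx context.Context, nDB *networkdb.NetworkDB, remoteAddrs func() []string, current []string) {
	t := time.NewTicker(60 * time.Second) // check interval is an assumed value
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			latest := remoteAddrs()
			// Order-sensitive comparison; good enough for a sketch.
			if !reflect.DeepEqual(latest, current) {
				// The manager list changed: re-join with the new bootstrap IPs.
				if err := nDB.Join(latest); err == nil {
					current = latest
				}
			}
		}
	}
}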
Commits 0a10a03 to e0dd4ed
LGTM
networkdb/cluster.go
Outdated
		bootStrapIPs = append(bootStrapIPs, bootIP.String())
	}
	nDB.RUnlock()
	// Not all bootStrap nodes are in the cluster, call memberlist join
I think you mean "None of the bootStrap nodes are in the cluster" rather than "Not all ..." in the comment, right?
Yep, I will reword it and push the update. Thx
LGTM, if you fix the comment that @ddebroy mentioned we can merge.
Signed-off-by: Dani Louca <dani.louca@docker.com>
Thx guys. PR is updated with the latest change request.
LGTM
	bootStrapIPs := make([]string, 0, len(nDB.bootStrapIP))
	for _, bootIP := range nDB.bootStrapIP {
		for _, node := range nDB.nodes {
			if node.Addr.Equal(bootIP) {
@dani-docker thinking more about this, I guess it's missing a check that the IP is different from the current node's IP, else this fix won't work for the managers: every manager will see itself in the list and won't try to reconnect.
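A minimal sketch of the self-exclusion check being suggested; localAddr is a stand-in for however the node's own advertise address is obtained, which is an assumption:

// Paraphrased from the excerpt above, with the suggested self-check added.
// localAddr stands for this node's own advertise IP (how it is looked up is assumed).
bootStrapIPs := make([]string, 0, len(nDB.bootStrapIP))
for _, bootIP := range nDB.bootStrapIP {
	if bootIP.Equal(localAddr) {
		// This bootstrap entry is the current node itself; a manager must not count
		// its own address, or it will never attempt to reconnect.
		continue
	}
	for _, node := range nDB.nodes {
		if node.Addr.Equal(bootIP) {
			// A remote bootstrap node is already part of the cluster; nothing to do.
			nDB.RUnlock()
			return
		}
	}
	bootStrapIPs = append(bootStrapIPs, bootIP.String())
}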
Steps To Reproduce:

Expected Results:

Actual Results:
... overlay ... and unpredicted networking issues between containers on the 2 different clusters.

Workaround:
... or the /join endpoint.

PR fix:
This PR adds a go routine that runs every minute and checks that at least one node from the bootStrap list (ie: manager nodes) is part of the cluster; if we couldn't find any, then we attempt a /join for 10 seconds.
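A compact sketch of how such a once-a-minute check could be driven; the interval constant, the method name, and the stop-channel wiring are assumptions rather than the exact code in this PR:

// Sketch only: assumes it lives in the networkdb package next to rejoinClusterBootStrap.
const rejoinClusterInterval = 60 * time.Second // hypothetical name for the 1-minute period

func (nDB *NetworkDB) periodicRejoin(stopCh <-chan struct{}) {
	t := time.NewTicker(rejoinClusterInterval)
	defer t.Stop()
	for {
		select {
		case <-stopCh:
			return
		case <-t.C:
			// If none of the bootstrap (manager) nodes is still a member, this triggers
			// the bounded /join attempt described above.
			nDB.rejoinClusterBootStrap()
		}
	}
}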
Note:
While testing, I ran into an issue that I couldn't reliably reproduce: a leave event was followed by a false-positive join event received by a worker while the manager node was down. This caused an inconsistency in nDB.nodes, which made the logic in this PR fail. @fcrisciani recommended the below change and he'll be looking into the problem in more detail.
Signed-off-by: Dani Louca <dani.louca@docker.com>