-
Notifications
You must be signed in to change notification settings - Fork 937
improving and fixing failure detection #196
Conversation
go/logic/topology_recovery.go
Outdated
if topologyRecovery != nil { | ||
promotedReplicaKey = topologyRecovery.SuccessorKey | ||
} | ||
} else { | ||
go executeCheckAndRecoverFunction(analysisEntry, candidateInstanceKey, false, skipProcesses) | ||
analysisEntry := analysisEntry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the proper behavior here would be to pass analysisEntry
into the goroutine:
go func(analysisEntry inst.ReplicationAnalysis) {
// call things in here
}(analysisEntry)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nickvanw thank you; I became accustomed to this paradigm following https://golang.org/doc/effective_go.html#channels (req := req
).
I thought at first it looked funny, but then it is also very explicit and clear.
Time to elaborate. This PR fixes a subtle bug which does not manifest consistently. The subtle fix is on https://github.com/github/orchestrator/pull/196/files#diff-86933c5afedc1f1a02e011c53af7b039R1366 The This doesn't manifest in a consistent manner because of the nature of unpredictable concurrency. It did manifest more on prior array entries than on latter array entries. When this PR is merged (being hammered right now) I'll release a version as this is a critical recovery fix. I'm surprised it hasn't surfaced sooner, to be honest. |
This PR adds a distinction between an actionably-recoverable problem (e.g. It allows Detection is known to be of an escalating nature. A dying master may for example first be diagnosed as With the changes in this PR we will get more information via |
an also more logging to be able to figure out what's up with detection at any step. And logging is |
Recent commit also intentionally changes recovery iteration to random order. |
This PR will improve upon the failure detection process, by: