Skip to content
This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

improving and fixing failure detection #196

Merged
merged 7 commits into from
Jun 8, 2017
Merged

Conversation

shlomi-noach
Copy link
Collaborator

This PR will improve upon the failure detection process, by:

  • prioritizing detection by severity
  • more useful output

@shlomi-noach shlomi-noach deployed to production/github-mysqlutil June 6, 2017 08:01 Active
@shlomi-noach shlomi-noach deployed to production/github-mysqlutil June 6, 2017 14:04 Active
@shlomi-noach shlomi-noach deployed to production/github-mysqlutil June 6, 2017 16:05 Active
@shlomi-noach shlomi-noach deployed to production/github-mysqlutil June 6, 2017 16:08 Active
if topologyRecovery != nil {
promotedReplicaKey = topologyRecovery.SuccessorKey
}
} else {
go executeCheckAndRecoverFunction(analysisEntry, candidateInstanceKey, false, skipProcesses)
analysisEntry := analysisEntry
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the proper behavior here would be to pass analysisEntry into the goroutine:

go func(analysisEntry inst.ReplicationAnalysis) {
    // call things in here
}(analysisEntry)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nickvanw thank you; I became accustomed to this paradigm following https://golang.org/doc/effective_go.html#channels (req := req).
I thought at first it looked funny, but then it is also very explicit and clear.

@shlomi-noach shlomi-noach temporarily deployed to production June 7, 2017 05:12 Inactive
@shlomi-noach shlomi-noach deployed to production/github-mysqlutil June 7, 2017 05:27 Active
@shlomi-noach
Copy link
Collaborator Author

Time to elaborate. This PR fixes a subtle bug which does not manifest consistently. The subtle fix is on https://github.com/github/orchestrator/pull/196/files#diff-86933c5afedc1f1a02e011c53af7b039R1366

The executeCheckAndRecoverFunction() function was called within a goroutine even while analysisEntry was iterating replicationAnalysis, which means by the time the goroutine actually executed, analysisEntry would have switched to a/the next value.

This doesn't manifest in a consistent manner because of the nature of unpredictable concurrency. It did manifest more on prior array entries than on latter array entries.

When this PR is merged (being hammered right now) I'll release a version as this is a critical recovery fix. I'm surprised it hasn't surfaced sooner, to be honest.

@shlomi-noach
Copy link
Collaborator Author

This PR adds a distinction between an actionably-recoverable problem (e.g. DeadMaster) and one that hasn't got a recovering action (e.g. UnreachableMaster).

It allows orchestrator to register up to one actionable and one non-actionable entry in the backend topology_failure_detection table.

Detection is known to be of an escalating nature. A dying master may for example first be diagnosed as UnreachableMaster and then as DeadMaster. While recoveries work for such escalations (pending the subtle fix in this PR), detection only shows the first met analysis, and this causes for lacking visibility.

With the changes in this PR we will get more information via topology_failure_detection and, as result, via /api/audit-failure-detection and /web/audit-failure-detection.

@shlomi-noach
Copy link
Collaborator Author

an also more logging to be able to figure out what's up with detection at any step. And logging is Info rather than Debug because it's important™.

@shlomi-noach shlomi-noach deployed to production/github-mysqlutil June 7, 2017 11:53 Active
@shlomi-noach
Copy link
Collaborator Author

Recent commit also intentionally changes recovery iteration to random order.

@shlomi-noach shlomi-noach merged commit a74c07a into master Jun 8, 2017
@shlomi-noach shlomi-noach deleted the detection-fixes branch June 8, 2017 06:59
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants