This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

Supporting UnreachableIntermediateMasterWithLaggingReplicas #1005

Merged 6 commits on Nov 24, 2019


shlomi-noach
Collaborator

Fixes #999

This PR introduces the UnreachableIntermediateMasterWithLaggingReplicas analysis. As the name suggests, orchestrator makes this analysis when it cannot reach an intermediate master and, in addition, all of that intermediate master's replicas are lagging.
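The condition described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the `ReplicationAnalysis` struct, its fields, and the `Classify` function are hypothetical names chosen for the example, and the real analysis considers more signals than shown here.

```go
package main

import "fmt"

// ReplicationAnalysis is a hypothetical, simplified summary of one server
// that has replicas. It is not orchestrator's actual data structure.
type ReplicationAnalysis struct {
	IsMaster             bool // true for the topology's top-level master
	LastCheckValid       bool // false when the server could not be reached
	CountReplicas        int  // number of direct replicas
	CountLaggingReplicas int  // how many of those replicas are lagging
}

const (
	// NoMatch means neither of the two analyses sketched here applies.
	NoMatch                                          = "NoMatch"
	UnreachableMasterWithLaggingReplicas             = "UnreachableMasterWithLaggingReplicas"
	UnreachableIntermediateMasterWithLaggingReplicas = "UnreachableIntermediateMasterWithLaggingReplicas"
)

// Classify returns the "with lagging replicas" analysis when the server is
// unreachable and every one of its replicas is lagging.
func Classify(a ReplicationAnalysis) string {
	if a.LastCheckValid || a.CountReplicas == 0 {
		return NoMatch
	}
	if a.CountLaggingReplicas != a.CountReplicas {
		return NoMatch
	}
	if a.IsMaster {
		return UnreachableMasterWithLaggingReplicas
	}
	// A non-master server that has replicas is an intermediate master.
	return UnreachableIntermediateMasterWithLaggingReplicas
}

func main() {
	im := ReplicationAnalysis{IsMaster: false, LastCheckValid: false, CountReplicas: 3, CountLaggingReplicas: 3}
	fmt.Println(Classify(im))
}
```

The sketch requires all replicas to be lagging, matching the description: a single healthy replica means the condition does not hold.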

The remediation is similar to that of UnreachableMasterWithLaggingReplicas: orchestrator issues an emergency restart of the replication IO_thread on all replicas of said intermediate master.

In scenarios like the one depicted in #999, the replicas then quickly identify themselves as broken. The next failure detection by orchestrator is therefore expected to yield a DeadIntermediateMaster analysis and kick off a failover.

cc @jfg956

Development

Successfully merging this pull request may close these issues.

Orchestrator not detecting intermediate master failure with relay_log_space_limit.