Instance check leads to UnreachableMaster (LastCheckValid: false) if instance check takes longer than 1s #1367

binwiederhier · 2021-06-15T20:28:01Z

Our MySQL hosts are under considerable load, to the point that quite frequently, the analysis returns LastCheckValid: false, even though the host is up and responsive:

analysis: ClusterName: ...:3306, IsMaster: true, LastCheckValid: false, LastCheckPartialSuccess: true, 
CountReplicas: 2, CountValidReplicas: 2, CountValidReplicatingReplicas: 2, CountLaggingReplicas: 0, 
CountDelayedReplicas: 0, CountReplicasFailingToConnectToMaster: 0

After looking at Wireshark dumps, it seems that our hosts sometimes take slightly longer than 1s to perform these checks (2-3s, sometimes even longer), which is arguably not great, but still ok for us. From my understanding of the code, if a check takes longer than 1s (hardcoded), Orchestrator considers the host to be down, which may lead to emergent actions and eventually to failovers if those fail.

I've deduced this from these parts of the code:

I propose making this 1s timeout configurable as ReasonableInstanceCheckSeconds.

Did I get all of this right? What do you think about the proposal? I am happy to provide all the info that's needed here.

shlomi-noach · 2021-06-16T05:28:29Z

This makes sense. I remember going in the opposite direction: this used to be configurable and was moved to a constant in an attempt to reduce variance and configuration complexity. But honestly, my memory may not serve me well. Thank you for submitting a PR!

binwiederhier · 2021-06-16T20:44:45Z

Closing this as the PR is merged.

binwiederhier mentioned this issue Jun 15, 2021

ReasonableInstanceCheckSeconds #1368

Merged

binwiederhier closed this as completed Jun 16, 2021
