Skip to content
This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

Introducing RecoverNonWriteableMaster flag #1332

Merged
merged 10 commits into from
Apr 13, 2021

Conversation

shlomi-noach
Copy link
Collaborator

Fixes #1328

Replaces #1330

This PR introduces RecoverNonWriteableMaster, bool, defaults false for backward compatibility. If enabled, then a scenario where a cluster's primary is found to be read-only is treated as a failure analysis: we reuse NoWriteableMasterStructureWarning given some additional constraints.

Furthermore, when RecoverNonWriteableMaster is enabled, then NoWriteableMasterStructureWarning is an actionable failure, and orchestrator proceeds to run a recovery. The recovery is to disable read_only (and super_read_only, assuming UseSuperReadOnly is also enabled).

Note master terminology is subject to change in a future major release.

…ysis and recovery

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…tion treating it as error

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Collaborator Author

Fixes #865

@liortamari
Copy link

@shlomi-noach thanks a lot for this.
May I ask when the plan is to merge and release? I want to know if I would need to build my own version to fit the project's timeline.

@shlomi-noach
Copy link
Collaborator Author

@liortamari I estimate a week or so. There's no blockers per-se, just getting feedback from users. Were you able to try it out?

@liortamari
Copy link

@shlomi-noach yes it does work on my env.
I wonder though, why, in the audit-recovery, I don't have

No PreFailoverProcesses hooks to run

but i do have

No PostFailoverProcesses hooks to run

@shlomi-noach
Copy link
Collaborator Author

@liortamari is that question specific to this PR/change? I expect it is not because the hooks logic is generic for all recovery types. If you can confirm that, then please open a new issue with said question. While we're here, check your PostFailoverProcesses configuration. If it's non-empty, then everything makes sense.

@liortamari
Copy link

@shlomi-noach I believe it is specific to this case, but if I am mistaken let me know, and I will open a new issue.
To verify my question i did the following test:

  • checkout the branch
  • edit the orchestrator-ci-env.conf.json with "RecoverNonWriteableMaster": true
  • run the CI script with 'scripts/dock system'
  • run 'orchestrator-client -c force-master-failover -a ci' (cause 10112 promotion)
  • run 'mysql -S /tmp/mysql_sandbox10112.sock --user=ci --password=ci -e "set global read_only=1;"'
  • run 'cat /tmp/recovery.log'

This is the output I got:
DeadMaster did trigger both the PreFailoverProcesses ("Will recover from...") and PostFailoverProcessesso.
But NoWriteableMasterStructureWarning triggered only PostFailoverProcesses.
That said I don't plan to use any hooks currently but it is a bit counter-intuitive so i just want to make sure everything works as planned.

Detected DeadMaster on 127.0.0.1:10111. Affected replicas: 3
Will recover from DeadMaster on 127.0.0.1:10111
Recovered from DeadMaster on 127.0.0.1:10111. Failed: 127.0.0.1:10111; Promoted: 127.0.0.1:10112
(for all types) Recovered from DeadMaster on 127.0.0.1:10111. Failed: 127.0.0.1:10111; Successor: 127.0.0.1:10112
Detected NoWriteableMasterStructureWarning on 127.0.0.1:10112. Affected replicas: 2
(for all types) Recovered from NoWriteableMasterStructureWarning on 127.0.0.1:10112. Failed: 127.0.0.1:10112; Successor: 127.0.0.1:10112

@shlomi-noach
Copy link
Collaborator Author

@liortamari I appreciate your analysis of the flow. You are of course correct. PreFailoverProcesses execution isn't so generic after all. Fixing.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Collaborator Author

PreFailoverProcesses now properly executed before NoWriteableMasterStructureWarning failover.

@liortamari
Copy link

Tested the updated branch.
Works as expected.
Thank you.

@llintes
Copy link

llintes commented Apr 13, 2021

We have done the following tests:

  • Set read_only or super_read_only = ON on the primary. Confirmed that NoWriteableMasterStructureWarning failure was detected and recover was started.
  • Killed MySQL task (mysqld) and let mysqld_safe restart it fast. DeadMaster recovery is not detected/triggered but on MySQL primary restart the NoWriteableMasterStructureWarning is detected and corresponding recovery triggered.
    The end result for both cases is that primary becomes writable when RecoverNonWriteableMaster Orchestrator config parameter is set to true and primary MySQL instance is set to RO. It worked for both read_only and super_read_only mysql config params
  • I have seen only one case during testing where as a result of a rapid restart of mysqld instance, it still detected a RecoverNonWriteableMaster but in combination with a DeadMaster scenario detected, it chose a different successor. So, the result was a failover which is OK considering that the main issue here was a DeadMaster scenario.
  • Posting below the recovery steps for RecoverNonWriteableMaster event:
    2021-04-13T13:32:06Z	Running 1 PreFailoverProcesses hooks
    2021-04-13T13:32:06Z	Running PreFailoverProcesses hook 1 of 1: echo 'Will recover from 
     NoWriteableMasterStructureWarning on MASTER_NODE:PORT' >> /tmp/recovery.log
    2021-04-13T13:32:06Z	Completed PreFailoverProcesses hook 1 of 1 in 7.786051ms
    2021-04-13T13:32:06Z	done running PreFailoverProcesses hooks
    2021-04-13T13:32:06Z	Running 1 PostFailoverProcesses hooks
    2021-04-13T13:32:06Z	Running PostFailoverProcesses hook 1 of 1: echo '(for all types) Recovered from 
     NoWriteableMasterStructureWarning on MASTER_NODE:PORT. Failed: MASTER_NODE:PORT; Successor: 
      'MASTER_NODE:PORT' >> /tmp/recovery.log 
    2021-04-13T13:32:06Z	Completed PostFailoverProcesses hook 1 of 1 in 5.274265ms
    2021-04-13T13:32:06Z	done running PostFailoverProcesses hooks
    2021-04-13T13:32:06Z	Waiting for 0 postponed functions
    2021-04-13T13:32:06Z	Executed 0 postponed functions
     ```   
    

@shlomi-noach
Copy link
Collaborator Author

@llintes thank you for the feedback! I read that as good results. In the mixed DeadMaster + RecoverNonWriteableMaster, the DeadMaster analysis takes precedence and the behavior is expected.

@shlomi-noach shlomi-noach merged commit 96533fd into master Apr 13, 2021
@shlomi-noach shlomi-noach deleted the recover-non-writeable-master branch April 13, 2021 15:32
@yotles
Copy link

yotles commented May 26, 2021

@shlomi-noach Hi. Could you please describe when it planning for this feature get in release? Thanks.

@shlomi-noach
Copy link
Collaborator Author

Good reminder, I should publish a release!

@shlomi-noach
Copy link
Collaborator Author

Please see https://github.com/openark/orchestrator/releases/tag/v3.2.5

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Orchestrator to monitor RW status of the primary node.
4 participants