Set orc maintenance mode on tablet that is being promoted by PRS #8859
Description
If you're using orchestrator with the `RecoverNonWriteableMaster` flag enabled, there is a small window of time in which the tablet being promoted is already the primary but its managed mysqld has not yet had read-only mode turned off. If orchestrator performs the non-writable-master check during that window, the shard gets into an unhealthy state.

This can occur because today we set the maintenance mode in orchestrator at the tablet level, in `DemotePrimary` and `StopReplication`, but skip it in `RepairReplication` when orchestrator is in an active recovery, which it is whenever orchestrator engages the `RecoverNonWriteableMaster` behavior.

So during a `PlannedReparentShard`, the original primary that is being demoted is set in maintenance mode twice (in `DemotePrimary` and `StopReplication`). While the other tablet is being promoted, orchestrator sees that its managed mysqld is still in read-only mode and begins an orchestrator recovery. That sets the ActiveShardRecovery state to true, which in turn causes the maintenance mode to NOT be set on the other tablets during the tablet repair, and the replication repair work is skipped.

This PR addresses the bug by also setting the tablet being promoted to maintenance mode, thus avoiding the chance of an errant orchestrator-driven recovery in the middle of the Vitess reparenting.
(We then later end maintenance on all tablets in the shard.)
Related Issue(s)
Checklist