Set orc maintenance mode on tablet that is being promoted by PRS #8859

Merged
merged 2 commits into from
Oct 18, 2021

Conversation

mattlord
Contributor

@mattlord mattlord commented Sep 21, 2021

Description

If you're using orchestrator with the RecoverNonWriteableMaster flag enabled, there's a small window of time where the tablet being promoted is already the primary but its managed mysqld has not yet had read-only mode turned off. If orchestrator performs the non-writable-master check during that window, the shard gets into an unhealthy state.

This can occur because today we set maintenance mode in orchestrator at the tablet level, in DemotePrimary and StopReplication, but skip it in RepairReplication when orchestrator is in an active recovery, which is exactly the case when orchestrator engages the RecoverNonWriteableMaster behavior.

So during a PlannedReparentShard, the original primary that's being demoted is set in maintenance mode twice (in DemotePrimary and StopReplication). While the other tablet is being promoted, orchestrator sees its managed mysqld still in read-only mode and begins an orchestrator recovery. That sets the ActiveShardRecovery state to true, which in turn causes maintenance mode to NOT be set on the other tablets during the tablet repair, and the replication repair work is skipped.

This PR addresses the bug by also setting maintenance mode on the tablet being promoted, thus avoiding the chance of an errant orchestrator-driven recovery in the middle of the Vitess reparenting.

(We then later end maintenance on all tablets in the shard.)
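
To make the race concrete, here is a minimal, self-contained Go sketch of the sequence described above. It is a toy model, not the Vitess or orchestrator code: `maintenanceTracker`, `wouldRecover`, `plannedReparent`, and the tablet aliases are all hypothetical names. It assumes, as the description implies, that orchestrator does not start a non-writable-primary recovery for an instance that is in maintenance mode.

```go
package main

import "fmt"

// maintenanceTracker is a hypothetical in-memory stand-in for orchestrator's
// per-instance maintenance state. It is not a real Vitess or orchestrator type.
type maintenanceTracker struct {
	inMaintenance map[string]bool
}

func newMaintenanceTracker() *maintenanceTracker {
	return &maintenanceTracker{inMaintenance: make(map[string]bool)}
}

func (m *maintenanceTracker) begin(tablet string) { m.inMaintenance[tablet] = true }
func (m *maintenanceTracker) end(tablet string)   { delete(m.inMaintenance, tablet) }

// wouldRecover models the non-writable-primary check (an assumption of this
// sketch): a recovery is triggered only for a primary whose mysqld is still
// read-only and which is not in maintenance mode.
func (m *maintenanceTracker) wouldRecover(tablet string, isPrimary, readOnly bool) bool {
	return isPrimary && readOnly && !m.inMaintenance[tablet]
}

// plannedReparent walks through the reparent window and reports whether an
// errant recovery would have fired. protectPromoted toggles the behavior this
// PR adds: also placing the tablet being promoted into maintenance mode.
func plannedReparent(m *maintenanceTracker, shard []string, oldPrimary, newPrimary string, protectPromoted bool) bool {
	// The demoted primary is already covered (DemotePrimary/StopReplication).
	m.begin(oldPrimary)
	if protectPromoted {
		m.begin(newPrimary)
	}

	// The window: newPrimary is the primary, but read-only is not yet off.
	errant := m.wouldRecover(newPrimary, true, true)

	// Promotion completes; maintenance is later ended on all tablets in the shard.
	for _, tablet := range shard {
		m.end(tablet)
	}
	return errant
}

func main() {
	shard := []string{"zone1-100", "zone1-101", "zone1-102"}
	for _, protect := range []bool{false, true} {
		m := newMaintenanceTracker()
		errant := plannedReparent(m, shard, "zone1-100", "zone1-101", protect)
		fmt.Printf("protectPromoted=%v errantRecovery=%v stillInMaintenance=%d\n",
			protect, errant, len(m.inMaintenance))
	}
}
```

With `protectPromoted` set to false the model reports an errant recovery during the window; with it set to true the check is suppressed, and in both cases no tablet is left in maintenance afterwards.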

Related Issue(s)

Checklist

  • Should this PR be backported?
  • Tests were added or are not required
  • Documentation was added or is not required

@mattlord mattlord requested a review from sougou September 21, 2021 19:22
@mattlord mattlord self-assigned this Sep 21, 2021
@mattlord mattlord requested a review from gedgar September 21, 2021 19:28
@mattlord mattlord force-pushed the SetOrcMaintOnPromoteReplica branch 3 times, most recently from 115dc85 to 8291b4f Compare September 21, 2021 22:04
@mattlord mattlord marked this pull request as draft September 22, 2021 02:20
@mattlord mattlord force-pushed the SetOrcMaintOnPromoteReplica branch 2 times, most recently from 8348315 to a4e1b9d Compare October 7, 2021 17:45
Member

@deepthi deepthi left a comment

Seems reasonable to me. We can approve/merge once it is ready for review.

@mattlord mattlord marked this pull request as ready for review October 11, 2021 21:22
@mattlord mattlord requested a review from deepthi October 11, 2021 21:23
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord force-pushed the SetOrcMaintOnPromoteReplica branch from a4e1b9d to b6f0f4e Compare October 11, 2021 21:38
Contributor

@sougou sougou left a comment

Don't see any harm in adding this.

@deepthi deepthi merged commit a7d542f into vitessio:main Oct 18, 2021
@deepthi deepthi deleted the SetOrcMaintOnPromoteReplica branch October 18, 2021 21:03

4 participants