Bug Report: vtorc goes from a DeadPrimary -> ClusterHasNoPrimary; promotes final replica #13284

Closed
olyazavr opened this issue Jun 9, 2023 · 5 comments · Fixed by #13587
Labels
Component: VTorc Vitess Orchestrator integration Type: Bug

Comments

@olyazavr
Contributor

olyazavr commented Jun 9, 2023

Overview of the Issue

In a shard with one primary and one replica, killing the primary causes vtorc to first detect DeadPrimary. It does not run ERS (EmergencyReparentShard), because it cannot find a candidate that would still satisfy the semi_sync durability policy. Shortly afterwards, likely once the primary server has terminated completely, vtorc detects ClusterHasNoPrimary instead and runs PRS (PlannedReparentShard) on the last remaining replica. The shard is left with one primary and no replicas, which is not functional under semi_sync because no replica remains to acknowledge writes. vtorc then detects LockedSemiSyncPrimary and cannot recover from it.

One workaround I have found is a patch that blocks PRS from choosing a new primary, similar to this check: if len(validTablets) == 1 && !currentlyHasAPrimary, return nothing.
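
The guard described above can be sketched roughly as follows. This is a minimal, standalone illustration and not the actual Vitess code: the Tablet type and the filterPromotionCandidates function are hypothetical names introduced here, and only the condition len(validTablets) == 1 && !currentlyHasAPrimary comes from the description above.

```go
package main

import "fmt"

// Tablet is a hypothetical stand-in for a promotion candidate.
// The real Vitess types carry far more information.
type Tablet struct {
	Alias string
}

// filterPromotionCandidates is a hypothetical helper illustrating the workaround.
// When exactly one valid tablet remains and the shard currently has no primary,
// it returns nil so that the caller has nothing to promote. Promoting that last
// tablet would leave a primary with no replica to acknowledge semi-sync writes.
func filterPromotionCandidates(validTablets []Tablet, currentlyHasAPrimary bool) []Tablet {
	if len(validTablets) == 1 && !currentlyHasAPrimary {
		return nil
	}
	return validTablets
}

func main() {
	lone := []Tablet{{Alias: "replica-1"}}
	pair := []Tablet{{Alias: "replica-1"}, {Alias: "replica-2"}}

	// Last replica, no primary: promotion is blocked.
	fmt.Println(len(filterPromotionCandidates(lone, false)))
	// Two replicas, no primary: both remain eligible.
	fmt.Println(len(filterPromotionCandidates(pair, false)))
	// One replica while a primary still exists: unchanged.
	fmt.Println(len(filterPromotionCandidates(lone, true)))
}
```

The trade-off in this sketch is that the shard stays without a primary until an operator intervenes or another replica becomes available, rather than ending up with a primary that blocks on semi-sync acknowledgements.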

Reproduction Steps

  1. Set up a keyspace/shard with one primary and one replica
  2. Kill that primary (e.g., mysql -- pkill -9 mysqld, then shut down the server abruptly)
  3. Watch as vtorc first finds DeadPrimary, then ClusterHasNoPrimary, and promotes the replica to primary

Binary Version

Vitess 16

Operating System and Environment details

CentOS 8, Linux 5.4.141-hs22.el8.x86_64

Log Fragments

No response

@olyazavr olyazavr added Needs Triage This issue needs to be correctly labelled and triaged Type: Bug labels Jun 9, 2023
@GuptaManan100
Member

Could you share the VTOrc and vttablet logs? I couldn't reproduce the problem locally following the steps listed.

@olyazavr
Contributor Author

vtorc logs: https://gist.github.com/olyazavr/a42483e8755add6a6ccee3aee153682a (omitted slack hook content for brevity)
vttablet logs: https://gist.github.com/olyazavr/57a450705eb6c7afcf3e9856745e3a06

@mattlord mattlord added Component: VTorc Vitess Orchestrator integration and removed Needs Triage This issue needs to be correctly labelled and triaged labels Jul 5, 2023
@mattlord
Contributor

mattlord commented Jul 5, 2023

cc @vitessio/vtorc

@GuptaManan100
Member

@olyazavr I don't have time to fix this issue right now, but if you want to take it up, I'd be happy to review the PR. I can also review any pseudocode or ideas you want to discuss.

@GuptaManan100
Member

@olyazavr Never mind, I started working on this today. Draft PR - #13587
