Overview of the Issue
With one primary and one replica in a shard, killing the primary leads vtorc to first detect DeadPrimary. It does not intervene with EmergencyReparentShard (ERS), because it cannot find a candidate that would still satisfy the semi_sync durability policy. However, it soon transitions to detecting ClusterHasNoPrimary (likely once the primary server has terminated completely) and then runs PlannedReparentShard (PRS) on the one remaining replica. The result is one primary and no replicas, which is not a functional state under semi_sync. vtorc then detects LockedSemiSyncPrimary and cannot recover from it.
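To make the end state concrete, here is a minimal sketch of why a lone primary is stuck under semi_sync. It is a simplified model, not Vitess code: the constant `requiredSemiSyncAckers` and the function `primaryCanCommit` are hypothetical names, and the model assumes every healthy replica can send a semi-sync acknowledgement.

```go
package main

import "fmt"

// requiredSemiSyncAckers models the semi_sync durability policy, under which
// a primary waits for one replica acknowledgement before a commit completes.
// The name is hypothetical and used only for this illustration.
const requiredSemiSyncAckers = 1

// primaryCanCommit reports whether a primary with the given number of healthy
// replicas could complete a write in this simplified model.
func primaryCanCommit(healthyReplicas int) bool {
	return healthyReplicas >= requiredSemiSyncAckers
}

func main() {
	// Before the failure: one primary, one replica.
	fmt.Println(primaryCanCommit(1)) // true
	// After the last replica is promoted: one primary, zero replicas.
	fmt.Println(primaryCanCommit(0)) // false
}
```

In this model the promoted tablet has no one left to acknowledge its writes, which matches the LockedSemiSyncPrimary state described above.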
One workaround I have found is a patch that blocks PRS from choosing a new primary, similar to this check: if `len(validTablets) == 1 && !currentlyHasAPrimary`, return nothing.
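A rough sketch of what such a guard could look like follows. Only the condition `len(validTablets) == 1 && !currentlyHasAPrimary` comes from the description above. The `Tablet` type, the `chooseNewPrimary` function, and the "pick the first candidate" fallback are hypothetical placeholders and do not reflect the actual Vitess election code.

```go
package main

import "fmt"

// Tablet is a hypothetical stand-in for a tablet record; it is not the Vitess type.
type Tablet struct {
	Alias string
}

// chooseNewPrimary is a hypothetical election helper. It returns nil when
// promotion should be skipped, otherwise the chosen candidate.
func chooseNewPrimary(validTablets []*Tablet, currentlyHasAPrimary bool) *Tablet {
	// Guard from the workaround: with a single valid tablet and no current
	// primary, promoting it would leave a primary with no replicas.
	if len(validTablets) == 1 && !currentlyHasAPrimary {
		return nil
	}
	if len(validTablets) == 0 {
		return nil
	}
	// Placeholder selection logic: take the first candidate.
	return validTablets[0]
}

func main() {
	lone := []*Tablet{{Alias: "replica-1"}}
	fmt.Println(chooseNewPrimary(lone, false) == nil) // true: promotion is skipped

	pair := []*Tablet{{Alias: "replica-1"}, {Alias: "replica-2"}}
	fmt.Println(chooseNewPrimary(pair, false).Alias) // replica-1
}
```

With this guard, the shard stays without a primary until another tablet becomes available, instead of ending up with a primary that cannot get semi-sync acknowledgements.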
Reproduction Steps
1. Set up a keyspace/shard with one primary and one replica.
2. Kill the primary (e.g. `mysql -- pkill -9 mysqld`, then shut down the server abruptly).
3. Watch as vtorc first detects DeadPrimary, then ClusterHasNoPrimary, and promotes the replica to primary.
Binary Version
Vitess 16
Operating System and Environment details
CentOS 8, Linux 5.4.141-hs22.el8.x86_64
Log Fragments
No response
@olyazavr I don't have time right now to be able to fix this issue, but if you want to take it up, I'd be happy to review the PR. I can review any pseudocode/ideas that you want to talk about too.