Overview of the Issue
We upgraded a large database to v16 recently. During the rollout, errors were served to the app for ~30 seconds.
The root cause seems to be that the upgrade of the _vt sidecar schema during PlannedReparentShard was blocked by semi-sync.
Reproduction Steps
Upgrade a 3+ tablet cluster with semi-sync enabled from v15 to v16.
Binary Version
16.0.0+
Operating System and Environment details
Any
Log Fragments
I0626 20:33:46.338969 1 replication.go:586] Setting semi-sync mode: primary=true, replica=true
I0626 20:33:46.339255 1 query.go:81] exec SET GLOBAL rpl_semi_sync_master_enabled = 1, GLOBAL rpl_semi_sync_slave_enabled = 1
I0626 20:33:46.339689 1 tm_state.go:186] Changing Tablet Type: PRIMARY for cell:"redacted" uid:redacted
I0626 20:33:46.357886 1 syslogger.go:129] <redacted> [tablet] updated
I0626 20:33:46.371122 1 sidecardb.go:408] Applying DDL for table views:
CREATE TABLE IF NOT EXISTS `_vt`.`views` (
    `TABLE_SCHEMA` varchar(64) NOT NULL,
    `TABLE_NAME` varchar(64) NOT NULL,
    `CREATE_STATEMENT` longtext NOT NULL,
    `UPDATED_AT` timestamp NOT NULL DEFAULT current_timestamp() ON UPDATE current_timestamp(),
    PRIMARY KEY (`TABLE_SCHEMA`, `TABLE_NAME`)
) ENGINE InnoDB
I0626 20:33:48.215033 1 state_manager.go:682] Going unhealthy due to replication error: no replication status (errno 100) (sqlstate HY000)
I0626 20:34:16.356183 1 sidecardb.go:357] createSidecarDB: _vt
There is a 30-second gap here. What seems to have happened: because we enable semi-sync before transitioning the tablet to primary, the creation of the _vt schema gets blocked waiting for a semi-sync ACK. Replicas are pointed at the new primary only after the transition to primary, so no tablet is available to ACK the write. In the meantime, vtorc detects that the replicas are pointing at the wrong (old) primary, but can't do anything because PRS holds the shard lock. After 30 seconds the lock times out, vtorc fixes replication, and the DDL can proceed.
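The blocking behavior described above can be reproduced on a bare MySQL instance, independent of Vitess: with the semi-sync source plugin enabled and no semi-sync replica connected, a committed write stalls until the semi-sync timeout expires. This is a hedged sketch for illustration only; the exact timeout and plugin setup Vitess configures may differ.

```sql
-- Illustration: semi-sync with no replica available to ACK.
-- rpl_semi_sync_master_wait_no_slave defaults to ON, so the source
-- waits for an ACK even when no semi-sync replica is connected.
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
SET GLOBAL rpl_semi_sync_master_enabled = 1;
SET GLOBAL rpl_semi_sync_master_timeout = 30000;  -- 30 s, for illustration

-- This DDL now blocks at commit, waiting for a semi-sync ACK that
-- cannot arrive, until the timeout fires:
CREATE DATABASE IF NOT EXISTS `_vt`;
CREATE TABLE `_vt`.`example` (`id` int PRIMARY KEY);
```

In the incident, the role of the 30-second timeout was played by the shard lock timeout rather than rpl_semi_sync_master_timeout, since Vitess holds the write open until a replica is repointed and can ACK.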