This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

Orchestrator prematurely promotes lagging replica when configured with MariaDB GTID #1260

Closed
pedroalb opened this issue Nov 2, 2020 · 2 comments · Fixed by #1264

pedroalb commented Nov 2, 2020

Hi!

(This issue was created after a follow-up discussion on the MySQL Community Slack: https://mysqlcommunity.slack.com/archives/C9AB5JVNG/p1604228957063100 )

I am setting up automatic failovers on MariaDB and chose orchestrator as the tool to help me there. In my tests so far it's working great. However, when testing data loss on failovers, I can't get FailMasterPromotionIfSQLThreadNotUpToDate or DelayMasterPromotionIfSQLThreadNotUpToDate to work.
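For context, the relevant part of my orchestrator.conf.json looks roughly like this (a sketch; values are illustrative, and as I understand it orchestrator treats these two options as mutually exclusive, so only one should be enabled at a time):

```json
{
  "DelayMasterPromotionIfSQLThreadNotUpToDate": true,
  "FailMasterPromotionIfSQLThreadNotUpToDate": false
}
```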

Test environment:

  • primary with GTID and semi-sync enabled (semi-sync timeout set large enough to effectively never expire; I prefer being down to losing data).
  • secondaries with GTID and semi-sync enabled (relay_log_purge=0 and relay_log_recovery=0).
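For reference, a sketch of the my.cnf settings behind that setup, assuming MariaDB's built-in semi-sync support (10.3+); the timeout value is illustrative:

```ini
[mariadb]
# Primary side: block commits until a replica acknowledges, effectively forever
rpl_semi_sync_master_enabled = ON
rpl_semi_sync_master_timeout = 100000000

# Replica side
rpl_semi_sync_slave_enabled  = ON
relay_log_purge              = 0
relay_log_recovery           = 0
```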

One of the tests is meant to prove that orchestrator lets the candidate replica apply all of its relay logs before resetting the replica configuration and promoting it as primary.
So I have sysbench running on a fourth node with enough workload to make the replicas lag. Once they are lagging by about 30 seconds, I shut down the primary node.
Once the primary node is shut down, orchestrator promotes a replica but neither lets it apply all the relay logs first (with DelayMasterPromotionIfSQLThreadNotUpToDate) nor aborts the failover (with FailMasterPromotionIfSQLThreadNotUpToDate).

It seems that when master_use_gtid is set to current_pos (or slave_pos) and both replication threads (IO and SQL) are restarted, the IO thread purges all relay logs and starts pulling from the last position the SQL thread has applied.
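The behavior can be observed directly on a lagging replica, outside of orchestrator (a sketch; adjust for your topology):

```sql
-- On a lagging MariaDB replica replicating with GTID:
CHANGE MASTER TO MASTER_USE_GTID = current_pos;

STOP SLAVE;   -- stops both the IO and SQL threads
START SLAVE;  -- the IO thread discards the existing relay logs and
              -- re-requests events from the GTID the SQL thread last applied

SHOW SLAVE STATUS\G
-- IO-thread and SQL-thread coordinates now coincide, even though the
-- events that were sitting in the relay logs have not been applied
```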

In orchestrator, this happens when RestartReplicationQuick restarts both threads. Then, because the IO thread and the SQL thread report the same binlog coordinates, orchestrator concludes the replica is in sync and proceeds with the failover.

To recap:

  • this has nothing to do with heartbeat/lag evaluation
  • in mariadb
  • upon master failure
  • orchestrator runs RestartReplicationQuick (STOP SLAVE sql_thread; STOP SLAVE io_thread; START SLAVE io_thread; START SLAVE sql_thread). This happens specifically when replicas are lagging at the time the master fails
  • causing the IO thread to jump back to an earlier position (this only happens with MariaDB)
  • causing both SQL and IO threads to point to the same position
  • leading orchestrator to think the SQL thread is up to date with the IO thread
  • promoting the replica prematurely
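The false positive in the last three bullets can be sketched as a tiny, self-contained Go program (names like BinlogCoordinates and sqlThreadUpToDate are illustrative, not orchestrator's actual code):

```go
package main

import "fmt"

// BinlogCoordinates is a hypothetical stand-in for a master log
// file/position pair as seen by a replication thread.
type BinlogCoordinates struct {
	LogFile string
	LogPos  int64
}

// sqlThreadUpToDate mimics the check: "has the SQL thread applied
// everything the IO thread has fetched?"
func sqlThreadUpToDate(ioCoords, sqlCoords BinlogCoordinates) bool {
	return ioCoords == sqlCoords
}

func main() {
	// Before the restart: the replica lags, the IO thread is well ahead.
	io := BinlogCoordinates{LogFile: "mysql-bin.000042", LogPos: 9000}
	sql := BinlogCoordinates{LogFile: "mysql-bin.000042", LogPos: 1000}
	fmt.Println(sqlThreadUpToDate(io, sql)) // false: still lagging

	// After STOP/START SLAVE under MariaDB GTID, the IO thread resumes
	// from the SQL thread's position, so both coordinates coincide.
	io = sql
	fmt.Println(sqlThreadUpToDate(io, sql)) // true: looks caught up,
	// but the unapplied relay-log events are gone
}
```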

Thank you!

@pedroalb pedroalb changed the title Orchestrator promotes lagging replica when configured with MariaDB GTID Orchestrator prematurely promotes lagging replica when configured with MariaDB GTID Nov 2, 2020
@shlomi-noach (Collaborator)

Thank you. Also, @jfg956 points out that he has observed this behavior in MariaDB and reported it:

So, upon stop slave io_thread; start slave io_thread, MariaDB purges the relay logs, which is exactly what we don't want to happen if the master dies.

I believe orchestrator needs special behavior for MariaDB+GTID, where it does not invoke RestartReplicationQuick(). This is unfortunate, as orchestrator will be slower to diagnose a dead master in the event of a locked-down master or a too-many-connections scenario.
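A hypothetical sketch of what that special-casing could look like (invented names; not the actual patch in #1264):

```go
package main

import "fmt"

// Instance is a minimal stand-in for a monitored replica.
type Instance struct {
	UsingMariaDBGTID bool
}

// plannedRestartAction decides whether the emergent "quick restart"
// of replication is safe for this replica.
func plannedRestartAction(inst Instance) string {
	if inst.UsingMariaDBGTID {
		// Restarting the IO thread would purge relay logs under
		// MariaDB GTID, so skip the quick restart entirely.
		return "skip-quick-restart"
	}
	return "restart-replication-quick"
}

func main() {
	fmt.Println(plannedRestartAction(Instance{UsingMariaDBGTID: true}))
	fmt.Println(plannedRestartAction(Instance{UsingMariaDBGTID: false}))
}
```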

@shlomi-noach (Collaborator)

Addressed by #1264
