This repository has been archived by the owner on Sep 30, 2024. It is now read-only.
I am setting up automatic failovers on MariaDB and I chose to use orchestrator as the tool to help me there. I’m doing the tests and so far, it’s working great. However, testing data loss on failovers, I can’t seem to make FailMasterPromotionIfSQLThreadNotUpToDate or DelayMasterPromotionIfSQLThreadNotUpToDate work.
Test environment:
primary with GTID and semi-sync enabled (semi-sync timeout huge enough. I prefer to be down than have data loss).
secondaries with GTID and semi-sync enabled (relay_log_purge=0 and relay-log-recovery=0).
One of the tests I’m doing is to prove that orchestrator allows the candidate replica to apply all the relay logs before orchestrator resets the slave configurations and promotes it as primary.
So, I have sysbench running on a 4th node with enough workload to make the replicas lag. Once they are lagging by about 30 seconds, I shut down the primary node.
Once the primary node is shut down, orchestrator promotes one replica but doesn’t allow the replica to apply all the relay logs (with DelayMasterPromotionIfSQLThreadNotUpToDate) or doesn’t stop the failover (with FailMasterPromotionIfSQLThreadNotUpToDate).
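To make the expectation concrete, here is a minimal sketch of the behavior I expected from the two settings. The function and enum names are hypothetical and are not orchestrator's code. Rejecting the case where both settings are enabled is an assumption of this sketch.

```python
from enum import Enum


class Action(Enum):
    PROMOTE = "promote"
    WAIT_THEN_PROMOTE = "wait for SQL thread to apply relay logs, then promote"
    ABORT = "abort promotion"


def expected_action(sql_thread_up_to_date: bool,
                    fail_if_not_up_to_date: bool,
                    delay_if_not_up_to_date: bool) -> Action:
    """Hypothetical model of the expected promotion decision.

    fail_if_not_up_to_date  models FailMasterPromotionIfSQLThreadNotUpToDate
    delay_if_not_up_to_date models DelayMasterPromotionIfSQLThreadNotUpToDate
    """
    # Assumption of this sketch: the two settings are mutually exclusive.
    if fail_if_not_up_to_date and delay_if_not_up_to_date:
        raise ValueError("enable only one of the two settings")
    if sql_thread_up_to_date:
        return Action.PROMOTE
    if fail_if_not_up_to_date:
        return Action.ABORT
    if delay_if_not_up_to_date:
        return Action.WAIT_THEN_PROMOTE
    # Neither setting enabled: promotion goes ahead despite unapplied relay logs.
    return Action.PROMOTE
```

In my tests the candidate replica is lagging, so I expected the abort or the wait branch, but the promotion goes ahead as if the SQL thread were up to date.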
It seems that when master_use_gtid is set to current_pos (or slave_pos) and both replica threads (i/o and sql) are restarted, the i/o thread purges all relay logs and starts pulling from the last position that the sql thread has applied.
In orchestrator, this happens when RestartReplicationQuick restarts both threads. Then, because the i/o thread and the sql thread have the same binlog coordinates, orchestrator thinks that the replica is in sync and proceeds with the failover.
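The effect can be illustrated with a toy model. This is only a sketch of the behavior described above, not MariaDB or orchestrator code: the `Replica` class, the integer positions, and the equality check are all hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    read_pos: int    # how far the i/o thread has fetched from the primary
    exec_pos: int    # how far the sql thread has applied
    gtid_mode: bool  # True models master_use_gtid = current_pos or slave_pos

    def unapplied_events(self) -> int:
        return self.read_pos - self.exec_pos

    def restart_both_threads(self) -> None:
        # Behavior reported in this issue: with MariaDB GTID, restarting both
        # threads purges the relay logs and the i/o thread resumes from the
        # last position applied by the sql thread.
        if self.gtid_mode:
            self.read_pos = self.exec_pos


def sql_thread_up_to_date(replica: Replica) -> bool:
    # Simplified stand-in for a check that compares i/o and sql coordinates.
    return replica.read_pos == replica.exec_pos


def demo() -> tuple:
    lagging = Replica(read_pos=1000, exec_pos=400, gtid_mode=True)
    lost = lagging.unapplied_events()        # events fetched but not applied
    before = sql_thread_up_to_date(lagging)  # lag is visible before the restart
    lagging.restart_both_threads()
    after = sql_thread_up_to_date(lagging)   # lag is hidden after the restart
    return lost, before, after
```

In this model the check reports the replica as lagging before the restart and as up to date afterwards, even though the fetched events were never applied and the primary is no longer there to serve them again.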
To recap:
this has nothing to do with heartbeat/lag evaluation
in MariaDB
upon master failure
orchestrator runs a RestartReplicationQuick (stop slave sql_thread, stop slave io_thread, start slave io_thread, start slave sql_thread). This happens specifically when replicas are lagging at the time the master fails
causing IO thread to jump back to an earlier position (this only happens with MariaDB)
causing both SQL and IO threads to point to same position
leading orchestrator to think SQL thread is up to date with IO thread
promoting the replica prematurely
Thank you!
pedroalb changed the title from "Orchestrator promotes lagging replica when configured with MariaDB GTID" to "Orchestrator prematurely promotes lagging replica when configured with MariaDB GTID" on Nov 2, 2020
So, upon stop slave io_thread; start slave io_thread, MariaDB purges the relay logs, which is exactly what we don't want to happen if the master dies.
I believe orchestrator needs to have special behavior for MariaDB+GTID, where it does not invoke the function RestartReplicationQuick(). This is unfortunate, because orchestrator will be slower to diagnose a dead master in the event of a locked-down master or a too-many-connections scenario.
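As a rough sketch of what such a special case could look like (orchestrator itself is written in Go; the function name and parameters here are hypothetical and only illustrate the proposed condition):

```python
def should_restart_replication_quick(is_mariadb: bool, using_gtid: bool) -> bool:
    """Hypothetical guard for the proposed special case.

    Returns False only for MariaDB replicas using GTID, where restarting the
    i/o thread would purge relay logs that have not been applied yet.
    """
    return not (is_mariadb and using_gtid)
```

Every other combination keeps the quick restart, so the slower diagnosis would only affect MariaDB+GTID setups.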
Hi!
(This is an issue created after the follow-up discussion on the MySQL Community Slack: https://mysqlcommunity.slack.com/archives/C9AB5JVNG/p1604228957063100)