This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

Orchestrator prematurely promotes lagging replica when configured with MariaDB GTID #1260

Closed
pedroalb opened this issue Nov 2, 2020 · 2 comments · Fixed by #1264

pedroalb commented Nov 2, 2020

Hi!

(This issue was created after a follow-up discussion on the MySQL Community Slack: https://mysqlcommunity.slack.com/archives/C9AB5JVNG/p1604228957063100 )

I am setting up automatic failovers on MariaDB and chose orchestrator as the tool to help me there. In my tests so far it's working great. However, when testing data loss on failovers, I can't get FailMasterPromotionIfSQLThreadNotUpToDate or DelayMasterPromotionIfSQLThreadNotUpToDate to work.
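For context, the relevant part of my orchestrator.conf.json looks roughly like this (a sketch; values are illustrative, and as I understand it orchestrator treats these two options as mutually exclusive, so only one should be enabled at a time):

```json
{
  "DelayMasterPromotionIfSQLThreadNotUpToDate": true,
  "FailMasterPromotionIfSQLThreadNotUpToDate": false
}
```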

Test environment:

  • primary with GTID and semi-sync enabled (semi-sync timeout set large enough to effectively never expire; I prefer being down to losing data).
  • secondaries with GTID and semi-sync enabled (relay_log_purge=0 and relay_log_recovery=0).
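For reference, a sketch of the my.cnf settings behind that setup, assuming MariaDB's built-in semi-sync support (10.3+); the timeout value is illustrative:

```ini
[mariadb]
# Primary side: block commits until a replica acknowledges, effectively forever
rpl_semi_sync_master_enabled = ON
rpl_semi_sync_master_timeout = 100000000

# Replica side
rpl_semi_sync_slave_enabled  = ON
relay_log_purge              = 0
relay_log_recovery           = 0
```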

One of the tests is meant to prove that orchestrator lets the candidate replica apply all of its relay logs before resetting the replica configuration and promoting it as primary.
So I have sysbench running on a fourth node with enough workload to make the replicas lag. Once they are lagging by about 30 seconds, I shut down the primary node.
Once the primary node is shut down, orchestrator promotes a replica but neither lets it apply all the relay logs first (with DelayMasterPromotionIfSQLThreadNotUpToDate) nor aborts the failover (with FailMasterPromotionIfSQLThreadNotUpToDate).

It seems that when master_use_gtid is set to current_pos (or slave_pos) and both replication threads (IO and SQL) are restarted, the IO thread purges all relay logs and starts pulling from the last position the SQL thread has applied.
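The behavior can be observed directly on a lagging replica, outside of orchestrator (a sketch; adjust for your topology):

```sql
-- On a lagging MariaDB replica replicating with GTID:
CHANGE MASTER TO MASTER_USE_GTID = current_pos;

STOP SLAVE;   -- stops both the IO and SQL threads
START SLAVE;  -- the IO thread discards the existing relay logs and
              -- re-requests events from the GTID the SQL thread last applied

SHOW SLAVE STATUS\G
-- IO-thread and SQL-thread coordinates now coincide, even though the
-- events that were sitting in the relay logs have not been applied
```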

In orchestrator, this happens when RestartReplicationQuick restarts both threads. Then, because the IO thread and the SQL thread report the same binlog coordinates, orchestrator concludes the replica is in sync and proceeds with the failover.

To recap:

  • this has nothing to do with heartbeat/lag evaluation
  • in mariadb
  • upon master failure
  • orchestrator runs RestartReplicationQuick (STOP SLAVE sql_thread; STOP SLAVE io_thread; START SLAVE io_thread; START SLAVE sql_thread). This happens specifically when replicas are lagging at the time the master fails
  • causing the IO thread to jump back to an earlier position (this only happens with MariaDB)
  • causing both SQL and IO threads to point to the same position
  • leading orchestrator to think the SQL thread is up to date with the IO thread
  • promoting the replica prematurely
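The false positive in the last three bullets can be sketched as a tiny, self-contained Go program (names like BinlogCoordinates and sqlThreadUpToDate are illustrative, not orchestrator's actual code):

```go
package main

import "fmt"

// BinlogCoordinates is a hypothetical stand-in for a master log
// file/position pair as seen by a replication thread.
type BinlogCoordinates struct {
	LogFile string
	LogPos  int64
}

// sqlThreadUpToDate mimics the check: "has the SQL thread applied
// everything the IO thread has fetched?"
func sqlThreadUpToDate(ioCoords, sqlCoords BinlogCoordinates) bool {
	return ioCoords == sqlCoords
}

func main() {
	// Before the restart: the replica lags, the IO thread is well ahead.
	io := BinlogCoordinates{LogFile: "mysql-bin.000042", LogPos: 9000}
	sql := BinlogCoordinates{LogFile: "mysql-bin.000042", LogPos: 1000}
	fmt.Println(sqlThreadUpToDate(io, sql)) // false: still lagging

	// After STOP/START SLAVE under MariaDB GTID, the IO thread resumes
	// from the SQL thread's position, so both coordinates coincide.
	io = sql
	fmt.Println(sqlThreadUpToDate(io, sql)) // true: looks caught up,
	// but the unapplied relay-log events are gone
}
```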

Thank you!

@pedroalb pedroalb changed the title Orchestrator promotes lagging replica when configured with MariaDB GTID Orchestrator prematurely promotes lagging replica when configured with MariaDB GTID Nov 2, 2020
@shlomi-noach (Collaborator)

Thank you. Also, @jfg956 points out that he has observed this behavior in MariaDB and reported it:

So, upon stop slave io_thread; start slave io_thread, MariaDB purges the relay logs, which is exactly what we don't want to happen if the master dies.

I believe orchestrator needs special behavior for MariaDB+GTID, where it does not invoke RestartReplicationQuick(). This is unfortunate, as orchestrator will be slower to diagnose a dead master in the event of a locked-down master or a too-many-connections scenario.
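A hypothetical sketch of what that special-casing could look like (invented names; not the actual patch in #1264):

```go
package main

import "fmt"

// Instance is a minimal stand-in for a monitored replica.
type Instance struct {
	UsingMariaDBGTID bool
}

// plannedRestartAction decides whether the emergent "quick restart"
// of replication is safe for this replica.
func plannedRestartAction(inst Instance) string {
	if inst.UsingMariaDBGTID {
		// Restarting the IO thread would purge relay logs under
		// MariaDB GTID, so skip the quick restart entirely.
		return "skip-quick-restart"
	}
	return "restart-replication-quick"
}

func main() {
	fmt.Println(plannedRestartAction(Instance{UsingMariaDBGTID: true}))
	fmt.Println(plannedRestartAction(Instance{UsingMariaDBGTID: false}))
}
```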

@shlomi-noach (Collaborator)

Addressed by #1264
