Orchestrator promotes a replica with lag on Mariadb #1363
This PR goes at least halfway through: #1366. Question: in your actual production incident, was replication running on the replica?
In one case it was running but behind; in another it was down due to an error. In both cases the replica was wrongly promoted.
Any ETA on when the fix will be available? We have a few sites that experience this issue.
I hope to reproduce this week.
Thanks Shlomi. I am looking forward to it.
I followed the reproduction steps, but this does not reproduce for me: both replicas apply the relay log and then get promoted. I'm using current master.
Yes, of course. I used MariaDB 10.5.10, same as reported.
I'll continue early next week. This week is increasingly busy.
Hi, maybe I have a similar issue. When both slaves were lagging behind the master for more than 60 seconds, I issued "shutdown;" on the master, hoping orchestrator would wait for one of the slaves' SQL threads to be up to date with the master and only then begin a promotion. However, the promotion began immediately when the master went away, so the promoted slave lost some transactions. This is not what we expected. For now, I have to set certain parameters in my production environment; with those settings, promotion fails when all slaves lag behind the master. By the way, I use sysbench to generate load on the master, and I set "set global binlog_group_commit_sync_delay=20000;" on the slaves to simulate lag.
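For reference, a minimal sketch of that lag simulation, assuming MySQL replicas reachable from a local client; note that binlog_group_commit_sync_delay is a MySQL variable (on MariaDB the analogous group-commit knob is binlog_commit_wait_usec):

# Delay each binlog group commit on a replica by 20000 microseconds, which
# throttles the SQL thread and builds up replication lag under sysbench load.
mysql -e "set global binlog_group_commit_sync_delay=20000;"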
I couldn't simulate this issue (a replica getting promoted as master while there is lag) on MySQL 5.7, but I was using orchestrator 3.2.4. I know there are fixes related to this issue in 3.2.4. Try that version or 3.2.5 and see; hope that works for you.
Gonna clear some time tomorrow for testing.
Can you please share your MariaDB config? Also, a ...
I'm able to reproduce.
OK, this does after all circle back to the surprising behavior on MariaDB, explained in #1260 (comment). As it turns out, when a replica cannot connect to its master/primary and tries to reconnect, it resets the value of ...
At this time the coordinates on my test env show the following; mysql -e "show slave status\G" | grep Master_Log output on both replicas: ...
which makes perfect sense. The primary has advanced as far as pos=... Now, we stop the primary. Whatever it is that ...
Notice how in both replicas, the value of ... Anyway, we're here to solve this. I thought #1264 should have solved it, but apparently there is some other place where ...
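For anyone following along, the reset is easy to observe with the same query used above, via the dbdeployer sandbox shortcuts:

# Compare coordinates on both replicas before and after the primary goes away;
# run from the sandbox directory (~/sandboxes/mariadb10).
./s1 -e "show slave status\G" | grep Master_Log
./s2 -e "show slave status\G" | grep Master_Log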
MariaDB's behavior baffles me. Irrespective of ..., I've fixed ...
I kill the primary, ...
At this time, replica ...
the ... And no, the magic trick ...
The value of ...

Why? No idea. I consider this a bug in MariaDB, and a loss of data. In fact, I'm not sure how it is at all possible to recover relay log entries on ...

So what's next? I'll present an ...
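A quick way to see whether the unapplied relay log entries physically survive, assuming the dbdeployer sandbox layout used in the script below (datadir under each node directory; the glob is an assumption about the relay log file naming):

# List relay log files on the first replica; if MariaDB purged them on
# reconnect, the unapplied events are gone from disk as well.
ls -l ~/sandboxes/mariadb10/node1/data/*relay*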
Solution in #1374. Please try it out.
This is the reproduction script I used (requires MariaDB, dbdeployer):

# Tear down any previous sandbox.
cd ~/sandboxes/mariadb10
./stop_all
cd ~/sandboxes
dbdeployer delete rsandbox_10_5_10
rm -rf ./rsandbox_10_5_10 ./mariadb10
# Deploy a primary with two replicas; binlogs and log_slave_updates enabled on all nodes.
dbdeployer deploy replication 10.5.10 --skip-report-host --my-cnf-options log_bin --my-cnf-options log_slave_updates --my-cnf-options "report_host=127.0.0.1"
ln -s rsandbox_10_5_10 mariadb10
cd mariadb10
./m -uroot -e "grant all on *.* to 'ci'@'%' identified by 'ci'"
./m test -e "set global rpl_semi_sync_master_enabled=1; set global rpl_semi_sync_master_timeout=10000000"
./s1 test -e "set global rpl_semi_sync_slave_enabled=1; stop slave; start slave"
./s2 test -e "set global rpl_semi_sync_slave_enabled=1; stop slave; start slave"
./s1 test -e "stop slave; change master to MASTER_USE_GTID=current_pos; start slave;"
./s2 test -e "stop slave; change master to MASTER_USE_GTID=current_pos; start slave;"
./m test -e "drop table if exists t; create table t(id bigint auto_increment primary key, val int not null)"
./m test -e "insert into t values(null,7)"
sleep 1
./use_all "select * from test.t"
./s1 test -e "stop slave sql_thread;"
./s2 test -e "stop slave sql_thread;"
./m test -e "insert into t values(null,11)"
sleep 1
./use_all "select * from test.t"
./s1 -e "show slave status\G"
# Kill the primary; a promotion decision must now be made while both replicas lag.
./master/stop
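From here, an orchestrator instance watching the sandbox should detect the dead primary and, with the fix, hold promotion until a replica has applied its relay log. A hedged way to watch that from the shell, assuming orchestrator is configured against the sandbox (the port placeholder is whatever dbdeployer assigned to a replica):

# Print the topology as orchestrator sees it; the replica should not be
# promoted while its relay log is still being applied.
orchestrator -c topology -i 127.0.0.1:REPLICA_PORT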
Thank you so much @shlomi-noach. A lot to consume, and great work. I will try out the solution in #1374. The way I understand it, with this fix orchestrator will stop promoting a replica with lag. Is that right?
Correct, as per #1363 (comment), leading to ...
Thnx. Will this fix be available in 3.2.5 or the next release?
Let's first test it?
@shlomi-noach - I opened a case with MariaDB support on why data gets lost (if we take orchestrator out of the equation). Here's their reply: ... Is this possible?
@shlomi-noach - I tried to clone the branch, compile, and start orchestrator, but it wouldn't come up.
Apologies, the build was broken and I didn't even notice. It's now fixed, and here's the binary: https://github.com/openark/orchestrator/suites/3073155843/artifacts/70038639
I confess this is confusing to me. The replica is already at those relay log coordinates. I'm not sure if the intention is that we re-apply those same coordinates via CHANGE MASTER TO ... FWIW, in #1374 and as illustrated, ...
Thank you, sir. I appreciate it. I will give it a shot and let you know.
Didn't make sense to me either when I read it. What I find interesting is that, even if you take orchestrator out of the equation, the behavior between MySQL and MariaDB differs a lot when the slave thread is restarted. I will do some testing and let MariaDB support know. What would be ideal is: the primary goes down, and orchestrator waits for one of the replicas to catch up before promoting it.
So, I tried to reproduce without any orchestrator involvement: ...
From this point, I ran ...
Looks like a specific GLIBC version is required by orchestrator. I will try to download it and see if I can get past this error.
What OS are you running on? Try ...
I am on CentOS.
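Before rebuilding, a quick way to confirm a GLIBC mismatch, assuming the downloaded binary sits in the current directory:

# Show the glibc version installed on this host...
ldd --version | head -1
# ...and resolve the binary's shared-library dependencies; a missing
# GLIBC_2.xx symbol version shows up as an error here.
ldd ./orchestrator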
I was able to build successfully by downloading from the branch recovery-mariadb-delay-replication.
Logs with --debug please?
Thanks, mohankara! I tested successfully with orchestrator 3.2.4.
Attaching the orchestrator log file from the raft leader node.
I just added a few more audit messages. Can you please run again? On my end the behavior is correct.
Attached are the zip files of the orchestrator logs.
Your binary is not the right version. It should be ... I assume your CentOS is CentOS 6.X? (which has an older GLIBC). To build, please follow these steps:

git clone git@github.com:openark/orchestrator.git
cd orchestrator
git fetch origin recovery-mariadb-delay-replication
git checkout recovery-mariadb-delay-replication
./script/build
ls -l bin/orchestrator # this is the binary
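To double-check the build before running it, one can confirm the checked-out commit from inside the repo; it should match the version hash the running orchestrator reports (compare with the 0308d... hash mentioned below):

# From inside the cloned repo: print the short commit hash of the build.
git rev-parse --short HEAD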
I am running CentOS 7.9. I will rebuild using the steps you have provided and let you know. Hope this is it ;)
OK, tried rebuilding again. The version is 0308d..., which I think is the latest with the audit changes you made. Let me test and see.
I tested a few times today. I can see orchestrator is NOT promoting replicas with lag (or when SQL_THREAD is stopped). That's a good sign. To make Node 2 the master, I did these steps: ... There was no data loss of any kind.
That's the normal and expected behavior, and is unlikely to change in the near future. I'm still surprised about the need to repoint the replica onto the same relay log position, just to prevent it from purging relay logs. I'll await your tests.
Yeah, me too. In a normal scenario, when I stop the slave SQL_THREAD and start it again, I don't lose any data; it starts from where it left off. This happens only when a CHANGE MASTER command is executed: per the docs, CHANGE MASTER purges the relay logs, as mentioned here: https://mariadb.com/kb/en/change-master-to/#relay-log-options. I will run tests and let you know.
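To make that concrete: per the KB page above, the relay logs are preserved only when the relay log coordinates are passed explicitly, so the repoint has to hand the replica back its own current position. A sketch against the sandbox replica; the file and position values are placeholders to be read from the first command:

# Read the replica's current relay log coordinates...
./s1 -e "show slave status\G" | grep -E "Relay_Log_File|Relay_Log_Pos"
# ...then repoint to those same coordinates; specifying relay_log_file and
# relay_log_pos tells MariaDB not to purge the existing relay logs.
./s1 test -e "stop slave; change master to relay_log_file='<file>', relay_log_pos=<pos>; start slave;"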
Ran a few more tests; seems fine to me. I didn't see orchestrator promoting a replica with lag. I will continue to run more tests and let you know if I see anything different.
It will be rolled out in the next release; I don't work with a strict schedule, to be honest. I just release when it makes sense, or when users remind me I'm late (very shameful). I'm a bit dissatisfied that I don't have CI coverage for MariaDB; I'd like to have this fix tested and validated, but I don't have the time right now to invest in that path.
We have a 3-node MariaDB 10.5.10 setup on CentOS: one primary and two replicas, with semi-sync enabled.
Our current orchestrator version is 3.2.4.
We had a scenario where the replicas were lagging by a few hours and the master was not reachable, so one of the replicas was promoted to primary in spite of the huge lag. This resulted in data loss. Ideally, orchestrator should wait for the replica's relay logs to be applied on the replica and only then promote it as master. This seems to be the behavior on MySQL, based on my testing, but not on MariaDB.
Test case:
Tests against MySQL and MariaDB were done with these orchestrator parameters in /etc/orchestrator.conf.json:
"DelayMasterPromotionIfSQLThreadNotUpToDate": true,
"debug": true
Restart orchestrator on all 3 nodes
I) Test on MariaDB:
Start a 3-node MariaDB cluster (semi-sync enabled)
1. Create and add data to a test table:
create table test (colA int, colB int, colC datetime, colD int);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
2. Stop slave SQL_THREAD on replicas (Nodes 2 and 3)
3. Wait a few secs, then add some more data to Node 1 (master):
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
4. Stop mysqld on the master (Node 1)
5. You will see orchestrator promoting a replica without the data added in Step 3.
II) Test on MySQL 5.7.32:
Repeat the same test on a 3-node MySQL cluster.
You will notice orchestrator promoting one of the replicas without any data loss, i.e. seeing all 14 rows!
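A simple way to check for loss after the failover, assuming a dbdeployer-style sandbox like the repro script above and that the table lives in the test schema (on a real cluster, run the count on each node instead):

# Every node should report 14 rows (7 inserts before the SQL threads were
# stopped, 7 after) if nothing was lost.
./use_all "select count(*) from test.test"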
Thank You
Mohan