Orchestrator promotes a replica with lag on Mariadb #1363
This PR goes at least halfway through: #1366. Question: in your actual production incident, was replication running on the replica?
In one case it was running but behind; in another it was down due to an error. In both cases the replica was wrongly promoted.
Any ETA on when the fix will be available? We have a few sites that experience this issue.
I hope to reproduce this week.
Thanks Shlomi. I am looking forward to it.
I followed the reproduction steps, but this does not reproduce for me: both replicas apply the relay log and then get promoted. I'm using current master.
Yes, of course. I used MariaDB 10.5.10, same as reported.
I'll continue early next week. This week is increasingly busy.
Hi, maybe I have a similar issue. When both slaves were lagging behind the master for more than 60 seconds, I issued "shutdown;" on the master, hoping orchestrator would wait for one of the slaves' SQL threads to be up to date with the master and only then begin a promotion. However, the promotion began immediately when the master went away, so the promoted slave lost some transactions. This is not what we expected. For now, I have to set certain parameters in my production environment; with those settings, promotion fails when all slaves lag behind the master. By the way, I use sysbench to generate load on the master, and I set "set global binlog_group_commit_sync_delay=20000;" on the slaves to simulate lag.
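For reference, a minimal sketch of that lag simulation, assuming MySQL replicas reachable from a local client; note that binlog_group_commit_sync_delay is a MySQL variable (on MariaDB the analogous group-commit knob is binlog_commit_wait_usec):

# Delay each binlog group commit on a replica by 20000 microseconds, which
# throttles the SQL thread and builds up replication lag under sysbench load.
mysql -e "set global binlog_group_commit_sync_delay=20000;"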
I couldn't simulate this issue (a replica getting promoted as master while there is lag) on MySQL 5.7, but I was using orchestrator 3.2.4. I know there are fixes related to this issue in 3.2.4. Try that version or 3.2.5 and see; hope that works for you.
Gonna clear some time tomorrow for testing.
Can you please share your MariaDB config? Also, a ...
I'm able to reproduce.
OK, this does after all circle back to the surprising behavior on MariaDB, explained in #1260 (comment). As it turns out, when a replica cannot connect to its master/primary and tries to reconnect, it resets the value of ...
At this time the coordinates on my test env show the following; mysql -e "show slave status\G" | grep Master_Log output on both replicas: ...
which makes perfect sense. The primary has advanced as far as pos=... Now, we stop the primary. Whatever it is that ...
Notice how in both replicas, the value of ... Anyway, we're here to solve this. I thought #1264 should have solved it, but apparently there is some other place where ...
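For anyone following along, the reset is easy to observe with the same query used above, via the dbdeployer sandbox shortcuts:

# Compare coordinates on both replicas before and after the primary goes away;
# run from the sandbox directory (~/sandboxes/mariadb10).
./s1 -e "show slave status\G" | grep Master_Log
./s2 -e "show slave status\G" | grep Master_Log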
MariaDB's behavior baffles me. Irrespective of ..., I've fixed ...
I kill the primary, ...
At this time, replica ...
the ... And no, the magic trick ...
The value of ...

Why? No idea. I consider this a bug in MariaDB, and a loss of data. In fact, I'm not sure how it is at all possible to recover relay log entries on ...

So what's next? I'll present an ...
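A quick way to see whether the unapplied relay log entries physically survive, assuming the dbdeployer sandbox layout used in the script below (datadir under each node directory; the glob is an assumption about the relay log file naming):

# List relay log files on the first replica; if MariaDB purged them on
# reconnect, the unapplied events are gone from disk as well.
ls -l ~/sandboxes/mariadb10/node1/data/*relay*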
Solution in #1374. Please try it out.
This is the reproduction script I used (requires MariaDB, dbdeployer):

# Tear down any previous sandbox.
cd ~/sandboxes/mariadb10
./stop_all
cd ~/sandboxes
dbdeployer delete rsandbox_10_5_10
rm -rf ./rsandbox_10_5_10 ./mariadb10
# Deploy a primary with two replicas; binlogs and log_slave_updates enabled on all nodes.
dbdeployer deploy replication 10.5.10 --skip-report-host --my-cnf-options log_bin --my-cnf-options log_slave_updates --my-cnf-options "report_host=127.0.0.1"
ln -s rsandbox_10_5_10 mariadb10
cd mariadb10
./m -uroot -e "grant all on *.* to 'ci'@'%' identified by 'ci'"
./m test -e "set global rpl_semi_sync_master_enabled=1; set global rpl_semi_sync_master_timeout=10000000"
./s1 test -e "set global rpl_semi_sync_slave_enabled=1; stop slave; start slave"
./s2 test -e "set global rpl_semi_sync_slave_enabled=1; stop slave; start slave"
./s1 test -e "stop slave; change master to MASTER_USE_GTID=current_pos; start slave;"
./s2 test -e "stop slave; change master to MASTER_USE_GTID=current_pos; start slave;"
./m test -e "drop table if exists t; create table t(id bigint auto_increment primary key, val int not null)"
./m test -e "insert into t values(null,7)"
sleep 1
./use_all "select * from test.t"
./s1 test -e "stop slave sql_thread;"
./s2 test -e "stop slave sql_thread;"
./m test -e "insert into t values(null,11)"
sleep 1
./use_all "select * from test.t"
./s1 -e "show slave status\G"
# Kill the primary; a promotion decision must now be made while both replicas lag.
./master/stop
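From here, an orchestrator instance watching the sandbox should detect the dead primary and, with the fix, hold promotion until a replica has applied its relay log. A hedged way to watch that from the shell, assuming orchestrator is configured against the sandbox (the port placeholder is whatever dbdeployer assigned to a replica):

# Print the topology as orchestrator sees it; the replica should not be
# promoted while its relay log is still being applied.
orchestrator -c topology -i 127.0.0.1:REPLICA_PORT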
Thank you so much @shlomi-noach. A lot to consume, and great work. I will try out the solution in #1374. The way I understand it, with this fix orchestrator will stop promoting a replica with lag. Is that right?
Correct, as per #1363 (comment), leading to ...
Thnx. Will this fix be available in 3.2.5 or the next release?
Let's first test it?
@shlomi-noach - I opened a case with MariaDB support on why data gets lost (if we take orchestrator out of the equation). Here's their reply: ... Is this possible?
@shlomi-noach - I tried to clone the branch, compile, and start orchestrator, but it wouldn't come up.
Apologies, the build was broken and I didn't even notice. It's now fixed, and here's the binary: https://github.com/openark/orchestrator/suites/3073155843/artifacts/70038639
I confess this is confusing to me. The replica is already at those relay log coordinates. I'm not sure if the intention is that we re-apply those same coordinates via CHANGE MASTER TO ... FWIW, in #1374 and as illustrated, ...
Thank you, sir. I appreciate it. I will give it a shot and let you know.
Didn't make sense to me either when I read it. What I find interesting is that, even if you take orchestrator out of the equation, the behavior between MySQL and MariaDB differs a lot when the slave thread is restarted. I will do some testing and let MariaDB support know. What would be ideal is: the primary goes down, and orchestrator waits for one of the replicas to catch up before promoting it.
So, I tried to reproduce without any orchestrator involvement: ...
From this point, I ran ...
Looks like a specific GLIBC version is required by orchestrator. I will try to download it and see if I can get past this error.
What OS are you running on? Try ...
I am on CentOS.
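Before rebuilding, a quick way to confirm a GLIBC mismatch, assuming the downloaded binary sits in the current directory:

# Show the glibc version installed on this host...
ldd --version | head -1
# ...and resolve the binary's shared-library dependencies; a missing
# GLIBC_2.xx symbol version shows up as an error here.
ldd ./orchestrator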
I was able to build successfully by downloading from the branch recovery-mariadb-delay-replication.
Logs with --debug please?
Thanks, mohankara! I tested successfully with orchestrator 3.2.4.
Attaching the orchestrator log file from the raft leader node.
I just added a few more audit messages. Can you please run again? On my end the behavior is correct.
Attached are the zip files of the orchestrator logs.
Your binary is not the right version. It should be ... I assume your CentOS is CentOS 6.X? (which has an older GLIBC). To build, please follow these steps:

git clone git@github.com:openark/orchestrator.git
cd orchestrator
git fetch origin recovery-mariadb-delay-replication
git checkout recovery-mariadb-delay-replication
./script/build
ls -l bin/orchestrator # this is the binary
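To double-check the build before running it, one can confirm the checked-out commit from inside the repo; it should match the version hash the running orchestrator reports (compare with the 0308d... hash mentioned below):

# From inside the cloned repo: print the short commit hash of the build.
git rev-parse --short HEAD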
I am running CentOS 7.9. I will rebuild using the steps you have provided and let you know. Hope this is it ;)
OK, tried rebuilding again. The version is 0308d..., which I think is the latest with the audit changes you made. Let me test and see.
I tested a few times today. I can see orchestrator is NOT promoting replicas with lag (or when SQL_THREAD is stopped). That's a good sign. To make Node 2 the master, I did these steps: ... There was no data loss of any kind.
That's the normal and expected behavior, and is unlikely to change in the near future. I'm still surprised about the need to repoint the replica onto the same relay log position, just to prevent it from purging relay logs. I'll await your tests.
Yeah, me too. In a normal scenario, when I stop the slave SQL_THREAD and start it again, I don't lose any data; it starts from where it left off. This happens only when a CHANGE MASTER command is executed: per the docs, CHANGE MASTER purges the relay logs, as mentioned here: https://mariadb.com/kb/en/change-master-to/#relay-log-options. I will run tests and let you know.
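To make that concrete: per the KB page above, the relay logs are preserved only when the relay log coordinates are passed explicitly, so the repoint has to hand the replica back its own current position. A sketch against the sandbox replica; the file and position values are placeholders to be read from the first command:

# Read the replica's current relay log coordinates...
./s1 -e "show slave status\G" | grep -E "Relay_Log_File|Relay_Log_Pos"
# ...then repoint to those same coordinates; specifying relay_log_file and
# relay_log_pos tells MariaDB not to purge the existing relay logs.
./s1 test -e "stop slave; change master to relay_log_file='<file>', relay_log_pos=<pos>; start slave;"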
Ran a few more tests; seems fine to me. I didn't see orchestrator promoting a replica with lag. I will continue to run more tests and let you know if I see anything different.
It will be rolled out in the next release; I don't work with a strict schedule, to be honest. I just release when it makes sense, or when users remind me I'm late (very shameful). I'm a bit dissatisfied that I don't have CI coverage for MariaDB; I'd like to have this fix tested and validated, but I don't have the time right now to invest in that path.
We have a 3-node MariaDB 10.5.10 setup on CentOS: one primary and two replicas, with semi-sync enabled.
Our current orchestrator version is 3.2.4.
We had a scenario where the replicas were lagging by a few hours and the master was not reachable, so one of the replicas was promoted to primary in spite of the huge lag. This resulted in data loss. Ideally, orchestrator should wait for the replica's relay logs to be applied on the replica and only then promote it as master. This seems to be the behavior on MySQL, based on my testing, but not on MariaDB.
Test case:
Tests against MySQL and MariaDB were done with these orchestrator parameters in /etc/orchestrator.conf.json:
"DelayMasterPromotionIfSQLThreadNotUpToDate": true,
"debug": true
Restart orchestrator on all 3 nodes
I) Test on MariaDB:
Start a 3-node MariaDB cluster (semi-sync enabled)
1. Create and add data to a test table:
create table test (colA int, colB int, colC datetime, colD int);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
2. Stop slave SQL_THREAD on replicas (Nodes 2 and 3)
3. Wait a few secs, then add some more data to Node 1 (master):
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
4. Stop mysqld on the master (Node 1)
5. You will see orchestrator promoting a replica without the data added in Step 3.
II) Test on MySQL 5.7.32:
Repeat the same test on a 3-node MySQL cluster.
You will notice orchestrator promoting one of the replicas without any data loss, i.e. seeing all 14 rows!
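A simple way to check for loss after the failover, assuming a dbdeployer-style sandbox like the repro script above and that the table lives in the test schema (on a real cluster, run the count on each node instead):

# Every node should report 14 rows (7 inserts before the SQL threads were
# stopped, 7 after) if nothing was lost.
./use_all "select count(*) from test.test"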
Thank You
Mohan