This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

hook for graceful master switch #428

Closed
igroene opened this issue Mar 5, 2018 · 29 comments

@igroene

igroene commented Mar 5, 2018

I have been running some graceful master takeover testing using ProxySQL and Orchestrator together, and I believe it would be a good idea to have a hook that is triggered even earlier than PreFailoverProcesses.
The issue with PreFailoverProcesses is that it is triggered after the demoted master has already been placed by Orchestrator in read_only mode, as shown by this extract from the log:

Mar 03 14:25:10 mysql3 orchestrator[25032]: [martini] Started GET /api/graceful-master-takeover/mysql1/3306 for 192.168.56.1
Mar 03 14:25:10 mysql3 orchestrator[25032]: 2018-03-03 14:25:10 INFO Will demote mysql1:3306 and promote mysql2:3306 instead
Mar 03 14:25:10 mysql3 orchestrator[25032]: 2018-03-03 14:25:10 INFO Stopped slave on mysql2:3306, Self:mysql-bin.000009:3034573, Exec:mysql-bin.000010:18546301
Mar 03 14:25:10 mysql3 orchestrator[25032]: 2018-03-03 14:25:10 INFO Will set mysql1:3306 as read_only
Mar 03 14:25:10 mysql3 orchestrator[25032]: 2018-03-03 14:25:10 INFO instance mysql1:3306 read_only: true
Mar 03 14:25:10 mysql3 orchestrator[25032]: 2018-03-03 14:25:10 INFO auditType:read-only instance:mysql1:3306 cluster:mysql1:3306 message:set as true
Mar 03 14:25:10 mysql3 orchestrator[25032]: 2018-03-03 14:25:10 INFO Will advance mysql2:3306 to master coordinates mysql-bin.000010:18546301
Mar 03 14:25:10 mysql3 orchestrator[25032]: 2018-03-03 14:25:10 INFO Will start slave on mysql2:3306 until coordinates: mysql-bin.000010:18546301
Mar 03 14:25:10 mysql3 orchestrator[25032]: 2018-03-03 14:25:10 INFO Stopped slave on mysql2:3306, Self:mysql-bin.000009:3034573, Exec:mysql-bin.000010:18546301
Mar 03 14:25:10 mysql3 orchestrator[25032]: 2018-03-03 14:25:10 INFO executeCheckAndRecoverFunction: proceeding with DeadMaster detection on mysql1:3306; isActionable?: true; skipProcesses: false
Mar 03 14:25:10 mysql3 orchestrator[25032]: 2018-03-03 14:25:10 INFO topology_recovery: detected DeadMaster failure on mysql1:3306
Mar 03 14:25:10 mysql3 orchestrator[25032]: 2018-03-03 14:25:10 INFO topology_recovery: Running 1 OnFailureDetectionProcesses hooks
Mar 03 14:25:10 mysql3 orchestrator[25032]: 2018-03-03 14:25:10 DEBUG orchestrator/raft: applying command 2055: write-recovery-step
Mar 03 14:25:10 mysql3 orchestrator[25032]: 2018-03-03 14:25:10 INFO topology_recovery: Running OnFailureDetectionProcesses hook 1 of 1: echo 'Detected DeadMaster on mysql1:3306. Affected replicas: 1' >> /tmp/recovery.log
Mar 03 14:25:10 mysql3 orchestrator[25032]: 2018-03-03 14:25:10 DEBUG orchestrator/raft: applying command 2056: write-recovery-step
Mar 03 14:25:10 mysql3 orchestrator[25032]: 2018-03-03 14:25:10 INFO CommandRun(echo 'Detected DeadMaster on mysql1:3306. Affected replicas: 1' >> /tmp/recovery.log,[])
Mar 03 14:25:10 mysql3 orchestrator[25032]: 2018-03-03 14:25:10 INFO CommandRun/running: bash /tmp/orchestrator-process-cmd-358000144
Mar 03 14:25:10 mysql3 orchestrator[25032]: 2018-03-03 14:25:10 INFO CommandRun successful. exit status 0
Mar 03 14:25:10 mysql3 orchestrator[25032]: 2018-03-03 14:25:10 INFO topology_recovery: Completed OnFailureDetectionProcesses hook 1 of 1 in 4.556463ms
Mar 03 14:25:10 mysql3 orchestrator[25032]: 2018-03-03 14:25:10 DEBUG orchestrator/raft: applying command 2057: write-recovery-step
Mar 03 14:25:10 mysql3 orchestrator[25032]: 2018-03-03 14:25:10 INFO Completed OnFailureDetectionProcesses hook 1 of 1 in 4.556463ms
Mar 03 14:25:10 mysql3 orchestrator[25032]: 2018-03-03 14:25:10 INFO topology_recovery: done running OnFailureDetectionProcesses hooks
Mar 03 14:25:10 mysql3 orchestrator[25032]: 2018-03-03 14:25:10 DEBUG orchestrator/raft: applying command 2058: write-recovery-step
Mar 03 14:25:10 mysql3 orchestrator[25032]: 2018-03-03 14:25:10 DEBUG orchestrator/raft: applying command 2059: register-failure-detection
Mar 03 14:25:10 mysql3 orchestrator[25032]: 2018-03-03 14:25:10 INFO executeCheckAndRecoverFunction: proceeding with DeadMaster recovery on mysql1:3306; isRecoverable?: true; skipProcesses: false
Mar 03 14:25:10 mysql3 orchestrator[25032]: 2018-03-03 14:25:10 DEBUG orchestrator/raft: applying command 2060: write-recovery
Mar 03 14:25:10 mysql3 orchestrator[25032]: 2018-03-03 14:25:10 INFO topology_recovery: will handle DeadMaster event on mysql1:3306
Mar 03 14:25:10 mysql3 orchestrator[25032]: 2018-03-03 14:25:10 DEBUG orchestrator/raft: applying command 2061: write-recovery-step
Mar 03 14:25:10 mysql3 orchestrator[25032]: 2018-03-03 14:25:10 INFO auditType:recover-dead-master instance:mysql1:3306 cluster:mysql1:3306 message:problem found; will recover
Mar 03 14:25:10 mysql3 orchestrator[25032]: 2018-03-03 14:25:10 INFO topology_recovery: Running 1 PreFailoverProcesses hooks

For the ProxySQL use case, this returns errors to the application as soon as the host is set in read_only mode.
I would like to use the proposed hook to have ProxySQL set the old master to offline_soft and give active connections a chance to finish their work, to minimize the errors returned to the application.

@shlomi-noach
Collaborator

This makes sense. Assigning to myself.

@shlomi-noach shlomi-noach self-assigned this Mar 6, 2018
@igroene
Author

igroene commented Mar 16, 2018

Hi @shlomi-noach just wondering if you have any plans to implement this in the near future? I understand if there are other priorities :)
Thank you!

@shlomi-noach
Collaborator

@igroene I haven't prioritized this yet. Let me look into it.

@shlomi-noach
Collaborator

shlomi-noach commented Mar 17, 2018

@igroene this came up, for which a PR is ready and will shortly be merged. EDIT: merged.

It takes a different approach, but I think solves your case, too. You'll get a command value injected in your hooks. The value would be graceful-master-takeover upon a graceful takeover action, or what have you if otherwise.

This way we can avoid specialized hooks. You will read the value of the command variable and make your own choices.

What do you think?
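
As a rough sketch of what reading that value in a hook script might look like, the snippet below branches on the command. It assumes the value reaches the hook as an `ORC_COMMAND` environment variable; the `hook_action` function and its `drain`/`skip` labels are made-up names for illustration only.

```shell
#!/bin/bash
# Sketch: choose a hook behaviour from the command that triggered the failover.
# Assumption: orchestrator exposes the value as the ORC_COMMAND environment variable.
# hook_action and its "drain"/"skip" outputs are illustrative names only.
hook_action() {
  case "${1:-}" in
    graceful-master-takeover) echo "drain" ;;  # planned switch: safe to drain first
    *)                        echo "skip"  ;;  # anything else: nothing special to do
  esac
}

hook_action "${ORC_COMMAND:-}"
```

Note that this only tells the hook why it was called; it does not change when the hook is called.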

@igroene
Author

igroene commented Mar 19, 2018

Hi @shlomi-noach unfortunately I think this does not solve the case I presented above.
The issue I encountered is not the lack of info about the master change operation, but the order in which operations are performed:

  1. issue graceful master switch
  2. orchestrator sets old master as read only
  3. PreFailoverProcesses hook is triggered
    ...

I would suggest either changing the order as follows:

  1. issue graceful master switch
  2. PreFailoverProcesses hook is triggered
  3. orchestrator sets old master as read only
    ...

or (probably better):

  1. issue graceful master switch
  2. New hook PreGracefulSwitchProcesses is triggered
  3. orchestrator sets old master as read only
  4. PreFailoverProcesses hook is triggered
    ...

Thank you

@shlomi-noach
Collaborator

Ah, I see your point. Let me look into both options.

@Slach
Contributor

Slach commented Mar 26, 2018

@shlomi-noach yes, PreGracefulSwitchProcesses would be a useful feature
for switching ProxySQL to the correct active writer

@shlomi-noach
Collaborator

I hope to propose a PR this week.

@shlomi-noach
Collaborator

@igroene would you like to experiment with #443 ?
It is not final, and has multiple improvements for graceful-takeover, including removing the "single replica" constraints.

You will find PreGracefulTakeoverProcesses in config.

@igroene
Author

igroene commented Mar 28, 2018

@shlomi-noach the hook worked perfectly! thank you for that. I am able to run sysbench and do a graceful switch without any errors :)

[ 2s ] thds: 4 tps: 57.86 qps: 1153.21 (r/w/o: 809.04/222.46/121.71) lat (ms,95%): 102.97 err/s: 0.00 reconn/s: 0.00
[ 3s ] thds: 4 tps: 69.17 qps: 1389.32 (r/w/o: 973.33/272.65/143.34) lat (ms,95%): 87.56 err/s: 0.00 reconn/s: 0.00
[ 4s ] thds: 4 tps: 46.97 qps: 957.41 (r/w/o: 670.59/191.88/94.94) lat (ms,95%): 282.25 err/s: 0.00 reconn/s: 0.00
[ 5s ] thds: 4 tps: 30.94 qps: 613.71 (r/w/o: 428.10/122.74/62.87) lat (ms,95%): 223.34 err/s: 0.00 reconn/s: 0.00
[ 6s ] thds: 4 tps: 51.12 qps: 995.30 (r/w/o: 693.60/199.46/102.24) lat (ms,95%): 125.52 err/s: 0.00 reconn/s: 0.00
[ 7s ] thds: 4 tps: 45.01 qps: 916.18 (r/w/o: 643.13/182.04/91.02) lat (ms,95%): 132.49 err/s: 0.00 reconn/s: 0.00
[ 8s ] thds: 4 tps: 47.94 qps: 948.79 (r/w/o: 664.15/187.76/96.88) lat (ms,95%): 193.38 err/s: 0.00 reconn/s: 0.00
[ 9s ] thds: 4 tps: 56.08 qps: 1127.53 (r/w/o: 790.07/224.30/113.15) lat (ms,95%): 123.28 err/s: 0.00 reconn/s: 0.00
[ 10s ] thds: 4 tps: 49.85 qps: 986.04 (r/w/o: 686.94/197.41/101.70) lat (ms,95%): 155.80 err/s: 0.00 reconn/s: 0.00
[ 11s ] thds: 4 tps: 54.05 qps: 1113.12 (r/w/o: 783.79/219.22/110.11) lat (ms,95%): 139.85 err/s: 0.00 reconn/s: 0.00
[ 12s ] thds: 4 tps: 56.14 qps: 1128.74 (r/w/o: 793.93/220.54/114.28) lat (ms,95%): 118.92 err/s: 0.00 reconn/s: 0.00
[ 13s ] thds: 4 tps: 50.88 qps: 996.59 (r/w/o: 691.33/201.51/103.75) lat (ms,95%): 155.80 err/s: 0.00 reconn/s: 0.00
[ 14s ] thds: 4 tps: 30.05 qps: 601.95 (r/w/o: 425.67/116.18/60.09) lat (ms,95%): 292.60 err/s: 0.00 reconn/s: 0.00
[ 15s ] thds: 4 tps: 61.83 qps: 1236.68 (r/w/o: 861.69/251.33/123.67) lat (ms,95%): 123.28 err/s: 0.00 reconn/s: 0.00
[ 16s ] thds: 4 tps: 46.10 qps: 930.02 (r/w/o: 653.42/183.40/93.20) lat (ms,95%): 134.90 err/s: 0.00 reconn/s: 0.00
[ 17s ] thds: 4 tps: 42.01 qps: 824.21 (r/w/o: 576.15/162.04/86.02) lat (ms,95%): 189.93 err/s: 0.00 reconn/s: 0.00
[ 18s ] thds: 4 tps: 7.98 qps: 180.59 (r/w/o: 132.70/31.93/15.96) lat (ms,95%): 139.85 err/s: 0.00 reconn/s: 0.00
[ 19s ] thds: 4 tps: 62.20 qps: 1238.90 (r/w/o: 864.72/246.78/127.40) lat (ms,95%): 846.57 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 4 tps: 97.01 qps: 1925.12 (r/w/o: 1341.09/387.02/197.01) lat (ms,95%): 53.85 err/s: 0.00 reconn/s: 0.00
[ 21s ] thds: 4 tps: 104.90 qps: 2110.96 (r/w/o: 1479.57/414.60/216.79) lat (ms,95%): 55.82 err/s: 0.00 reconn/s: 0.00

I also took the liberty of testing the other part of the PR:

graceful-master-takeover takes a more permissive approach and now allows failing over when the master has multiple replicas, given that:
The user specifies the particular designated replica they want to fail over to
orchestrator is able to relocate all other replicas below the designated replica.

That is not working for me via GUI by drag-dropping one of the 2 existing slaves to the left of the master in a simple 3 node topology (1 master, 2 direct slaves). I get this message:

GracefulMasterTakeover: when no target instance indicated, master mysql1:3306 should only have one replica (making the takeover safe and simple), but has 2. Aborting

@Slach
Contributor

Slach commented Mar 28, 2018

@igroene could you share your orchestrator.conf.json with the graceful ProxySQL update commands?

@igroene
Author

igroene commented Mar 28, 2018

Here is orchestrator.conf.json I am using for testing:

{
  "Debug": true,
  "EnableSyslog": false,
  "ListenAddress": ":3000",
  "MySQLTopologyUser": "orchestrator",
  "MySQLTopologyPassword": "****",
  "MySQLTopologyCredentialsConfigFile": "",
  "MySQLTopologySSLPrivateKeyFile": "",
  "MySQLTopologySSLCertFile": "",
  "MySQLTopologySSLCAFile": "",
  "MySQLTopologySSLSkipVerify": true,
  "MySQLTopologyUseMutualTLS": false,
  "MySQLOrchestratorHost": "127.0.0.1",
  "MySQLOrchestratorPort": 3306,
  "MySQLOrchestratorDatabase": "orchestrator",
  "MySQLOrchestratorUser": "orc_server_user",
  "MySQLOrchestratorPassword": "****",
  "MySQLOrchestratorCredentialsConfigFile": "",
  "MySQLOrchestratorSSLPrivateKeyFile": "",
  "MySQLOrchestratorSSLCertFile": "",
  "MySQLOrchestratorSSLCAFile": "",
  "MySQLOrchestratorSSLSkipVerify": true,
  "MySQLOrchestratorUseMutualTLS": false,
  "MySQLConnectTimeoutSeconds": 1,
  "DefaultInstancePort": 3306,
  "DiscoverByShowSlaveHosts": true,
  "InstancePollSeconds": 5,
  "UnseenInstanceForgetHours": 240,
  "SnapshotTopologiesIntervalHours": 0,
  "InstanceBulkOperationsWaitTimeoutSeconds": 10,
  "HostnameResolveMethod": "default",
  "MySQLHostnameResolveMethod": "@@hostname",
  "SkipBinlogServerUnresolveCheck": true,
  "ExpiryHostnameResolvesMinutes": 60,
  "RejectHostnameResolvePattern": "",
  "ReasonableReplicationLagSeconds": 10,
  "ProblemIgnoreHostnameFilters": [],
  "VerifyReplicationFilters": false,
  "ReasonableMaintenanceReplicationLagSeconds": 20,
  "CandidateInstanceExpireMinutes": 60,
  "AuditLogFile": "",
  "AuditToSyslog": false,
  "RemoveTextFromHostnameDisplay": ".mydomain.com:3306",
  "ReadOnly": false,
  "AuthenticationMethod": "",
  "HTTPAuthUser": "",
  "HTTPAuthPassword": "",
  "AuthUserHeader": "",
  "PowerAuthUsers": [
    "*"
  ],
  "ClusterNameToAlias": {
    "127.0.0.1": "test suite"
  },
  "SlaveLagQuery": "",
  "DetectClusterAliasQuery": "SELECT ifnull(max(cluster_name), '') as cluster_alias from meta.cluster where anchor=1;",
  "DetectClusterDomainQuery": "",
  "DetectInstanceAliasQuery": "",
  "DetectPromotionRuleQuery": "",
  "DataCenterPattern": "[.]([^.]+)[.][^.]+[.]mydomain[.]com",
  "PhysicalEnvironmentPattern": "[.]([^.]+[.][^.]+)[.]mydomain[.]com",
  "PromotionIgnoreHostnameFilters": [],
  "DetectSemiSyncEnforcedQuery": "",
  "ServeAgentsHttp": false,
  "AgentsServerPort": ":3001",
  "AgentsUseSSL": false,
  "AgentsUseMutualTLS": false,
  "AgentSSLSkipVerify": false,
  "AgentSSLPrivateKeyFile": "",
  "AgentSSLCertFile": "",
  "AgentSSLCAFile": "",
  "AgentSSLValidOUs": [],
  "UseSSL": false,
  "UseMutualTLS": false,
  "SSLSkipVerify": false,
  "SSLPrivateKeyFile": "",
  "SSLCertFile": "",
  "SSLCAFile": "",
  "SSLValidOUs": [],
  "URLPrefix": "",
  "StatusEndpoint": "/api/status",
  "StatusSimpleHealth": true,
  "StatusOUVerify": false,
  "AgentPollMinutes": 60,
  "UnseenAgentForgetHours": 6,
  "StaleSeedFailMinutes": 60,
  "SeedAcceptableBytesDiff": 8192,
  "PseudoGTIDPattern": "",
  "PseudoGTIDPatternIsFixedSubstring": false,
  "PseudoGTIDMonotonicHint": "asc:",
  "DetectPseudoGTIDQuery": "",
  "BinlogEventsChunkSize": 10000,
  "SkipBinlogEventsContaining": [],
  "ReduceReplicationAnalysisCount": true,
  "FailureDetectionPeriodBlockMinutes": 60,
  "RecoveryPollSeconds": 10,
  "RecoveryPeriodBlockSeconds": 3600,
  "RecoveryIgnoreHostnameFilters": [],
  "RecoverMasterClusterFilters": [
    ".*"
  ],
  "RecoverIntermediateMasterClusterFilters": [
    "_intermediate_master_pattern_"
  ],
  "OnFailureDetectionProcesses": [
    "echo 'Detected {failureType} on {failureCluster}. Affected replicas: {countSlaves}' >> /tmp/recovery.log"
  ],
  "PreGracefulTakeoverProcesses": [
    "/root/prefailover.sh"
  ],
  "PreFailoverProcesses": [
    ""
  ],
  "PostFailoverProcesses": [
    "echo '(for all types) Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}' >> /tmp/recovery.log"
  ],
  "PostUnsuccessfulFailoverProcesses": [],
  "PostMasterFailoverProcesses": [
    "echo 'Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Promoted: {successorHost}:{successorPort}' >> /tmp/recovery.log"
  ],
  "PostIntermediateMasterFailoverProcesses": [
    "echo 'Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}' >> /tmp/recovery.log"
  ],
  "CoMasterRecoveryMustPromoteOtherCoMaster": true,
  "DetachLostSlavesAfterMasterFailover": true,
  "ApplyMySQLPromotionAfterMasterFailover": true,
  "MasterFailoverDetachSlaveMasterHost": false,
  "MasterFailoverLostInstancesDowntimeMinutes": 0,
  "PostponeSlaveRecoveryOnLagMinutes": 0,
  "OSCIgnoreHostnameFilters": [],
  "GraphiteAddr": "",
  "GraphitePath": "",
  "GraphiteConvertHostnameDotsToUnderscores": true,
  "BackendDB": "sqlite",
  "SQLite3DataFile": "/var/lib/orchestrator/orchestrator.db",
  "RaftEnabled": false,
  "RaftDatadir": "/var/lib/orchestrator",
  "RaftBind": "192.168.56.100",
  "DefaultRaftPort": 10008,
  "RaftNodes": [
    "192.168.56.100",
    "192.168.56.101",
    "192.168.56.102"
  ]
}

and here is the graceful switch hook:

#!/bin/bash
# PreGracefulTakeoverProcesses hook: drain the old master in ProxySQL
# before orchestrator sets it read_only.

OldMaster=$ORC_FAILED_HOST

# Mark the old master OFFLINE_SOFT so ProxySQL sends it no new traffic
# while connections already in use are allowed to finish.
(
echo 'UPDATE mysql_servers SET STATUS="OFFLINE_SOFT" WHERE hostname="'"$OldMaster"'";'
echo "LOAD MYSQL SERVERS TO RUNTIME;"
) | mysql -vvv -uivan -pivan -h mysql3 -P6032

# Print the number of connections ProxySQL still has in use on the old master.
conn_used() {
  mysql -uivan -pivan -h mysql3 -P6032 -B -N -e 'SELECT IFNULL(SUM(ConnUsed),0) FROM stats_mysql_connection_pool WHERE status="OFFLINE_SOFT" AND srv_host="'"$OldMaster"'"' 2> /dev/null
}

# Poll until the connections drain, giving up after 20 tries 50 ms apart.
CONNUSED=$(conn_used)
TRIES=0
while [ "${CONNUSED:-0}" -ne 0 ] && [ "$TRIES" -ne 20 ]
do
  sleep 0.05
  TRIES=$((TRIES+1))
  CONNUSED=$(conn_used)
done
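
The bounded wait in this hook could also be factored into a small reusable helper, which makes the retry count and pause easy to tune. The following is only a sketch: `wait_for_zero`, its arguments, and the stand-in probe are hypothetical names, and the probe fakes the connection count instead of querying ProxySQL.

```shell
#!/bin/bash
# Sketch: wait until a probe command reports 0, with a bounded number of retries.
# wait_for_zero, its arguments and fake_probe are illustrative names only.
# usage: wait_for_zero <max_retries> <sleep_seconds> <probe command...>
wait_for_zero() {
  local max_retries=$1 pause=$2 tries=0 value
  shift 2
  while :; do
    value=$("$@" 2>/dev/null)
    # An empty result (the probe failed) is treated as drained.
    [ "${value:-0}" -eq 0 ] && return 0
    [ "$tries" -ge "$max_retries" ] && return 1
    tries=$((tries + 1))
    sleep "$pause"
  done
}

# Stand-in probe: reports 2, then 1, then 0 on successive calls.
# A real hook would query stats_mysql_connection_pool here instead.
COUNTER_FILE=$(mktemp)
echo 2 > "$COUNTER_FILE"
fake_probe() {
  local n
  n=$(cat "$COUNTER_FILE")
  echo "$n"
  [ "$n" -gt 0 ] && echo $((n - 1)) > "$COUNTER_FILE"
  return 0
}

if wait_for_zero 20 0.05 fake_probe; then
  echo "drained"
else
  echo "timed out"
fi
rm -f "$COUNTER_FILE"
```

Returning a non-zero status on timeout lets the caller decide whether to proceed with the takeover anyway or abort.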

@shlomi-noach
Collaborator

@igroene

> the hook worked perfectly!

Great!

> I also took the liberty of testing the other part of the PR:

Thank you for testing!

I think you might need a refresh to reload your JavaScript which is likely cached. But I'll double check.

@igroene
Author

igroene commented Mar 28, 2018

I tried the refresh but still getting the same error

@Slach
Contributor

Slach commented Mar 28, 2018

@igroene how do you switch to the new master?
Why does your config have an empty PostFailoverProcesses parameter?

@igroene
Author

igroene commented Mar 28, 2018

I am switching by drag-dropping via GUI. PostFailoverProcesses is empty because I don't need a hook for that for this test. ProxySQL will detect the change in read-only that Orchestrator does and move hosts around hostgroups as needed.

@Slach
Contributor

Slach commented Mar 28, 2018

@igroene

> ProxySQL will detect the change in read-only that Orchestrator does and move hosts around hostgroups as needed.

Will that be done via the ProxySQL scheduler, or via some other functionality?

@shlomi-noach
Collaborator

Very good. The rest of the features are still work in progress. It will take a while to merge the branch.

@igroene
Author

igroene commented Mar 28, 2018

Thanks @shlomi-noach !
@Slach this is leveraging the mysql_replication_hostgroups table with writer/reader hostgroups. ProxySQL switches hosts around based on the read_only value.
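
For a test playground like this one, the pairing could be set up roughly as follows. This is a sketch under assumptions: the hostgroup IDs 10 and 20 and the `replication_hostgroup_sql` helper are made up, and the statements target ProxySQL's `mysql_replication_hostgroups` admin table. With such a pair defined, ProxySQL's monitor places a server in the writer hostgroup when it reports read_only=0 and in the reader hostgroup when it reports read_only=1.

```shell
#!/bin/bash
# Sketch: print the ProxySQL admin statements that pair a writer hostgroup
# with a reader hostgroup. Hostgroup IDs 10 and 20 are made-up example values.
replication_hostgroup_sql() {
  local writer=$1 reader=$2
  cat <<SQL
INSERT INTO mysql_replication_hostgroups (writer_hostgroup, reader_hostgroup, comment) VALUES ($writer, $reader, 'test cluster');
LOAD MYSQL SERVERS TO RUNTIME;
SQL
}

replication_hostgroup_sql 10 20
# To apply it, pipe the output to the ProxySQL admin interface, for example:
#   replication_hostgroup_sql 10 20 | mysql -uivan -pivan -h mysql3 -P6032
```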

@shlomi-noach
Collaborator

> ProxySQL switches hosts around based on read_only value.

@igroene I'd like to suggest this isn't good practice. See my comment on https://mydbops.wordpress.com/2018/03/15/proxysql-series-seamless-replication-switchover-using-mha/, but I will write a more elaborate blog post.

@igroene
Author

igroene commented Mar 28, 2018

Thanks for the warning, I agree with your comment. Just to clarify this is just a testing playground, I wouldn't use this for a prod deployment.

@shlomi-noach
Collaborator

@igroene I still think this is a JavaScript issue. Can you please check the following? Source cluster.js from your browser, and look for /api/graceful-master-takeover/.

Does the entire line read:

        apiCommand("/api/graceful-master-takeover/" + existingMasterNode.Key.Hostname + "/" + existingMasterNode.Key.Port + "/" + newMasterNode.Key.Hostname + "/" + newMasterNode.Key.Port);

or

        apiCommand("/api/graceful-master-takeover/" + existingMasterNode.Key.Hostname + "/" + existingMasterNode.Key.Port);

?
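
To rule the browser out, the same two request shapes can be built and sent from the command line. The helper below is only a sketch: `takeover_url` and the `http://localhost:3000` base are illustrative (port 3000 matches the ListenAddress in the config posted earlier in this thread).

```shell
#!/bin/bash
# Sketch: build the graceful-master-takeover API URL in both forms.
# takeover_url and the base URL are illustrative assumptions.
# usage: takeover_url <base> <master_host> <master_port> [<target_host> <target_port>]
takeover_url() {
  local base=$1 url
  url="$base/api/graceful-master-takeover/$2/$3"
  # Append the designated replica only when both host and port are given.
  if [ -n "${4:-}" ] && [ -n "${5:-}" ]; then
    url="$url/$4/$5"
  fi
  echo "$url"
}

takeover_url http://localhost:3000 mysql1 3306              # no target given
takeover_url http://localhost:3000 mysql1 3306 mysql2 3306  # explicit target
# To issue the request:
#   curl -s "$(takeover_url http://localhost:3000 mysql1 3306 mysql2 3306)"
```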

@igroene
Author

igroene commented Mar 29, 2018

You are right, the version being displayed by the browser reads as the first example, although I can't find the reason for that. I tried deleting the browser cache and 2 other browsers, and still get the same. I double-checked that the build I compiled indeed has the correct version, so I am clueless at this point. Will keep investigating.

@shlomi-noach
Collaborator

The first example is the desired one, actually 😛

@igroene
Author

igroene commented Mar 29, 2018

hahaha you are right. I figured out my issue... I had compiled the branch with the changes but only replaced the binary and not the resources dir on /usr/local. Sorry about that.
Getting a different error now:
Any ideas?
EDIT: here is the full error msg

GracefulMasterTakeover: desginated instance mysql2:3306 cannot take over all of its siblings. Error: 2018-03-29 12:36:16 ERROR Relocating 1 replicas of mysql3:3306 below mysql2:3306 turns to be too complex; please do it manually

@shlomi-noach
Collaborator

No ideas yet. Seems like you're running GTID or pseudo GTID and that it should work.

@shlomi-noach
Collaborator

What happens if you relocate "mysql1" under "mysql2"?

@igroene
Author

igroene commented Apr 4, 2018

Sorry about the delay, I was out for a couple of days. I am indeed using GTID, and if I move mysql1 under mysql2, then try to promote mysql2 it works as expected.
Nothing useful in the logs unfortunately for the case that fails:

Apr 04 11:30:35 mysql1 orchestrator[4453]: [martini] Started GET /api/graceful-master-takeover/mysql1/3306/mysql2/3306 for 192.168.56.1:50935
Apr 04 11:30:35 mysql1 orchestrator[4453]: 2018-04-04 11:30:35 INFO GracefulMasterTakeover: Will let mysql2:3306 take over its siblings
Apr 04 11:30:35 mysql1 orchestrator[4453]: 2018-04-04 11:30:35 INFO Will move 1 replicas below mysql2:3306 via GTID
Apr 04 11:30:35 mysql1 orchestrator[4453]: 2018-04-04 11:30:35 ERROR Relocating 1 replicas of mysql1:3306 below mysql2:3306 turns to be too complex; please do it manually
Apr 04 11:30:35 mysql1 orchestrator[4453]: [martini] Completed 500 Internal Server Error in 22.606937ms

I did a few more tests and I seem to "randomly" get the "too complex" message with one of the slaves. If I then try to promote the other slave it works.

@shlomi-noach
Collaborator

Can you please verify that Auto_position is 1 in show slave status at all times? I suspect perhaps it may be 0, in which case orchestrator cannot actually utilize GTID for failover.
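
One way to run that check from a shell is to pull the field out of the `\G` output. This is a minimal sketch: `auto_position` is an illustrative helper name, and the sample input below stands in for real `mysql` output.

```shell
#!/bin/bash
# Sketch: extract Auto_Position from "SHOW SLAVE STATUS\G" output on stdin.
# auto_position is an illustrative helper name.
auto_position() {
  awk '$1 == "Auto_Position:" { print $2 }'
}

# Sample input in the \G layout; a real check would pipe mysql output instead:
#   mysql -h mysql2 -e 'SHOW SLAVE STATUS\G' | auto_position
printf '%s\n' \
  '             Slave_IO_Running: Yes' \
  '                Auto_Position: 1' | auto_position
```

Running it on each replica before and after a takeover attempt would show whether the value ever drops to 0.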
