This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

GTID not found properly (5.7) and some graceful-master-takeover issues #78

Closed
ecortestws opened this issue Feb 14, 2017 · 30 comments

@ecortestws

ecortestws commented Feb 14, 2017

Hi,

I am testing orchestrator with MySQL 5.7.17: a master and two slaves. I moved one of the slaves to change the topology to A-B-C and then executed orchestrator -c graceful-master-takeover -alias myclusteralias

The issues found are:

  1. GTID appears as disabled on the master: the web interface shows the button to enable it, even though it is obviously enabled across the whole replication chain (GTID_MODE=ON). The slaves are shown with GTID enabled.
  2. This issue causes the takeover not to use GTID (I guess).
  3. Instance B was read-only before the takeover, and after the takeover read-only is not disabled. Is this a feature, or something I should add via hooks? It would be nice to have a parameter to end the process in whichever state you prefer, depending on the takeover reasons/conditions.
  4. Also, for some reason the role change old master -> new slave doesn't work. It executes a CHANGE MASTER, but apparently the replication username on the old master is empty, so the CHANGE MASTER operation fails (the orchestrator user has SELECT ON mysql.slave_master_info across the cluster).
  5. Finally, it would be nice to add a feature to force a topology refactoring when you have one master with several slaves below it. It requires moving the slaves below the newly elected master just before the master takeover. The process would take a bit longer, moving the slaves and waiting until they are ready.

Thanks for this amazing tool!
Regards,
Eduardo

@shlomi-noach
Collaborator

  • GTID appears as disabled on the master: the web interface shows the button to enable it, even though it is obviously enabled across the whole replication chain (GTID_MODE=ON)

Can you please issue:

select @@global.gtid_mode, @@global.gtid_purged

on your master?

  • This issue causes the takeover not to use GTID (I guess)

That makes sense. We need to solve the GTID recognition problem.

  • Instance B was read-only before the takeover, and after the takeover read-only is not disabled. Is this a feature, or something I should add via hooks? It would be nice to have a parameter to end the process in whichever state you prefer, depending on the takeover reasons/conditions.

By default orchestrator does not RESET SLAVE ALL and does not SET GLOBAL read_only=0. To do both, set ApplyMySQLPromotionAfterMasterFailover: true

I apologize for the inconvenient name. I'll be working to minimize the number of configuration params.
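For reference, a minimal sketch of how that flag might look in orchestrator's JSON configuration file (assuming the usual orchestrator.conf.json; only the parameter named above is shown):

```json
{
  "ApplyMySQLPromotionAfterMasterFailover": true
}
```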

  • Also, for some reason the role change old master -> new slave doesn't work. It executes a CHANGE MASTER, but apparently the replication username on the old master is empty, so the CHANGE MASTER operation fails (the orchestrator user has SELECT ON mysql.slave_master_info across the cluster).

As per #57:

What's missing in this story is the MASTER_USER and MASTER_PASSWORD, which are likely not to exist, because the old master effectively had no replication info.
So that leads to the case where, even after positioning, the old master can't truly replicate from the promoted master. Nonetheless, it is placed at the correct position to assume replication once credential settings are applied.

The problem is orchestrator doesn't have the username & password of your replication user.

  • Finally, it would be nice to add a feature to force a topology refactoring when you have one master with several slaves below it. It requires moving the slaves below the newly elected master just before the master takeover. The process would take a bit longer, moving the slaves and waiting until they are ready.

This can be easily scripted on the user's side. I really think that for a planned takeover the user should choose the identity of the new master. If orchestrator were to choose the identity -- fine, but there is no promise that everything would work. Perhaps your setup is such that the promoted server would not be the one you'd expect.
You may find such a statement confusing. Your own setup may be simple enough, but there are various setups that are not as simple to deal with: servers without log-slave-updates (which can happen with 5.7 GTID), a mixture of 5.6 and 5.7, etc.
Some servers may not be able to grab the VIP the current master has, or they may sit in an unreliable physical location. Please understand that orchestrator has "seen it all", and much of its behavior is crafted from experiencing non-trivial scenarios.

To this end, when things go bad, orchestrator is very smart about making the best of the situation. But for planned failovers, it would very much like you to set up your topology in a way that makes sense to you and guarantees the survival of all servers you care about.

@ecortestws
Author

  • GTID data from the master:
mysql>  select @@global.gtid_mode, @@global.gtid_purged\G
*************************** 1. row ***************************
@@global.gtid_mode: ON
@@global.gtid_purged: 17255cd9-b2f6-11e6-b59d-005056946d8b:1-15546,
604d9088-a5c6-11e6-8f72-005056945836:1-8796013,
9cb4118b-a5c6-11e6-96c0-005056945189:1-37645
1 row in set (0.00 sec)

  • There are so many options that I didn't see that one. Will check it, thanks!

  • About the username/password issue: orchestrator could read the username/password from the new master just before the takeover and use them when demoting the old master. Otherwise, some kind of warning like "no credentials found" would be useful.

  • You are right about the planned takeover; it's just that, reading the documentation, I understood that orchestrator could require several steps to reach the target state. This would be that scenario, and the only requirement could be that you specify the new master rather than letting orchestrator select it. As you said, it can be done on the user's side :-)

@shlomi-noach
Collaborator

GTID

looking into!

About the username/password issue: orchestrator could read the username/password from the new master just before the takeover

It cannot. You cannot reveal the password by SHOW SLAVE STATUS. There is a potential solution (utilized by orchestrator) in the event you use system tables for master-info.
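To check whether that solution applies, one can verify that master info is stored in system tables (a sketch; master_info_repository is the standard MySQL 5.6/5.7 variable):

```sql
-- Credentials are readable from mysql.slave_master_info only when
-- master info is kept in system tables rather than in a file:
SELECT @@global.master_info_repository;
-- 'TABLE' -> credentials are in mysql.slave_master_info
-- 'FILE'  -> credentials live only in the master.info file on disk
```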

it's just that, reading the documentation, I understood that orchestrator could require several steps to reach the target state

More than anything, I'd appreciate help with documentation!

@shlomi-noach shlomi-noach self-assigned this Feb 14, 2017
@sjmudd
Collaborator

sjmudd commented Feb 14, 2017

@shlomi-noach: slave username and password information are available via mysql.slave_master_info:

root@somehost [mysql]> select Host, Port,  User_name, User_password from slave_master_info;
+-----------------------+------+-----------+---------------+
| Host                  | Port | User_name | User_password |
+-----------------------+------+-----------+---------------+
| somehost.mydomain.com | 3306 | some_user | some_password |
+-----------------------+------+-----------+---------------+
1 row in set (0.00 sec)

So in theory you could try these credentials. However, depending on the existing grants this may or may not work as expected, as the grant may be for 'some_user'@'%' (any address), 'some_user'@'192.168.9.10' (a specific address), or any combination in between, some of which may work and others may not.

It may be worth having an option to try or check the configuration but specific site configs may vary.
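To illustrate the grant-matching caveat, one could inspect which host patterns the replication user is defined for (a sketch reusing the some_user placeholder from above):

```sql
-- Which host patterns exist for the replication user?
SELECT user, host FROM mysql.user WHERE user = 'some_user';
-- Privileges for one specific pattern:
SHOW GRANTS FOR 'some_user'@'%';
```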

For what it's worth, for planned topology changes (of the master) I don't use orchestrator but custom scripts. This gives a bit more control and reduces downtime, though orchestrator is nearly always used manually both before and afterwards to arrange the topology as needed to minimise the impact of the master changeover. I guess I could use orchestrator, as most of what's described here is what I already do, but having more freedom to check things both before and afterwards makes me feel more comfortable. Maybe I need to look again at how well orchestrator handles this task, as reducing the amount of software used simplifies things.

@ecortestws
Author

That's right, the orchestrator user has SELECT permission on slave_master_info and the information is there; it just needs to be read and used. If you expect orchestrator to execute this task cleanly, the user should ensure that the replication user has permissions on all the nodes involved.

@shlomi-noach
Collaborator

Reading credentials from slave_master_info is already implemented for make-co-master, so it should be easy to apply to graceful-takeover:

https://github.com/github/orchestrator/blob/55cedffe8da1163df6d6d2374207cae97ae375fe/go/inst/instance_topology.go#L900-L911

@fuyar

fuyar commented Feb 28, 2017

Hello, just wanted to report the same issue: GTID-based replication (Percona Server 5.7.16) is not recognized on a simple master + 4 slaves topology.

On the master:

+--------------------+----------------------+
| @@global.gtid_mode | @@global.gtid_purged |
+--------------------+----------------------+
| ON                 |                      |
+--------------------+----------------------+
1 row in set (0.00 sec)

@shlomi-noach
Collaborator

@fuyar thank you!

@fuyar

fuyar commented Mar 1, 2017

no problem @shlomi-noach :)

Seems like orchestrator was finally able to detect GTID on the 4 slaves (I rechecked this morning, having changed nothing in the meantime).

oracle_gtid: 0 still shows for the master in the database_instance table, but since the master is not a slave of anyone it should be OK, I suppose?

@shlomi-noach
Collaborator

shlomi-noach commented Mar 1, 2017

but since the master is not a slave of anyone it should be OK, I suppose?

That's the very bug: because the master is not identified as GTID-enabled, orchestrator doesn't run a GTID-based failover.

@shlomi-noach
Collaborator

OK, have taken a closer look into GTID recoveries:

  • The fact that the master does not show as oracle_gtid in database_instance, or that it shows GTID based replication: false, is irrelevant to the failover mechanism.
  • The failover looks at the set of replicas and determines whether they're using GTID; a GTID-based failover takes place when there is at least one GTID replica and all valid replicas (valid == responsive) are GTID replicas. In other words, if a single valid replica is not GTID-based, then the failover is not GTID-based. This shouldn't happen in reality, but it was added as a safety mechanism in the unlikely event that GTID->non-GTID replication is made possible in the far future.
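The rule in the second bullet can be sketched as a small Go function (orchestrator itself is written in Go; the Replica type and field names below are illustrative assumptions, not orchestrator's actual code):

```go
package main

import "fmt"

// Replica captures the two attributes the failover decision described
// above cares about; these names are assumptions for illustration only.
type Replica struct {
	IsValid  bool // responsive at failover time
	UsesGTID bool // replicating via GTID (auto_position)
}

// useGTIDFailover implements the rule: GTID-based failover requires at
// least one GTID replica, and every valid replica must be a GTID replica.
func useGTIDFailover(replicas []Replica) bool {
	anyGTID := false
	for _, r := range replicas {
		if r.UsesGTID {
			anyGTID = true
		}
		if r.IsValid && !r.UsesGTID {
			return false
		}
	}
	return anyGTID
}

func main() {
	allGTID := []Replica{{IsValid: true, UsesGTID: true}, {IsValid: true, UsesGTID: true}}
	mixed := []Replica{{IsValid: true, UsesGTID: true}, {IsValid: true, UsesGTID: false}}
	fmt.Println(useGTIDFailover(allGTID), useGTIDFailover(mixed)) // prints: true false
}
```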

@shlomi-noach
Collaborator

@ecortestws my last comment suggests that:

This issue causes the takeover not to use GTID (I guess)

is wrong. Are you able to show that the recovery was not GTID-based? I do mean it's a completely valid assumption on your side, but I believe it is incorrect. The logs actually specify the type of recovery. Look for:

topology_recovery: RecoverDeadMaster: masterRecoveryType=...

I realize that was 15 days ago and you may not have the logs at this time.

@ecortestws
Author

Hi @shlomi-noach:
from the logs:
2017-02-14 08:15:34 DEBUG topology_recovery: RecoverDeadMaster: masterRecoveryType=MasterRecoveryPseudoGTID

@shlomi-noach
Collaborator

@ecortestws thank you. Then, indeed, orchestrator didn't recognize this to be a GTID recovery.

@shlomi-noach
Collaborator

Applying replication-credentials on demoted master is addressed by #93

@ecortestws
Author

Hi @shlomi-noach,
any progress on GTID issue?
Thanks
Eduardo

@shlomi-noach
Collaborator

@ecortestws Perfect timing. I am setting up an environment for this now.

@shlomi-noach
Collaborator

@ecortestws can you confirm your servers are Percona Server? If so, this is identified in #96 and solved via #98 (no release yet)

My current GTID testing environment is happily identifying GTID topologies.

#106 makes the web interface recognize a GTID master as "using GTID" -- but this is a visualization matter only; recoveries are using a lower level logic.


@ecortestws
Author

@shlomi-noach my servers are Oracle MySQL. Will try the new release and let you know.

@ecortestws
Author

@shlomi-noach I have tested it but it didn't work as expected.

orchestrator -version
2.1.0
05241ab2608de7ed5dd66a363690a33db36e9954
2017-02-14 08:15:34 DEBUG topology_recovery: RecoverDeadMaster: masterRecoveryType=MasterRecoveryPseudoGTID
2017-02-14 08:15:35 INFO ChangeMasterTo: Changed master on 10.102.92.162:3306 to: 10.102.92.161:3306, bin-log.000256:80539410. GTID: false 10.102.92.161:3306

@ecortestws
Author

The web interface now shows GTID enabled in the master.

@shlomi-noach
Collaborator

@ecortestws thank you. Are you again looking at a A-B-C chain with graceful-master-takeover?

I'll run some more checks and may come back with more questions.

@ecortestws
Author

@shlomi-noach yes, the same approach, the same topology. I moved C from A to B before the takeover and verified that the replication chain was healthy. I have all the logs; let me know if you need anything else. I understand that #93 hasn't been merged yet, so the credentials issue after the takeover is expected. Thanks.

@shlomi-noach
Collaborator

#93 is now merged

@shlomi-noach
Collaborator

shlomi-noach commented Mar 26, 2017

@ecortestws I'm happy if you can share the logs. If they contain sensitive data, can you please share them with me via email? My address is shlomi-noach@-youknowhichcompany-.com

@shlomi-noach
Collaborator

OK I'm able to reproduce this.

The reason this happens: auto_position is not set by default, and orchestrator uses it to recognize GTID replication. I'm looking into improving this.

@shlomi-noach
Collaborator

@ecortestws can you please confirm https://github.com/github/orchestrator/releases/tag/v2.1.1-BETA works for you?

Make sure the replicas are running with auto_position=1, as this is a requirement for a GTID-based recovery.
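To verify that on each replica, a sketch using standard MySQL statements (Auto_Position is a field in the SHOW SLAVE STATUS output):

```sql
-- Auto_Position should read 1 in the replica's status output:
SHOW SLAVE STATUS\G
-- If it reads 0, switch the replica to GTID auto-positioning:
STOP SLAVE;
CHANGE MASTER TO MASTER_AUTO_POSITION = 1;
START SLAVE;
```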

@ecortestws
Author

@shlomi-noach it worked, but replication was not started on the demoted master. Is that expected behavior? The credentials were in place, and after executing START SLAVE on the old master it started syncing with the new master.

@shlomi-noach
Collaborator

@ecortestws This is expected behavior. I see advantages and reasons for both starting and not starting replication automatically; "not starting" is on the safer side.
