This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

Slaves with broken replication should not be shown by which-cluster-osc-replicas #588

Closed
cezmunsta opened this issue Aug 23, 2018 · 4 comments

Comments

@cezmunsta
Contributor

The status of the slaves does not appear to be checked before they are suggested as candidate control slaves for an online schema change (OSC).

The logs contain a message showing that orchestrator is fully aware of the issue:

Aug 23 13:57:15 db1 orchestrator[27247]: 2018-08-23 13:57:15 WARNING executeCheckAndRecoverFunction: ignoring analysisEntry that has no action plan: FirstTierSlaveFailingToConnectToMaster; key: 10.0.0.2:3306

The following relates to a 2-node cluster where the slave is no longer replicating.
When requesting which-cluster-osc-replicas, the slave is still returned:

$ orchestrator-client -c which-cluster-osc-replicas -a cluster1
10.0.0.2:3306

Here is the live replication status:

> show slave status\G
             Slave_IO_Running: Connecting
            Slave_SQL_Running: Yes
        Seconds_Behind_Master: NULL
                Last_IO_Error: error connecting to master 'repl@10.0.0.1:3306' - retry-time: 60  retries: 153
      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
      Last_IO_Error_Timestamp: 180823 16:15:30
1 row in set (0.00 sec)

Only healthy slaves would be expected to be suggested as candidate control slaves for an OSC.
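
In the meantime, the suggested replicas can be checked against live replication status before one is picked as the control slave. A minimal sketch, assuming each replica is reachable with the mysql client and suitable credentials:

$ for replica in $(orchestrator-client -c which-cluster-osc-replicas -a cluster1); do
    host="${replica%:*}"; port="${replica#*:}"
    # keep only replicas whose IO and SQL threads are both running
    mysql -h "$host" -P "$port" -e 'SHOW SLAVE STATUS\G' \
      | grep -cE 'Slave_(IO|SQL)_Running: Yes' | grep -qx 2 \
      && echo "$replica"
  done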

@shlomi-noach
Collaborator

shlomi-noach commented Oct 7, 2018

Thank you for the PR, and I apologize for the late response; I missed this issue.

There is some risk in only providing "good" replicas. Isn't there a purpose in returning bad replicas, too?
Say you begin a migration and one replica breaks. Do you wish to proceed with the migration or stop it? I can see the arguments on both sides. Can you share your thoughts?

@shlomi-noach
Collaborator

friendly ping

@cezmunsta
Contributor Author

cezmunsta commented Oct 17, 2018

Just to clarify the reasoning: imagine this is an OSC to free space on a fragmented table, which would save the slaves that are still healthy, while others have already met their fate of "No space left on device". I certainly would not want a broken slave to be presented to me for use as the control slave.

Similarly, if all or many slaves were broken, I would not want to add triggers and end up with a broken control slave that blocks everything and causes extra writes.

The cleanest and simplest solution to the real issue (not being able to tell from the command that a potential control slave is broken) is something like:

$ orchestrator-client -c which-broken-replicas -a cluster1
10.0.0.2:3306

This could then be used in a number of ways, including selectively filtering candidates for the control slave. It also moves the continue-or-stop decision to the consumer instead of orchestrator.
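
For example, assuming the proposed which-broken-replicas command prints one host:port per line like which-cluster-osc-replicas does, a consumer could subtract the broken set from the OSC candidates before picking a control slave; a rough sketch:

$ comm -23 \
    <(orchestrator-client -c which-cluster-osc-replicas -a cluster1 | sort) \
    <(orchestrator-client -c which-broken-replicas -a cluster1 | sort)

comm -23 keeps only the lines unique to the first list, i.e. the OSC candidates that are not reported as broken.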

@shlomi-noach
Collaborator

@cezmunsta which-broken-replicas is a good addition.

Allow me to simplify things even further and add a which-cluster-osc-running-replicas command?
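
Presumably that would mirror the existing commands and list only replicas whose replication is running, e.g. (hypothetical invocation, following the pattern above):

$ orchestrator-client -c which-cluster-osc-running-replicas -a cluster1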
