This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

Slaves with broken replication should not be shown by which-cluster-osc-replicas #588

Closed
cezmunsta opened this issue Aug 23, 2018 · 4 comments

Comments

@cezmunsta
Contributor

The status of the slaves does not appear to be checked before they are suggested as candidate control slaves for an online schema change (OSC).

The logs contain a message showing that orchestrator is fully aware of the issue:

Aug 23 13:57:15 db1 orchestrator[27247]: 2018-08-23 13:57:15 WARNING executeCheckAndRecoverFunction: ignoring analysisEntry that has no action plan: FirstTierSlaveFailingToConnectToMaster; key: 10.0.0.2:3306

The following relates to a 2-node cluster where the slave is no longer replicating.
When requesting which-cluster-osc-replicas, the slave is still returned:

$ orchestrator-client -c which-cluster-osc-replicas -a cluster1
10.0.0.2:3306

Here is the live replication status:

> show slave status\G
             Slave_IO_Running: Connecting
            Slave_SQL_Running: Yes
        Seconds_Behind_Master: NULL
                Last_IO_Error: error connecting to master 'repl@10.0.0.1:3306' - retry-time: 60  retries: 153
      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
      Last_IO_Error_Timestamp: 180823 16:15:30
1 row in set (0.00 sec)

Only healthy slaves would be expected to be suggested as candidate control slaves for an OSC.
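
In the meantime, the suggested replicas can be checked against live replication status before one is picked as the control slave. A minimal sketch, assuming each replica is reachable with the mysql client and suitable credentials:

$ for replica in $(orchestrator-client -c which-cluster-osc-replicas -a cluster1); do
    host="${replica%:*}"; port="${replica#*:}"
    # keep only replicas whose IO and SQL threads are both running
    mysql -h "$host" -P "$port" -e 'SHOW SLAVE STATUS\G' \
      | grep -cE 'Slave_(IO|SQL)_Running: Yes' | grep -qx 2 \
      && echo "$replica"
  done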

@shlomi-noach
Collaborator

shlomi-noach commented Oct 7, 2018

Thank you for the PR, and I apologize for the late response; I missed this issue.

There is some risk in only providing "good" replicas. Isn't there a purpose in returning bad replicas, too?
Say you begin a migration and one replica breaks. Do you wish to proceed with the migration or stop it? I can see the arguments on both sides. Can you share your thoughts?

@shlomi-noach
Collaborator

friendly ping

@cezmunsta
Contributor Author

cezmunsta commented Oct 17, 2018

Just to clarify the reasoning: imagine this is an OSC to free space on a fragmented table, which would save the slaves that are still healthy, while others have already met their fate of "No space left on device". I certainly would not want a broken slave to be presented to me for use as the control slave.

Similarly, if all or many slaves were broken, I would not want to add triggers and end up with a broken control slave that blocks everything and causes extra writes.

The cleanest and simplest solution to the real issue (not being able to tell from the command that a potential control slave is broken) is something like:

$ orchestrator-client -c which-broken-replicas -a cluster1
10.0.0.2:3306

This could then be used in a number of ways, including selectively filtering candidates for the control slave. It also moves the continue-or-stop decision to the consumer instead of orchestrator.
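
For example, assuming the proposed which-broken-replicas command prints one host:port per line like which-cluster-osc-replicas does, a consumer could subtract the broken set from the OSC candidates before picking a control slave; a rough sketch:

$ comm -23 \
    <(orchestrator-client -c which-cluster-osc-replicas -a cluster1 | sort) \
    <(orchestrator-client -c which-broken-replicas -a cluster1 | sort)

comm -23 keeps only the lines unique to the first list, i.e. the OSC candidates that are not reported as broken.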

@shlomi-noach
Collaborator

@cezmunsta which-broken-replicas is a good addition.

Allow me to simplify things even further and add a which-cluster-osc-running-replicas command?
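
Presumably that would mirror the existing commands and list only replicas whose replication is running, e.g. (hypothetical invocation, following the pattern above):

$ orchestrator-client -c which-cluster-osc-running-replicas -a cluster1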
