Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Within autorejoin-script Script not executed #453

Closed
dibrother opened this issue Oct 19, 2022 · 6 comments
Closed

Within autorejoin-script Script not executed #453

dibrother opened this issue Oct 19, 2022 · 6 comments
Assignees
Milestone

Comments

@dibrother
Copy link

dibrother commented Oct 19, 2022

  • replication-manager-osc-2.2.27
############
## REJOIN ##
############

autorejoin = true
autorejoin-script = "/etc/replication-manager/scrpit/rejoin.sh"
autorejoin-semisync = true
autorejoin-backup-binlog = true
autorejoin-flashback = false
autorejoin-mysqldump = false

When the old master is started, it is automatically added to the topology,The script set in autorejoin-script script is not executed。Unable to find log Calling autorejoin-script 。

log:

2022/10/19 09:47:45     STATE   RESOLV WARN0091 : Server as errant transaction 192.168.66.166:3312
2022/10/19 09:47:45     STATE   RESOLV ERR00041 : Skip slave in election 192.168.66.166:3312 has more than 30 seconds of replication delay (166) Sql:
2022/10/19 09:47:45     INFO    Server 192.168.66.166:3312 previous state changed to: Slave
2022/10/19 09:47:45     INFO    Server 192.168.66.166:3312 state transition SlaveLate changed to: Slave
2022/10/19 09:47:43     STATE   OPENED WARN0091 : Server as errant transaction 192.168.66.166:3312
2022/10/19 09:47:43     STATE   OPENED ERR00041 : Skip slave in election 192.168.66.166:3312 has more than 30 seconds of replication delay (166) Sql:
2022/10/19 09:47:43     STATE   RESOLV WARN0022 : Rejoining standalone server 192.168.66.166:3312 to master 192.168.66.167:3312
2022/10/19 09:47:43     INFO    Server 192.168.66.166:3312 previous state changed to: SlaveLate
2022/10/19 09:47:43     INFO    Server 192.168.66.166:3312 state transition StandAlone changed to: SlaveLate
2022/10/19 09:47:41     STATE   OPENED WARN0022 : Rejoining standalone server 192.168.66.166:3312 to master 192.168.66.167:3312
2022/10/19 09:47:41     INFO    Server 192.168.66.166:3312 previous state changed to: StandAlone
2022/10/19 09:47:41     INFO    Rejoin old Master 192.168.66.166:3312 , backing up lost event to /var/lib/replication-manager/cluster1/crash-bin-20221019094741
2022/10/19 09:47:41     INFO    Doing MySQL GTID switch of the old master
2022/10/19 09:47:41     INFO    Found same or lower GTID 0-8398350563552829290-425,0-4764424619230958061-3248 and new elected master was 0-8398350563552829290-425,0-4764424619230958061-3248
2022/10/19 09:47:41     INFO    98B091AD-4EBE-11ED-BDBE-000C2939862D 192.168.66.166:3312
2022/10/19 09:47:41     INFO    98B091AD-4EBE-11ED-BDBE-000C2939862D 192.168.66.166:3312
2022/10/19 09:47:41     INFO    98B091AD-4EBE-11ED-BDBE-000C2939862D 192.168.66.166:3312
2022/10/19 09:47:41     INFO    Rejoin server found a crash GTID greater than 0
2022/10/19 09:47:41     INFO    Crash Saved GTID sequence 3248 from server id 4764424619230958061
2022/10/19 09:47:41     INFO    98B091AD-4EBE-11ED-BDBE-000C2939862D 192.168.66.166:3312
2022/10/19 09:47:41     INFO    98B091AD-4EBE-11ED-BDBE-000C2939862D 192.168.66.166:3312
2022/10/19 09:47:41     INFO    Rejoined GTID sequence 3248 from server id 4764424619230958061
2022/10/19 09:47:41     INFO    98B091AD-4EBE-11ED-BDBE-000C2939862D 192.168.66.166:3312
2022/10/19 09:47:41     INFO    98B091AD-4EBE-11ED-BDBE-000C2939862D 192.168.66.166:3312
2022/10/19 09:47:41     INFO    Crash info &{192.168.66.166:3312 mysql-bin.000008 219518740 mysql-bin.000005 219506244 %!s(bool=true) %!s(*gtid.List=&[{0 8398350563552829290 425} {0 4764424619230958061 3248}]) 192.168.66.167:3312}
2022/10/19 09:47:41     INFO    Rejoin master incremental 192.168.66.166:3312
2022/10/19 09:47:41     INFO    Backup /usr/local/mysql/bin/mysqlbinlog /usr/local/mysql/bin/mysqlbinlog --read-from-remote-server --raw --stop-never-slave-server-id=10000 --user=rep_monitor --password=XXXX --host=192.168.66.166 --port=3312 --result-file=/var/lib/replication-manager/cluster1-server66166- --start-position=219518740 mysql-bin.000008
2022/10/19 09:47:41     INFO    Backup ahead binlog events of previously failed server 192.168.66.166:3312
2022/10/19 09:47:41     INFO    Server 192.168.66.166:3312 state transition Failed changed to: StandAlone
2022/10/19 09:47:41     INFO    Setting Read Only on unconnected server 192.168.66.166:3312 as active monitor and other master is discovered
2022/10/19 09:47:41     INFO    State changed, init failed server 192.168.66.166:3312 as unconnected
2022/10/19 09:47:38     INFO    Server refresh failed but ping connect driver: bad connection
2022/10/19 09:45:02     STATE   RESOLV WARN0023 : Number of failed master ping has been reached
2022/10/19 09:45:02     INFO    Server 192.168.66.167:3312 previous state changed to: Master
2022/10/19 09:44:59     STATE   OPENED WARN0023 : Number of failed master ping has been reached
2022/10/19 09:44:59     INFO    Master switch on 192.168.66.167:3312 complete
2022/10/19 09:44:59     INFO    Change master on slave 192.168.66.168:3312
2022/10/19 09:44:59     INFO    Switching other slaves to the new master
@dibrother
Copy link
Author

When you want to modify the old master parameter in the autorejoin
with: set global rpl_ semi_ sync_ slave_ enabled=1; set global rpl_ semi_ sync_ master_ enabled=0;
How to achieve ?

@svaroqui
Copy link
Collaborator

Hello, I don't think this is needed i always set both values to ON in config so that that the config stay symmetric between the nodes can you confirm this works for MYSQL 8 as Well ?

Indeed the rejoin script is only call on non incremental methods may be i can start with running the script if defined first

func (server *ServerMonitor) RejoinMasterSST() error {
	if server.ClusterGroup.Conf.AutorejoinMysqldump == true {
		server.ClusterGroup.LogPrintf("INFO", "Rejoin flashback dump restore %s", server.URL)
		err := server.RejoinDirectDump()
		if err != nil {
			server.ClusterGroup.LogPrintf("ERROR", "mysqldump flashback restore failed %s", err)
			return errors.New("Dump from master failed")
		}
	} else if server.ClusterGroup.Conf.AutorejoinLogicalBackup {
		server.JobFlashbackLogicalBackup()
	} else if server.ClusterGroup.Conf.AutorejoinPhysicalBackup {
		server.JobFlashbackPhysicalBackup()
	} else if server.ClusterGroup.Conf.AutorejoinZFSFlashback {
		server.RejoinPreviousSnapshot()
	} else if server.ClusterGroup.Conf.RejoinScript != "" {
		server.ClusterGroup.LogPrintf("INFO", "Calling rejoin flashback script")
		var out []byte
		out, err := exec.Command(server.ClusterGroup.Conf.RejoinScript, misc.Unbracket(server.Host), misc.Unbracket(server.ClusterGroup.master.Host)).CombinedOutput()
		if err != nil {
			server.ClusterGroup.LogPrintf("ERROR", "%s", err)
		}
		server.ClusterGroup.LogPrintf("INFO", "Rejoin script complete %s", string(out))
	} else {
		server.ClusterGroup.LogPrintf("INFO", "No SST rejoin method found")
		return errors.New("No SST rejoin flashback method found")
	}

	return nil
} 

@dibrother
Copy link
Author

Whether at 8 or 5,if master set :

rpl_ semi_ sync_ master_ enabled=1; 
rpl_ semi_ sync_ slave_ enabled=0;
rpl_semi_sync_source_timeout = 10000000;    ## large value

--- Failover occurs
now replication-manager set

autorejoin = true

when old master startup ,replication-manager will set it as a slave,but rpl_ semi_ sync_ master_ enabled value remains 1;Due to rpl_ semi_ sync_ source_timeout value is large,this will cause the slave to be in the state of waiting for ACK

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for source to send event
                  Master_Host: 192.168.66.167
                  Master_User: rep_monitor
                  Master_Port: 3312
                Connect_Retry: 5
              Master_Log_File: mysql-bin.000001
          Read_Master_Log_Pos: 113789324
               Relay_Log_File: relay.000003
                Relay_Log_Pos: 3876154
        Relay_Master_Log_File: mysql-bin.000001
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 112411472
              Relay_Log_Space: 5579430
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 51
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 66167
                  Master_UUID: 645a456e-4f79-11ed-bee1-000c294448db
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Waiting for dependent transaction to commit
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 645a456e-4f79-11ed-bee1-000c294448db:1-17071
            Executed_Gtid_Set: 645a456e-4f79-11ed-bee1-000c294448db:1-12854,      ##  It's blocked here
c7665fbe-4f77-11ed-b6d6-000c2939862d:1-1068
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
       Master_public_key_path: 
        Get_master_public_key: 0
            Network_Namespace: 
1 row in set, 1 warning (0.00 sec)


##  Waiting for semi-sync ACK from replica

mysql> show processlist;
+-----+-------------+----------------------+------+---------+------+---------------------------------------------+------------------+
| Id  | User        | Host                 | db   | Command | Time | State                                       | Info             |
+-----+-------------+----------------------+------+---------+------+---------------------------------------------+------------------+
|   7 | rep_monitor | 192.168.66.168:60160 | NULL | Sleep   |    1 |                                             | NULL             |
|  70 | system user | connecting host      | NULL | Connect | 1164 | Waiting for source to send event            | NULL             |
|  71 | system user |                      | NULL | Query   |   33 | Waiting for dependent transaction to commit | NULL             |
|  72 | system user |                      | NULL | Query   |   33 | Waiting for semi-sync ACK from replica      | NULL             |
|  73 | system user |                      | NULL | Connect | 1164 | Waiting for an event from Coordinator       | NULL             |
|  74 | system user |                      | NULL | Connect | 1164 | Waiting for an event from Coordinator       | NULL             |
|  75 | system user |                      | NULL | Connect | 1164 | Waiting for an event from Coordinator       | NULL             |
| 600 | root        | localhost            | NULL | Query   |    0 | init                                        | show processlist |
+-----+-------------+----------------------+------+---------+------+---------------------------------------------+------------------+
8 rows in set (0.00 sec)

@dibrother
Copy link
Author

Is there any other way to avoid this problem? Or where to call script settings

@svaroqui
Copy link
Collaborator

Thanks for reporting i never thought of one using such big timeout , i will push many solutions stay tune

  • Add a set global rpl_ semi_ sync_ master_ enabled=0; in rejoin sequence
  • Enable to run rejoin script what ever is the restore scenario by introducing a backup-load-scipt and backup-save-script so that the actual script used in SST would be such new parameter and the run the rejoin script before any other rejoin strategy take place

svaroqui added a commit that referenced this issue Oct 20, 2022
… during failover, rejoin, switchover. It's not enable per default #453

autorejoin-script is now executed what ever rejoin method used by replication
backup-save-script and backup-load-script can be used to shortcut replication-manager backup restore features parameters host, leader_host, port, leader_port
@svaroqui
Copy link
Collaborator

Can you please try out 2.2.29 added all the options needed to fix into the commit , but i can't make a default as it's a MySQL bug that with long timeout it wait for ACK for ever as what would happen on multi master ou multi source setup when a node role as a leader and replica. Please reopen if needed

@svaroqui svaroqui self-assigned this Oct 20, 2022
@svaroqui svaroqui added this to the 2.2.29 milestone Oct 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants