
master_id (runid) should protect from accidental master restarts #2636

Closed · romange opened this issue Feb 21, 2024 · 3 comments · Fixed by #2753
Labels: enhancement (New feature or request), important (higher priority than the usual ongoing development tasks), MANAGED, Next Up (task that is ready to be worked on and should be added to working queue)

Comments

romange (Collaborator) commented Feb 21, 2024

Currently we use master_id for two purposes: cluster node-id generation and as the "master id" during replication.

  1. Similarly to Redis, we do not protect the replica from accidental data flushes during full sync (see http://antirez.com/news/80 for context).
  2. Unlike Redis, we currently do not employ partial sync, where such protection exists.

Not directly related to this issue, but important additional context: using master_id as the node id for cluster management is cumbersome and confusing.

I suggest we implement master_id protection so that a replica that has already synced and reached stable state replication (SSR) with master id A won't automatically reconnect to a master at the same address that reports master id B. Specifically, one would need to reissue the `REPLICAOF ...` command to bootstrap replication again.

This behavior change on the replica side should be gated behind a flag whose default preserves the current behaviour (see the sketch below).
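
A minimal sketch of the proposed replica-side guard, assuming hypothetical names (`should_auto_resync`, `reconnect_on_master_restart`); this is not Dragonfly's actual implementation:

```python
# Hypothetical sketch of the guard proposed above; names are illustrative only.

def should_auto_resync(stored_master_replid: str | None,
                       announced_master_replid: str,
                       reconnect_on_master_restart: bool) -> bool:
    """Decide whether the replica may start a full sync on its own."""
    if stored_master_replid is None:
        # First REPLICAOF issued by the operator: always allowed.
        return True
    if stored_master_replid == announced_master_replid:
        # Same master process (e.g. after a dropped connection): safe to resume.
        return True
    # The master at this address advertises a new repl-id, i.e. it restarted.
    # An automatic resync would flush the replica's data, so require an explicit
    # REPLICAOF unless the compatibility flag keeps the old behavior.
    return reconnect_on_master_restart
```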

romange added the enhancement and MANAGED labels on Feb 21, 2024
chakaz added a commit that referenced this issue Mar 6, 2024
This flag sets the unique ID of a node in a cluster.

It is UB (and bad) to set the same IDs to multiple nodes in the same
cluster.

If unset (default), the `master_replid` (previously known as `master_id`) is used.

Fixes #2643
Related to #2636
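
A rough sketch of the fallback described in that commit message (explicit flag value if set, otherwise the node's master replication id); the function and parameter names here are assumptions, not Dragonfly's actual code:

```python
def resolve_cluster_node_id(cluster_node_id_flag: str, master_replid: str) -> str:
    # An explicitly configured id wins; it must be unique within the cluster.
    if cluster_node_id_flag:
        return cluster_node_id_flag
    # Default: fall back to master_replid (previously known as master_id).
    return master_replid
```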
chakaz added a commit that referenced this issue Mar 10, 2024
* feat(cluster): Add `--cluster_id` flag

* gh comments

* oops - revert line removal

* fix

* replica

* disallow cluster_node_id in emulated mode

* fix replica test
kostasrim pushed a commit that referenced this issue Mar 11, 2024
ashotland (Contributor) commented:

Also consider the case where the master is restarted during takeover.

romange added the important and Next Up labels on Mar 19, 2024
adiholden assigned chakaz and unassigned adiholden on Mar 19, 2024
chakaz added a commit that referenced this issue Mar 20, 2024
Until now, replicas would reconnect to and re-replicate a master after the
master restarts. This is problematic in case the master loses its data,
because the replica will then flush all and lose its data as well.

This is a breaking change, in that whoever controls the replica now has to
explicitly issue a `REPLICAOF X Y` in order to re-establish a connection to a
new master. This is true even if the master loaded an up-to-date RDB file.

It is not necessary if the replica merely lost its connection to a master that
stayed alive; in that case the connection is re-established automatically.

Fixes #2636
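
For operators, the breaking change means replication has to be re-established explicitly after a master restart. A minimal example using redis-py as a generic RESP client (host names and ports are placeholders):

```python
import redis

# Placeholder address of the replica; adjust to your deployment.
replica = redis.Redis(host="replica.example.internal", port=6379)

# After the master restarted (and therefore advertises a new replication id),
# replication must be re-established explicitly:
replica.execute_command("REPLICAOF", "master.example.internal", "6379")

# To detach the replica instead:
# replica.execute_command("REPLICAOF", "NO", "ONE")
```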
chakaz (Collaborator) commented Mar 20, 2024

> Also consider the case where the master is restarted during takeover.

I think Roman's proposed solution should play nicely with a master restarting after takeover: after the restart the master gets a new repl-id, so even in the edge case where the replica still tries to connect to that master (shouldn't happen, but still) it will not flush its data. Other replicas of that master will likewise not flush their data, but will instead need to be explicitly told to replicate the new master.
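
Applying the guard sketched earlier in the thread to this scenario (repl-id values below are made up for illustration):

```python
# Hypothetical walk-through of the takeover/restart edge case, reusing the
# should_auto_resync() sketch from above.
stored = "aaaa1111"         # repl-id the replica had synced with before takeover
after_restart = "bbbb2222"  # new repl-id the restarted master advertises

# With the protection enabled (compatibility flag off), the replica refuses to
# auto-resync and keeps its data; the operator must issue REPLICAOF explicitly.
assert should_auto_resync(stored, after_restart,
                          reconnect_on_master_restart=False) is False
```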

chakaz added a commit that referenced this issue Mar 24, 2024
* feat(replication): Do not auto replicate different master

* fix test

* fixes

* proxy proxy java java

* better comment

* fix comments

* replica_reconnect_on_master_restart

* proxy.close()
romange (Collaborator, Author) commented Apr 9, 2024

@ashotland it's done
