
master_id (runid) should protect from accidental master restarts #2636

Closed · romange opened this issue Feb 21, 2024 · 3 comments · Fixed by #2753
Labels: enhancement (New feature or request), important (higher priority than the usual ongoing development tasks), MANAGED, Next Up (task that is ready to be worked on and should be added to working queue)

Comments

romange (Collaborator) commented Feb 21, 2024

Currently we use master_id for two purposes: cluster node-id generation and as the "master id" during replication.

  1. Similarly to Redis, we do not protect the replica from accidental data flushes during full sync (see http://antirez.com/news/80 for context).
  2. Unlike Redis, we currently do not employ partial sync, where such protection exists.

Not directly related to this issue, but important additional context: using master_id as the node id for cluster management is cumbersome and confusing.

I suggest we implement master_id protection so that a replica that has already synced and reached stable state replication (SSR) with master id A won't automatically reconnect to a master at the same address that reports master id B. Specifically, one would need to reissue the `REPLICAOF ...` command to bootstrap replication again.

This behavior change on the replica side should be gated behind a flag whose default preserves the current behaviour (see the sketch below).
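
A minimal sketch of the proposed replica-side guard, assuming hypothetical names (`should_auto_resync`, `reconnect_on_master_restart`); this is not Dragonfly's actual implementation:

```python
# Hypothetical sketch of the guard proposed above; names are illustrative only.

def should_auto_resync(stored_master_replid: str | None,
                       announced_master_replid: str,
                       reconnect_on_master_restart: bool) -> bool:
    """Decide whether the replica may start a full sync on its own."""
    if stored_master_replid is None:
        # First REPLICAOF issued by the operator: always allowed.
        return True
    if stored_master_replid == announced_master_replid:
        # Same master process (e.g. after a dropped connection): safe to resume.
        return True
    # The master at this address advertises a new repl-id, i.e. it restarted.
    # An automatic resync would flush the replica's data, so require an explicit
    # REPLICAOF unless the compatibility flag keeps the old behavior.
    return reconnect_on_master_restart
```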

romange added the enhancement and MANAGED labels on Feb 21, 2024
chakaz added a commit that referenced this issue Mar 6, 2024
This flag sets the unique ID of a node in a cluster.

It is UB (and bad) to set the same IDs to multiple nodes in the same
cluster.

If unset (default), the `master_replid` (previously known as `master_id`) is used.

Fixes #2643
Related to #2636
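
A rough sketch of the fallback described in that commit message (explicit flag value if set, otherwise the node's master replication id); the function and parameter names here are assumptions, not Dragonfly's actual code:

```python
def resolve_cluster_node_id(cluster_node_id_flag: str, master_replid: str) -> str:
    # An explicitly configured id wins; it must be unique within the cluster.
    if cluster_node_id_flag:
        return cluster_node_id_flag
    # Default: fall back to master_replid (previously known as master_id).
    return master_replid
```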
chakaz added a commit that referenced this issue Mar 10, 2024
* feat(cluster): Add `--cluster_id` flag

* gh comments

* oops - revert line removal

* fix

* replica

* disallow cluster_node_id in emulated mode

* fix replica test
kostasrim pushed a commit that referenced this issue Mar 11, 2024
ashotland (Contributor) commented:

Also consider the case where the master is restarted during takeover.

romange added the important and Next Up labels on Mar 19, 2024
adiholden assigned chakaz and unassigned adiholden on Mar 19, 2024
chakaz added a commit that referenced this issue Mar 20, 2024
Until now, replicas would reconnect to and re-replicate a master after the
master restarts. This is problematic in case the master loses its data,
because the replica will then flush all and lose its data as well.

This is a breaking change, in that whoever controls the replica now has to
explicitly issue a `REPLICAOF X Y` in order to re-establish a connection to a
new master. This is true even if the master loaded an up-to-date RDB file.

It is not necessary if the replica merely lost its connection to a master that
stayed alive; in that case the connection is re-established automatically.

Fixes #2636
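
For operators, the breaking change means replication has to be re-established explicitly after a master restart. A minimal example using redis-py as a generic RESP client (host names and ports are placeholders):

```python
import redis

# Placeholder address of the replica; adjust to your deployment.
replica = redis.Redis(host="replica.example.internal", port=6379)

# After the master restarted (and therefore advertises a new replication id),
# replication must be re-established explicitly:
replica.execute_command("REPLICAOF", "master.example.internal", "6379")

# To detach the replica instead:
# replica.execute_command("REPLICAOF", "NO", "ONE")
```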
chakaz (Collaborator) commented Mar 20, 2024

> Also consider the case where the master is restarted during takeover.

I think Roman's proposed solution should play nicely with a master restarting after takeover: after the restart the master gets a new repl-id, so even in the edge case where the replica still tries to connect to that master (shouldn't happen, but still) it will not flush its data. Other replicas of that master will likewise not flush their data, but will instead need to be explicitly told to replicate the new master.
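
Applying the guard sketched earlier in the thread to this scenario (repl-id values below are made up for illustration):

```python
# Hypothetical walk-through of the takeover/restart edge case, reusing the
# should_auto_resync() sketch from above.
stored = "aaaa1111"         # repl-id the replica had synced with before takeover
after_restart = "bbbb2222"  # new repl-id the restarted master advertises

# With the protection enabled (compatibility flag off), the replica refuses to
# auto-resync and keeps its data; the operator must issue REPLICAOF explicitly.
assert should_auto_resync(stored, after_restart,
                          reconnect_on_master_restart=False) is False
```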

chakaz added a commit that referenced this issue Mar 24, 2024
* feat(replication): Do not auto replicate different master

* fix test

* fixes

* proxy proxy java java

* better comment

* fix comments

* replica_reconnect_on_master_restart

* proxy.close()
romange (Collaborator, Author) commented Apr 9, 2024

@ashotland it's done
