Update cluster-spec with Valkey 8.0 cluster improvements #167

Merged (10 commits), Sep 12, 2024
9 changes: 9 additions & 0 deletions in commands/cluster-setslot.md

@@ -79,3 +79,12 @@
Notes:
If the source node is informed before the destination node and the destination node crashes before it is set as new slot owner, the slot is left with no owner, even after a successful failover.
* Step 6, sending `SETSLOT` to the nodes not involved in the resharding, is not technically necessary since the configuration will eventually propagate itself.
However, it is a good idea to do so in order to stop nodes from pointing to the wrong node for the moved hash slot as soon as possible, resulting in fewer redirections before clients find the right node.
* Starting from Valkey 8.0, `CLUSTER SETSLOT` is synchronously replicated to all healthy replicas
Member
I don't want to put synchronous replication anywhere in the docs. It's not technically synchronous, since we don't revert it on failures, so it's only one copy of the data. I would prefer saying "is replicated to all healthy replicas before being executed on the primaries."

Member Author
@PingXie PingXie Sep 1, 2024

since we don't revert it on failures, so it's only one copy of the data.

I don't think this is the common definition of "synchronous" replication. I modeled our "synchronous" replication after PostgreSQL's. The "revert" part falls in the realm of "distributed transactions", and it is a step up from "synchronous replication".

I would prefer saying "is replicated to all healthy replicas before being executed on the primaries."

This is also not accurate nor complete because we do wait for a certain number of replicas to ack the replication and this is the core part of this feature offering.

Member
@madolson madolson Sep 2, 2024

I don't think this is the common definition of "synchronous" replication. I modeled our "synchronous" replication after PostgreSQL's. The "revert" part falls in the realm of "distributed transactions", and it is a step up from "synchronous replication".

In PostgreSQL, everything is a transaction, though. There is no "step up", so to speak; they are the same. I was also drawing on my history with PostgreSQL when objecting to the wording, since I think the database world provides a much stronger guarantee than we are providing.

This is also not accurate nor complete because we do wait for a certain number of replicas to ack the replication and this is the core part of this feature offering.

"Is replicated and acknowledged on all healthy replicas before being executed on the primary"?

Member Author

In PostgreSQL, everything is a transaction, though.

I am not aware, based on my research, that PostgreSQL's replication is done transactionally. My understanding is that there is no rollback on the other replicas when one replica fails to ack the replication either; the primary just waits for the replicas to ack the replication. Note that in order to achieve transactional replication, one would need to implement 2PC, which trades (more) performance/availability for consistency. Do you have a PostgreSQL code or doc pointer that indicates their replication is done transactionally? I can dig some more too.

"Is replicated and acknowledged on all healthy replicas before being executed on the primary"?

Yes.

The "synchronous" qualifier is critical here; it is used to contrast with the current "asynchronous" replication, which doesn't provide the required semantics.

BTW, not related to PostgreSQL, but "synchronous replication" is also used with exactly the same semantics as I used here in another database I worked on before.

Contributor

Can we avoid the word "synchronous" to make everyone happy? The concern is that someone sees "synchronous replication" out of context and posts a blog post "Valkey switches to synchronous replication" to create confusion.

"Is replicated and acknowledged on all healthy replicas before being executed on the primary"?

This sounds a lot like synchronization to me, without using the word. Isn't it clear enough? Then maybe indicate even more clearly that the primary waits for the ack before continuing, like...

"Is replicated to all healthy replicas and acknowledged back to the primary before being executed on the primary"?

Member Author

"Best effort" == "Opportunistic" :) I think we will need a name at some point IF we expand this pattern beyond this command, but I am on board with not adding a new concept for this one. I will drop "synchronous" but keep the high-level explanation in this doc.

Contributor

Best effort has been kind of the key concept since the very beginning. Memcached doesn't have persistence or replication. Redis/Valkey has them, best effort, async, without compromising latency.

The cluster bus is supposed to be truly consistent though. We'll fix that some day.

Contributor

Btw, the MIGRATE command doesn't use replication or the cluster bus. It just sets up a new connection. Is it synchronous? It waits for OK before deleting the transferred key, I think, or does it? Is this how we should have done SETSLOT?

Member

"Best effort" == "Opportunistic"

Yes. I would be OK with either of those two, FWIW; it's not my top preference.

Member Author

I dropped "synchronous". PTAL

running Valkey version 8.0+. By default, this synchronous replication must complete within 2 seconds.
If the replication fails, the primary does not execute the command, and the client receives a
`NOREPLICAS Not enough good replicas to write` error. Operators can retry the command or customize the
timeout using the `TIMEOUT` parameter to further increase the reliability of live reconfiguration:
Comment on lines +83 to +86
Member
@madolson madolson Sep 1, 2024

Btw, we should decide whether this style guide is wrong or not: https://github.com/valkey-io/valkey-doc?tab=readme-ov-file#styling-guidelines. "Start every sentence on a new line."

We didn't force it in the past to retain history, but I find it annoying that we only sometimes follow it.

Contributor

I don't mind dropping it. I don't like to complain about it, so I haven't enforced it.

We can just skip reflowing text when it's edited.

Member Author

Yeah, can we consider dropping this rule? I don't see it being enforced on the existing docs today. I feel that there are a few main situations in which we need to modify a doc:

  1. either we fix typos (or small changes)
  2. or we rewrite some portion of the doc because we have changed the behavior
  3. or we add new content like in this case

In the second and third cases, I would think we will have to review the entire paragraph anyway. I don't think we should ever re-flow text in the first case, irrespective of how we wrote the doc in the first place.

thoughts?


CLUSTER SETSLOT slot [MIGRATING|IMPORTING|NODE] node-id [TIMEOUT timeout]

Here, `timeout` is measured in seconds, with 0 meaning to wait indefinitely.
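From the client's side, the retry behavior described above can be sketched as follows. This is an illustrative Python sketch, not a real client implementation: `send_command` and `NoReplicasError` are assumed stand-ins for whatever Valkey client library is in use.

```python
# Hypothetical sketch: retrying CLUSTER SETSLOT with escalating TIMEOUT
# values when the primary cannot get acks from enough healthy replicas.
class NoReplicasError(Exception):
    """Raised when the server replies -NOREPLICAS."""

def setslot_with_retry(send_command, slot, node_id, timeouts=(2, 5, 10)):
    """Try CLUSTER SETSLOT, widening the TIMEOUT (seconds) on each retry.

    Each attempt asks the primary to wait up to `timeout` seconds for its
    healthy replicas to acknowledge the replicated command before it is
    executed; a timeout of 0 would mean wait indefinitely.
    """
    last_err = None
    for timeout in timeouts:
        try:
            return send_command("CLUSTER", "SETSLOT", slot, "NODE", node_id,
                                "TIMEOUT", timeout)
        except NoReplicasError as err:
            last_err = err  # replicas did not ack in time; retry with more
    raise last_err
```

The escalating timeouts are only one possible policy; an operator could equally retry with the same timeout or fall back to manual intervention.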
41 changes: 41 additions & 0 deletions in topics/cluster-spec.md

@@ -513,6 +513,37 @@
set the slots to their normal state again. The same command is usually
sent to all other nodes to avoid waiting for the natural
propagation of the new configuration across the cluster.

#### Synchronous replication of `CLUSTER SETSLOT`

Starting from Valkey 8.0, the `CLUSTER SETSLOT` command is synchronously replicated to all healthy replicas
running Valkey version 8.0+. By default, this synchronous replication must complete within 2 seconds.
If the replication fails, the primary does not execute the command, and the client receives a
`NOREPLICAS Not enough good replicas to write` error. Operators can retry the command or customize the
timeout using the `TIMEOUT` parameter to further increase the reliability of live reconfiguration:

CLUSTER SETSLOT slot [MIGRATING|IMPORTING|NODE] node-id [TIMEOUT timeout]

Here, `timeout` is measured in seconds, with 0 meaning to wait indefinitely.

Synchronous replication is critical for maintaining cluster consistency during live reconfiguration.
Before changes like slot ownership and migrating states are applied on the primary, they must be fully
replicated to all replicas. This prevents loss of state if the primary fails after executing the command.

Consider a scenario where the target primary node `B` is finalizing a slot migration.
Before the `SETSLOT` command is replicated to its replica node `B’`, `B` might send a cluster `PONG`
message to the source primary node `A`, prompting `A` to relinquish its ownership of the slot in question.
If `B` crashes right after this point, the replica node `B’`, which could be elected as the new primary,
would not be aware of the slot ownership transfer without the synchronous replication of `SETSLOT`.
This would leave the slot without an owner, leading to potential data loss and cluster topology inconsistency.
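The race above can be made concrete with a toy model. This is a deliberately simplified sketch of the ordering argument, not Valkey code; all names and data structures are illustrative assumptions.

```python
# Toy model: A is the source primary, B the migration target, B' is B's
# replica. We ask what B' believes about the slot owner after B crashes.
def finalize_and_crash(replicate_first):
    """Return the slot owner as seen by B' once it is promoted."""
    replica_view = {"owner": "A"}  # B' still believes A owns the slot
    if replicate_first:
        # Valkey 8.0 behavior: SETSLOT reaches B' before B executes it.
        replica_view["owner"] = "B"
    # B executes SETSLOT, claims the slot, and PONGs A, so A relinquishes
    # its ownership. Then B crashes: everything B knew in memory is lost,
    # and B' is elected the new primary of B's shard.
    return replica_view["owner"]
```

With replication first, `B'` inherits the ownership claim and the slot keeps an owner after the failover; without it, `B'` still points at `A`, which has already relinquished the slot, so no node owns it.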

#### Election in empty shards

Starting from Valkey 8.0, Valkey clusters introduce the ability to elect a primary in empty shards.
This behavior ensures that even when a shard is in the process of receiving its first slot,
a primary can be elected. This prevents scenarios where there would be no primary available in the
empty shard to handle redirected requests from the official slot owner,
thereby maintaining availability during the live reconfiguration.

### ASK redirection

In the previous section, we briefly talked about ASK redirection. Why can't
@@ -550,6 +581,16 @@
Slots migration is explained in similar terms but with different wording
(for the sake of redundancy in the documentation) in the `CLUSTER SETSLOT`
command documentation.

Starting from Valkey 8.0, when the primary in either the source or target shard fails during live reconfiguration,
the primary in the other shard will automatically attempt to update its migrating/importing state to correctly pair
with the newly elected primary. If this update is successful, the ASK redirection will continue functioning without
requiring administrator intervention. In the event that slot migration fails, administrators can manually resume
the interrupted slot migration by running the command `valkey-cli --cluster fix <ip:port>`.

Additionally, since Valkey 8.0, replicas are able to return `ASK` redirects during slot migrations.
This capability was previously unavailable, as replicas were not aware of ongoing slot migrations in earlier versions.
This change is also documented in the READONLY command section.
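For illustration, a client's handling of an `ASK` redirect during such a migration might look like the following sketch. The node map, callable interface, and flattened reply strings are simplified assumptions for the example, not a real client API.

```python
# Hedged sketch of client-side ASK handling during a slot migration.
def execute(nodes, entry_node, command, key):
    """Send `command key` to entry_node; follow at most one ASK redirect.

    `nodes` maps "host:port" to a callable standing in for a connection.
    """
    reply = nodes[entry_node](command, key, asking=False)
    if isinstance(reply, str) and reply.startswith("ASK "):
        # A reply shaped like "ASK <slot> <host:port>" means the key has
        # already moved; retry once on the importing node, preceded by
        # ASKING so it serves the request despite not owning the slot yet.
        _, _slot, target = reply.split()
        return nodes[target](command, key, asking=True)
    return reply
```

A production client would also handle `MOVED` replies, refresh its slot map, and bound the number of redirects it follows.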

### Client connections and redirection handling

To be efficient, Valkey Cluster clients maintain a map of the current slot