Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow clients to subscribe to slot migrations #298

Draft
wants to merge 2 commits into
base: unstable
Choose a base branch
from

Conversation

zuiderkwast
Copy link
Contributor

IMHO, this is the most important event for cluster clients to be able to subscribe to. Why? In a scaling-and-balancing scenario, many slots can be moved. They are moved one by one. If a client performs a slot mapping update (e.g. calls CLUSTER SLOTS) every time it receives a -MOVED redirect, it will need to do this many times if it gets a MOVED redirect after each migrated slot.

Updating the slot mapping at every MOVED redirect is a recommended behavior according to the cluster spec, so clients do that:

An alternative is to just refresh the whole client-side cluster layout using the CLUSTER SHARDS, or the deprecated CLUSTER SLOTS, command when a MOVED redirection is received. When a redirection is encountered, it is likely multiple slots were reconfigured rather than just one, so updating the client configuration as soon as possible is often the best strategy.

At least for long-lived connections, this is a useful thing to do.

The cluster bus doesn't distinguish between migrations and other actions. It just sends the slot bitmap per node. This feature detects a migration by checking if exactly one slot has a new owner.

Only one moved slot per cluster bus message or command is notified. This is to avoid flooding the clients. If more slots are moved at the same time, such as at failovers, clients will need to handle MOVED redirects and update the slot mapping accordingly. What a node can detect is the modification of slot ownerships and addition/removal of replicas.

A special pubsub channel __cluster__:moved is used (naming inspired by client-side caching). Since the payload of pubsub messages are strings, it is encoded as a string on the form "MOVED slot endpoint:port", just like a MOVED redirects.

Some special logic added to avoid creating the pubsub message if there are no subscribers, and to adjust the port (TLS or non-TLS) to the receiver's connection.

Fixes #57.


Future improvements:

  • Possiblity to be notified about addition/removal of replicas (without any changed slot ownership)
  • Possiblity to be notified about failovers, i.e. multiple slots changed owner.

If we go ahead with __cluster__:moved only for individual slot migrations, we can do the future notifications in different channels.

The larger topology changes can't be communicated in the notification itself. The client will need to fetch the topology again using e.g. CLUSTER SHARDS.

The cluster bus doesn't distinguish between migrations and other actions.
This feature detects a migration by checking if exactly one slot has a new
owner.

Only one moved slot per cluster bus message or command is notified. This
is to avoid flooding the clients. In other cases, such as failovers,
clients will need to handle MOVED redirects and update the slot mapping
accordingly.

A special pubsub channel `__cluster__:moved` is used (naming inspired by
client-side caching). Since the payload of pubsub messages are strings,
it is encoded as a string on the form "MOVED slot endpoint:port", just
like a MOVED redirects.

Some special logic added to avoid creating the pubsub message if there
are no subscribers, and to adjust the port (TLS or non-TLS) to the
receiver's connection.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
@madolson madolson added the major-decision-pending Major decision pending by TSC team label Apr 11, 2024
Copy link
Member

@madolson madolson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I like the approach and idea. I'm not convinced we should keep the -MOVED syntax though. I wonder if it the format should just be __cluster__:moved SLOT <nodeid>. I don't have strong opinions about this though.

@@ -4844,17 +4851,43 @@ void clusterCron(void) {
clusterUpdateState();
}

/* Notify clients subscribed to slot moved events. */
void clusterNotifyMovedSlot(int moved_slot, list *clients) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like this should be cluster.h, it seems like all cluster implementations would want to send this type of notification.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function is using clusterNode which is only in cluster_legacy.[ch].

This separation of cluster and cluster_legacy is quite arbitrary. I don't mind that you fix it or we can just merge the two again. Then I'll rebase this PR. :)

Do you have a better idea?

src/cluster_legacy.c Outdated Show resolved Hide resolved
Comment on lines +1207 to +1213
/* For redirects, verb must start with a dash, e.g. "-ASK" or "-MOVED". */
sds clusterFormatRedirect(const char *verb, int slot, clusterNode *n, int use_tls_port) {
const char *endpoint = clusterNodePreferredEndpoint(n);
int port = clusterNodeClientPort(n, use_tls_port);
return sdscatprintf(sdsempty(), "%s %d %s:%d", verb, slot, endpoint, port);
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any clients that primarily store a map of NodeID -> Nodes as opposed to endpoint:port -> Nodes? I ask because I'm wondering if it would be useful to also return the NodeID here as well. I know python doesn't, but I'm less familiar with the other clients, but if there are nodes that don't have the main node map key'd off the endpoints, then maybe it would be easier for them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know, but since redirects use the host:port form, clients need to be able to identify them by this.

@zuiderkwast
Copy link
Contributor Author

zuiderkwast commented Apr 11, 2024

I wonder if it the format should just be __cluster__:moved SLOT <nodeid>.

What do you mean by that? One string? Which one is the channel and which one is the message?

The channel is a string and the message is another string. Do you mean the message is "slot nodeid"? This would work (though I prefer host:port because clients need to use that for redirects anyway). It depends on if we call the channel something else, like just __cluster__ or __cluster__:topology, then i'd like MOVED to be part of the message.

If you want to return something else, like an array or map, we can't use pubsub. We could use RESP3 push though and require RESP3 for this feature.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
@zuiderkwast zuiderkwast force-pushed the subscribe-cluster-topology branch from 92b5db0 to 1cd6dda Compare April 12, 2024 12:52
@zuiderkwast
Copy link
Contributor Author

When thinking about client-side caching, I realized the invalidate message looks different depending on RESP version. For RESP2 it's a pubsub message on the form ["message", "__redis__:invalidate", "foo"] but in RESP3 it's just a push message on the 2-element form ["invalidate", ["foo"]].

Pubsub is messy for various reasons (you don't get a proper reply to SUBSCRIBE, these special channels do not propagate in a cluster in the same way normal channels, etc.) so maybe we should just require RESP3 for this and use a new command like CLUSTER SUBCRIBE-MOVED or something like that? The commands gets a proper reply "+OK" and the push messages can be structured properly, like ["moved", 1234, "example.com", 6379]. WDYT?

@madolson
Copy link
Member

What do you mean by that? One string? Which one is the channel and which one is the message?

Sorry. The channel name you proposed was fine __cluster__:moved but I was questioning if we needed to have the remaining string be consistent with the moved response. -MOVED is redundant, we already know it is a moved request, so I think we could drop it. One thing I don't like about the current -MOVED messages is that they are dependent on the client TLS state. It also doesn't cover the other major topology change, which is failovers. I think we could just come up with a new syntax that is easy for clients to read. __cluster__:moved + <slot id> <node id>, should be enough for clients to update their internal structures. If they don't know the node, they can re-discover the topology.

New command is interesting. I really think we should take ownership of a client to be able to play around with the complexity of implementing this suggestion in the client.

@zuiderkwast
Copy link
Contributor Author

I can implement it in hiredis-cluster and ered. It should be fairly easy.

I can probably implement failover notification too, though it's a bit harder to detect and a bit less important imho for clients.

Copy link
Member

@enjoy-binbin enjoy-binbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

So we recommend that the client subscribes to this channel. When MOVED occurs, the client can directly update the mapping information of that single slot (instead of the whloe slots mapping), right?

I'm wondering if we should send a special message for CLUSTER_MOVED_SLOT_MULTIPLE so that the client can actively update the mapping (before the MOVED error)?

Another way is, can we send channel messages directly in clusterAddSlot? I think the overhead may not be that big, so that for each slot MOVED, we can just send the message so we don't need to bother to handle CLUSTER_MOVED_SLOT_MULTIPLE.

@zuiderkwast
Copy link
Contributor Author

So we recommend that the client subscribes to this channel. When MOVED occurs, the client can directly update the mapping information of that single slot (instead of the whloe slots mapping), right?

@enjoy-binbin That's right.

I'm wondering if we should send a special message for CLUSTER_MOVED_SLOT_MULTIPLE so that the client can actively update the mapping (before the MOVED error)?

Yes, it's probably a good idea.

Another way is, can we send channel messages directly in clusterAddSlot? I think the overhead may not be that big, so that for each slot MOVED, we can just send the message so we don't need to bother to handle CLUSTER_MOVED_SLOT_MULTIPLE.

In a cluster with 3 shards, if there is a failover, 5000 slots move immediately. We have 5000 messages to every client. I think it's too much.

If we notify like "MOVED MULTIPLE", then the client can reload the mapping, but if all clients are notified at the same time, all clients will reload it at the same time. Can this be a problem?

If we detect that all slots from one node has moved to another node which was previously a replica (i.e. we detect that there was a failover) then we can send a special message for this like "MOVED ALL FROM ip1:port1 TO ip2:port2" (or "FAILOVER ip1:port1 TO ip2:port2"). This will let the client update the mapping without reloading it from the server. I think it can be good for clients. Maybe it's hard to implement it? I don't know...

@barshaul
Copy link
Contributor

Sorry. The channel name you proposed was fine __cluster__:moved but I was questioning if we needed to have the remaining string be consistent with the moved response. -MOVED is redundant, we already know it is a moved request, so I think we could drop it. One thing I don't like about the current -MOVED messages is that they are dependent on the client TLS state. It also doesn't cover the other major topology change, which is failovers. I think we could just come up with a new syntax that is easy for clients to read. __cluster__:moved + <slot id> <node id>, should be enough for clients to update their internal structures. If they don't know the node, they can re-discover the topology.

I haven't encountered a client that stores the node-id directly. If they do store it, it's usually as an additional parameter rather than a hashable key. Most clients store nodes as addr:port. If you only provide the node-id, it would necessitate further modifications to most clients to establish a mapping from node-id to the node itself. I believe returning cluster:moved followed by addr:port, similar to what clients used to receive in the MOVED error, would be the optimal approach.

@enjoy-binbin
Copy link
Member

In a cluster with 3 shards, if there is a failover, 5000 slots move immediately. We have 5000 messages to every client. I think it's too much.

If we notify like "MOVED MULTIPLE", then the client can reload the mapping, but if all clients are notified at the same time, all clients will reload it at the same time. Can this be a problem?

yes, considering we may have thousands of clients, i guess it is too much and will be a problem. But I feel like CLUSTER_MOVED_SLOT_MULTIPLE seems to be more common? After all we should rarely move just one slot in a change.

If we detect that all slots from one node has moved to another node which was previously a replica (i.e. we detect that there was a failover) then we can send a special message for this like "MOVED ALL FROM ip1:port1 TO ip2:port2" (or "FAILOVER ip1:port1 TO ip2:port2"). This will let the client update the mapping without reloading it from the server. I think it can be good for clients. Maybe it's hard to implement it? I don't know...

this seems like a good idea in failover case, i think we can take a try, it doesn’t seem difficult to implement

@zuiderkwast
Copy link
Contributor Author

yes, considering we may have thousands of clients, i guess it is too much and will be a problem. But I feel like CLUSTER_MOVED_SLOT_MULTIPLE seems to be more common? After all we should rarely move just one slot in a change.

No, in slot migration, you only move one slot at a time. When one slot is finished, the next slot is moved. With this feature, the clients will get notification for one slot after each slot is migrated. I described this in the PR above.

this seems like a good idea in failover case, i think we can take a try, it doesn’t seem difficult to implement

Maybe you can help me. :) My plan is to come back to this PR soon, when rebranding docs is finished.

@enjoy-binbin
Copy link
Member

No, in slot migration, you only move one slot at a time. When one slot is finished, the next slot is moved. With this feature, the clients will get notification for one slot after each slot is migrated. I described this in the PR above.

ohh, sorry that is right, i got it mixed up. (In Tencent Cloud we are moving a batch of slots in one change.)

Maybe you can help me. :) My plan is to come back to this PR soon, when rebranding docs is finished.

:) I've been busy lately, it may take me some time to get back involved

@madolson madolson removed the major-decision-pending Major decision pending by TSC team label May 27, 2024
@madolson
Copy link
Member

@zuiderkwast Will work on refining the PR a little bit, and propose it again before Valkey 8.

@madolson madolson added the pending-refinement This issue/request is still a high level idea that needs to be further refined label May 27, 2024
@zuiderkwast
Copy link
Contributor Author

Regarding a magic pubsub channel vs a specific RESP3 push message, I think I'll prefer the RESP3 push for the following reasons:

  • RESP3 push can contain any nested structured RESP data. The payload of a pubsub message is just a string.
  • This is supposed to be handled internally in a cluster-aware client. Possibly by a dedicated connection. Therefore, I think it's acceptable to require the clients to use RESP3 for this. This doesn't mean the client needs to support RESP3 completely. It doesn't need to expose this to the end user.
  • Magic channel names seem like a hack.

Regarding the format, I've been thinking that it'd be good to include the same information as CLUSTER SLOTS, but only for the changed slots. If we want to support a failover with lots of fragmented slot ranges, it'd be good to use the same format as CLUSTER SLOTS DENSE describe in #517 and to release both features together, so clients can implement both of them at the same time.

@zuiderkwast
Copy link
Contributor Author

Hello @nihohit, @ranshid. This one is stalled. Feel free to take over this work.

The main open question is in which format to send the push.

With a format like one push identical to one line of CLUSTER SLOTS, there is no need for clients to call CLUSTER SLOTS. I prefer this format.

I think we can ignore the ideas about CLUSTER SLOTS DENSE for now because we have implemented cached CLUSTER SLOTS instead and there is no recent activity about CLUSTER SLOTS DENSE.

@zuiderkwast zuiderkwast added the stalled No activity for a long time label Dec 16, 2024
@nihohit
Copy link
Contributor

nihohit commented Dec 16, 2024

IMO each node should send a push of only the information relevant to that node + to which node it passes slots.

@zuiderkwast
Copy link
Contributor Author

IMO each node should send a push of only the information relevant to that node + to which node it passes slots.

This means the client needs to subscribe to all primaries. It's a different idea than mine but maybe it's better and the implementation can be simpler. Feel free to open a separate PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cluster pending-refinement This issue/request is still a high level idea that needs to be further refined stalled No activity for a long time
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

[NEW] Send cluster topology changes as push messages.
5 participants