[NEW] Compact variant CLUSTER SLOTS DENSE #517
I definitely feel that CLUSTER SLOTS had it almost right, and that CLUSTER SHARDS adds functionality that most clients don't need. Since most clients already use CLUSTER SLOTS, adopting this option should be relatively easy. However, note that we still need to sort the replicas in the current CLUSTER SLOTS implementation, not just compact the slots data.
Undeprecate cluster slots command. This command is widely used by clients to form the cluster topology, and with the recent change to improve performance of the `CLUSTER SLOTS` command via #53, as well as us looking to further improve its usability via #517, it makes sense to undeprecate this command.

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
I think we should just do better marketing for CLUSTER SHARDS. It's very little extra information that the clients can simply ignore, so it's no real waste of bandwidth. It's also pretty trivial to parse, given the client can parse RESP, so that can hardly be a real problem. Of the ideas suggested, I only support implementing defragmentation logic in valkey-cli, or at least making sure it's smart when rebalancing so that it minimizes creating fragmentation. IMO we should instead focus on this:
... and if we do want to support DENSE, I think the slot ranges should be represented as a flat multi-range list.
The only reason I suggested the other approach is that it keeps the total number of arguments the same in all cases :) I'm not very convinced it's useful for them to be the same in specific edge cases.
I honestly am not really convinced the map response is ideal for this case anymore. We are basically spending a bunch of bits in the network response to send information the client already knows. It's only useful if a human is reading it ad hoc.
I don't agree, because client developers (see Bar) aren't happy with it. I think we should be opinionated about the API. I think our original approach of moving to a new command didn't really work as well as we thought it would.
OK, I'm convinced. 👍
I like CLUSTER SLOTS DENSE (or COMPACT), but I do think the slots should be a list of start and end indices. That's the slot format used in CLUSTER SHARDS and CLUSTER ADDSLOTSRANGE. The idea about two lists that need to be iterated in parallel isn't great. Another option we could add to CLUSTER SLOTS is NO-REPLICAS, which clients can use if they only care about primaries. We should introduce these changes together with #298, and a possible format for push notifications is to use the same format as one line of CLUSTER SLOTS DENSE. Clients can then implement both features at the same time.
I renamed it so it's easier to find when searching for it. :)
@barshaul Also, the cluster slots output is sorted now, so it should be deterministic.
Nice, thanks. Was sorting the replicas added in version 8.0? However, it's still not ideal to use cluster slots because of the potential for very large outputs. While sorting the replicas reduces computation on the client side, it still exposes the client to delays due to large responses. I believe we should establish guidelines on how clients should manage and update their topology, which will help us determine the best command strategy. Valkey can lead the way in standardizing best practices for OSS clients. I can create a document outlining client design for handling cluster topology changes and share it with you for feedback. What do you think?
Yes, see #265. Replicas are ordered in `CLUSTER SLOTS`.
Yeah, that is sort of option three from the top of the issue. If the client sends
I think this would be great. You might consider starting it here, https://github.com/valkey-io/valkey-rfc, then we can merge it into the valkey doc repo when it's ready.
Taking a step back, I wonder if it's possible to scale in a way that completely avoids creating fragmented slot ranges. I read about consistent hashing. Translated to our terminology with cluster slots and shards, it means that each node has a single slot range. When a new shard is added, we would just split one of the ranges of another node (as sketched below). This means the ranges will not be of equal size, though if the number of nodes is doubled, they will be. A bunch of databases (including Cassandra) seem to use this algorithm for sharding. See examples. If ranges are of different size, or some slots are larger or hotter than other slots, then we could redistribute slots by just transferring slots between neighbours, so we keep this property that each shard has only a single range. It's already possible to assign slots to nodes in this way. If
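A minimal sketch of the split-on-add scheme (illustrative Python with made-up shard names; not Valkey code):

```python
NUM_SLOTS = 16384

def add_shard(ranges, new_shard):
    """ranges: dict shard -> (start, end), inclusive. Splits the widest
    existing range in half and gives the upper half to the new shard."""
    shard, (start, end) = max(ranges.items(),
                              key=lambda kv: kv[1][1] - kv[1][0])
    mid = (start + end) // 2
    ranges[shard] = (start, mid)        # existing shard keeps the lower half
    ranges[new_shard] = (mid + 1, end)  # new shard takes the upper half
    return ranges

ranges = {"a": (0, NUM_SLOTS - 1)}
add_shard(ranges, "b")  # {"a": (0, 8191), "b": (8192, 16383)}
add_shard(ranges, "c")  # splits one of the halves again; still one range each
```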
I think the main downsides are that we have to do more work to figure out which shard a slot maps to (it's O(log N), where N is the number of ranges), and that migration is more expensive unless we maintain an ordered index.
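A minimal sketch of that lookup, assuming the client keeps its ranges sorted by start slot (names and layout are illustrative, not an existing client API):

```python
import bisect

# Ranges sorted by start slot; each covers [start, end] inclusive.
RANGES = [(0, 4095, "shard-a"), (4096, 12287, "shard-b"), (12288, 16383, "shard-c")]
STARTS = [start for start, _, _ in RANGES]

def shard_for_slot(slot):
    # Find the last range whose start is <= slot: O(log N) in the number
    # of ranges, versus O(1) for a flat 16384-entry slot -> shard array.
    i = bisect.bisect_right(STARTS, slot) - 1
    start, end, shard = RANGES[i]
    return shard if start <= slot <= end else None
```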
This isn't a property of consistent hashing; we could do that with our slot-based implementation. The problem is when you have three or more shards and are adding a 4th node, there is no way to get 4 contiguous ranges by only moving slots to the new node. We would be able to get contiguous ranges by moving slots between nodes, though.
Yes, but it's bounded to 16K ranges. I don't think it's a heavy operation.
Yeah, we can achieve single-range-per-shard with the slot-based implementation. Consistent hashing is just for comparison. If other databases use it, they must accept the downsides: either the load is not even, or you have to move things around more.
Yes, we'd have to move slots between all four shards in that case, so there are more keys to move. But you say you usually double the shards when you scale up? When scaling down, or when just adjusting load among the shards, you can slowly move slots between neighbour shards. A few ranges per shard is also acceptable. If it gets too fragmented, we can defrag...
I proposed that in the top issue as well. Maybe I'll make a dedicated issue for that to see if someone wants to implement it? It seems like a nice-to-have regardless.
I think Redis was always so indexed on speed that the overhead of consistent hashing was assumed to be bad. Maybe it's worth prototyping to evaluate if that is true. It's maybe also worth mentioning that, as we've discussed with the main dictionary, CPUs are super fast but often just stalled on memory nowadays, so it might not impose any degradation to switch to consistent hashing. A huge drawback will be client support though 🫠.
If you scale from 3 shards, each holding 1/3 of the slots, to 4 shards, each holding 1/4 of the slots, and you ignore fragmentation, you would add a new shard and move 1/4 of each shard's slots (1/12 of the total number of slots) to the new shard, i.e. you move in total 3/12 = 1/4 of the slots. If resharding without creating fragmentation, you would move like this:
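One arrangement consistent with the totals below inserts the new shard D between B and C:

```
Before: A = [0, 4/12)  B = [4/12, 8/12)  C = [8/12, 1)
After:  A = [0, 3/12)  B = [3/12, 6/12)  D = [6/12, 9/12)  C = [9/12, 1)

Moves:  A -> B: [3/12, 4/12)  = 1/12 of the slots
        B -> D: [6/12, 8/12)  = 2/12
        C -> D: [8/12, 9/12)  = 1/12
```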
So you move 3/12 to the new shard D, plus A moves 1/12 of the slots to B. Total moved slots: 3/12 + 1/12 = 1/3. Yes, it's a little more, but not too bad.
One recurring issue that we, at AWS, have noticed is that over time clusters will naturally have slot ranges become fragmented between primaries. For example, if you have 2 nodes, a defragmented slot layout would be primary 1 owning slots 0-8191 and primary 2 owning slots 8192-16383. A maximally fragmented cluster would have primary 1 own all even slots (0, 2, 4, ...) and primary 2 own all odd slots (1, 3, 5, ...). Whenever you do a rebalance operation, even if you start with contiguous ranges, it's not possible to maintain a contiguous range if you are moving the minimum number of slots.

Fragmented clusters don't cause performance issues for get/set operations, but they do cause performance degradation during topology commands.
`CLUSTER SLOTS` is the worst offender, as it emits a node's full topology information for each slot range the node owns. `CLUSTER SHARDS` was an attempt to mitigate this, because it outputs the information for each shard once and only once, and represents the node's slots as a list of start and stop ranges. However, `CLUSTER SHARDS` has not been widely adopted by clients. Client maintainers have also requested to alter the behavior of `CLUSTER SHARDS`.

Implement defragmentation logic

We could add a new operation to valkey-cli that does a rebalance operation to "defragment" the slot distribution and get back to contiguous ranges. Operators can run this operation periodically when they notice highly fragmented clusters.
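As a sketch only (illustrative Python, not the actual valkey-cli implementation; the data layout is an assumption), such an operation could compute contiguous target ranges and the moves needed to reach them:

```python
NUM_SLOTS = 16384  # size of the Valkey slot space

def defrag_plan(owner):
    """owner: list of NUM_SLOTS entries, owner[slot] -> shard id.
    Returns contiguous target ranges (each shard keeps the number of
    slots it already holds) and the slot moves needed to get there."""
    counts = {}
    for shard in owner:
        counts[shard] = counts.get(shard, 0) + 1
    # Lay the shards out back to back, so each one ends up owning a
    # single contiguous range.
    target, start = {}, 0
    for shard, n in sorted(counts.items()):
        target[shard] = (start, start + n - 1)
        start += n
    # List every slot whose owner changes under the target layout.
    moves = [(slot, owner[slot], shard)
             for shard, (lo, hi) in target.items()
             for slot in range(lo, hi + 1)
             if owner[slot] != shard]
    return target, moves
```

A real implementation would also choose the shard ordering that minimizes the number of moved slots; this sketch just orders shards by id.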
Implement `CLUSTER SHARDS TOPOLOGY` so that shards omits non-deterministic information

As discussed in #411 (comment), we could modify the `CLUSTER SHARDS` command to omit the non-deterministic information about the cluster.
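For illustration only (the exact field selection is part of the #411 discussion, not something settled here), a trimmed reply might keep the stable topology fields and drop per-node state such as replication-offset and health:

```
1) 1) "slots"
   2) 1) (integer) 0
      2) (integer) 5460
   3) "nodes"
   4) 1) 1) "id"
         2) "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
         3) "ip"
         4) "127.0.0.1"
         5) "port"
         6) (integer) 6379
         7) "role"
         8) "master"
```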
Implement `CLUSTER SLOTS DENSE` client capability

In the same vein as cluster shards topology, we could also update `CLUSTER SLOTS` to support returning a compact slot range. The current format of the command has fields 1 and 2 be the start and stop slots, but clients could dynamically detect whether the first field is an integer or an array of start/stop pairs. Clients can opt in to this functionality either by sending a custom command, `CLUSTER SLOTS COMPACT`, or via a client capability. The new `CLUSTER SLOTS` output might look something like:
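(Illustrative values only; the exact shape was still under discussion in the comments above.) An entry whose first field has become a flat multi-range array might render as:

```
1) 1) 1) (integer) 0
      2) (integer) 4095
      3) (integer) 8192
      4) (integer) 12287
   2) 1) "127.0.0.1"
      2) (integer) 6379
      3) "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
   3) 1) "127.0.0.1"
      2) (integer) 6380
      3) "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
```

That is, one entry for a primary owning slots 0-4095 and 8192-12287, followed by its replica, using the same node format `CLUSTER SLOTS` already emits. A client could then detect the variant per entry, as described above (a sketch, not any particular client library):

```python
def parse_slots_entry(entry):
    first = entry[0]
    if isinstance(first, list):        # DENSE: flat [s1, e1, s2, e2, ...]
        ranges = list(zip(first[0::2], first[1::2]))
        nodes = entry[1:]
    else:                              # classic: start, end, node, node, ...
        ranges = [(first, entry[1])]
        nodes = entry[2:]
    return ranges, nodes
```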