
[NEW] Compact variant CLUSTER SLOTS DENSE #517

Open
madolson opened this issue May 19, 2024 · 17 comments
Labels
major-decision-pending Major decision pending by TSC team

Comments

@madolson
Member

madolson commented May 19, 2024

One recurring issue that we at AWS have noticed is that, over time, clusters naturally have their slot ranges become fragmented across primaries. For example, if you have 2 nodes, a defragmented layout would be primary 1 owning slots 0-8192 and primary 2 owning slots 8193-16383. A maximally fragmented cluster would have primary 1 own all even slots (0, 2, 4, ...) and primary 2 own all odd slots (1, 3, 5, ...). Whenever you do a rebalance operation, even if you start with contiguous ranges, it's not possible to maintain contiguous ranges if you are moving the minimum number of slots.

Fragmented clusters don't cause performance issues for get/set operations, but they do cause performance degradation during topology commands. CLUSTER SLOTS is the worst offender, as it emits a node's full topology information for each slot range the node owns. CLUSTER SHARDS was an attempt to mitigate this, because it outputs the information for each shard once and only once, and represents a node's slots as a list of start and stop ranges. However, shards have not been widely adopted by clients. Client maintainers have also requested changes to the behavior of CLUSTER SHARDS.

Implement defragmentation logic

We could add a new operation to valkey-cli that rebalances slots to "defragment" the slot distribution and get back to contiguous ranges. Operators could run this operation periodically when they notice highly fragmented clusters.

Implement CLUSTER SHARDS TOPOLOGY so that shards omits non-deterministic information

As discussed in #411 (comment), we could modify the CLUSTER SHARDS command to omit the non-deterministic information about the cluster.

Implement CLUSTER SLOTS DENSE client capability.

In the same vein as CLUSTER SHARDS TOPOLOGY, we could also update CLUSTER SLOTS to support returning a compact slot range representation. In the current format of the command, fields 1 and 2 are the start and stop of a single range, but clients could dynamically detect whether each field is an integer or an array of starts/stops. Clients could opt in to this functionality either by sending a custom command, CLUSTER SLOTS COMPACT, or via a client capability. The new CLUSTER SLOTS output might look something like:

> CLUSTER SLOTS
1) 1) 1) (integer) 0  -- Start of range 1
      2) (integer) 10000 -- Start of range 2
   2) 1) (integer) 5460 -- End of range 1
      2) (integer) 12000 -- End of range 2
   3) 1) "127.0.0.1"
      2) (integer) 30001
      3) "09dbe9720cda62f7865eabc5fd8857c5d2678366"
      4) 1) hostname
         2) "host-1.valkey.example.com"
   4) 1) "127.0.0.1"
      2) (integer) 30004
      3) "821d8ca00d7ccf931ed3ffc7e3db0599d2271abf"
      4) 1) hostname
         2) "host-2.valkey.example.com"
2) 1) 1) (integer) 5461 -- Start of range 1
      2) (integer) 12001 -- Start of range 2
   2) 1) (integer) 9999 -- End of range 1
      2) (integer) 16383 -- End of range 2
   3) 1) "127.0.0.1"
      2) (integer) 30002
      3) "c9d93d9f2c0c524ff34cc11838c2003d8c29e013"
      4) 1) hostname
         2) "host-3.valkey.example.com"
   4) 1) "127.0.0.1"
      2) (integer) 30005
      3) "faadb3eb99009de4ab72ad6b6ed87634c7ee410f"
      4) 1) hostname
         2) "host-4.valkey.example.com"
@barshaul
Contributor

I definitely feel that CLUSTER SLOTS had it almost right, and that CLUSTER SHARDS adds functionality that most clients don't need. Since most clients already use CLUSTER SLOTS, adopting this option should be relatively easy. However, note that we still need to sort the replicas in the current CLUSTER SLOTS implementation, not just compact the slots data.

@madolson madolson added the major-decision-pending Major decision pending by TSC team label May 22, 2024
zuiderkwast pushed a commit that referenced this issue May 24, 2024
Undeprecate cluster slots command. This command is widely used by
clients to form the cluster topology and with the recent change to
improve performance of `CLUSTER SLOTS` command via #53 as well as us
looking to further improve the usability via #517, it makes sense to
undeprecate this command.

---------

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
@hpatro
Contributor

hpatro commented May 24, 2024

Implement CLUSTER SLOTS DENSE client capability.

CLUSTER SLOTS is one of the most widely used commands by clients for topology discovery. This would be a good feature addition and would solve the large output response on fragmented slots. I think we can introduce CLIENT CAPA <feature> <yes|no> to enable/disable the feature rather than introducing another level of subcommand.

@zuiderkwast
Contributor

I think we should just do better marketing for CLUSTER SHARDS. It's very little extra information that the clients can simply ignore, so it's no real waste of bandwidth. It's also pretty trivial to parse, given the client can parse RESP, so that can hardly be a real problem.

Of the ideas suggested, I only support implementing defragmentation logic in valkey-cli, or at least making sure it's smart when rebalancing so that it minimizes creating fragmentation.

IMO we should instead focus on this:

  • Improve the docs. Let's show an example of CLUSTER SHARDS in RESP3, so you actually see the reply as a map, which it was designed for.
  • Explain the benefits, both in the docs of CLUSTER SHARDS and CLUSTER SLOTS.
  • Give it more time. CLUSTER SHARDS is still new. Clients want to support nodes running all still supported Redis OSS versions. The motivation for moving to CLUSTER SHARDS hasn't been strong enough, given they'll need to have a fallback anyway, but once all versions that don't support SHARDS are EoL, the situation will be different.
  • It may not be a frequent problem. To get very scattered slot distributions, you'd need to scale up multiple times (e.g. add one shard at a time) without paying attention to the slots. It's possible for cluster operators (not only valkey-cli) to be smarter and avoid creating fragmented slot ownership. Always doubling or halving the number of shards is one way to completely avoid fragmentation, if done properly.

@zuiderkwast
Contributor

zuiderkwast commented May 24, 2024

... and if we do want to support DENSE, I think the slot ranges should be represented as a flat multi-range list [start1, end1, start2, end2, ...] rather than one list of starts and another list of ends. It's more intuitive (IMAO). For a single-range shard, it's identical to a non-DENSE reply.
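
A tiny sketch of how that flat form could be unpacked on the client side (purely illustrative; the helper name and example values are made up):

def ranges_from_flat(slots):
    # Pair up consecutive entries: [0, 5460, 10000, 12000] -> [(0, 5460), (10000, 12000)]
    return list(zip(slots[0::2], slots[1::2]))

assert ranges_from_flat([0, 5460, 10000, 12000]) == [(0, 5460), (10000, 12000)]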

@madolson
Member Author

madolson commented May 24, 2024

... and if we do want to support DENSE, I think the slot ranges should be represented as a flat multi-range list [start1, end1, start2, end2, ...] rather than one list of starts and another list of ends. It's more intuitive (IMAO). For a single-range shard, it's identical to a non-DENSE reply.

The only reason I suggested the other approach is that it keeps the total number of arguments the same in all cases :) I'm not very convinced it's useful for them to be the same in specific edge cases.

Improve the docs. Let's show an example of CLUSTER SHARDS in RESP3, so you actually see the reply as a map, which it was designed for.

I honestly am not really convinced the map response is ideal for this case anymore. We are basically spending a bunch of bits in the network response to send information the client already knows. It's only useful if a human is reading it ad hoc.

Give it more time. CLUSTER SHARDS is still new. Clients want to support nodes running all still supported Redis OSS versions. The motivation for moving to CLUSTER SHARDS hasn't been strong enough, given they'll need to have a fallback anyway, but once all versions that don't support SHARDS are EoL, the situation will be different.

I don't agree, because client developers (see Bar) aren't happy with it. I think we should be opinionated about the API. I think our original approach of moving to a new command didn't really work as well as we thought it would.

@zuiderkwast
Contributor

OK, I'm convinced. 👍

@zuiderkwast
Contributor

zuiderkwast commented Jun 10, 2024

I like CLUSTER SLOTS DENSE (or COMPACT), but I do think the slots should be a flat list of start and end indices. That's the slot format used in CLUSTER SHARDS and CLUSTER ADDSLOTSRANGE.

The idea about two lists that need to be iterated in parallel isn't great.

Another option we could add to CLUSTER SLOTS is NO-REPLICAS, that clients can use if they only care about primaries.

We should introduce these changes together with #298, and a possible format for push notifications is to use the same format as one line of CLUSTER SLOTS DENSE. Clients can then implement both features at the same time.

@zuiderkwast zuiderkwast changed the title [NEW] Compacting the output of topology commands for fragmented clusters [NEW] Compact variant CLUSTER SLOTS DENSE Jun 18, 2024
@zuiderkwast
Contributor

I renamed it so it's easier to find when searching for it. :)

@madolson
Member Author

madolson commented Aug 5, 2024

@barshaul Also, the cluster slots output is sorted now, so it should be deterministic.

@barshaul
Contributor

barshaul commented Aug 7, 2024

@barshaul Also, the cluster slots output is sorted now, so it should be deterministic.

Nice, thanks. Was sorting the replicas added in version 8.0?

However, it's still not ideal to use cluster slots because of the potential for very large outputs. While sorting the replicas reduces computation on the client side, it still exposes the client to delays due to large responses.

I believe we should establish guidelines on how clients should manage and update their topology, which will help us determine the best command strategy. Valkey can lead the way in standardizing best practices for OSS clients.

I can create a document outlining client design for handling cluster topology changes and share it with you for feedback. What do you think?

@madolson
Member Author

madolson commented Aug 7, 2024

Nice, thanks. Was sorting the replicas added in version 8.0?

Yes, see #265. Replicas are ordered in CLUSTER SLOTS.

However, it's still not ideal to use cluster slots because of the potential for very large outputs. While sorting the replicas reduces computation on the client side, it still exposes the client to delays due to large responses.

Yeah, that is sort of option three from the top of the issue. If the client sends CLIENT CAPA DENSE-SLOTS, the cluster will send a more compact version of the CLUSTER SLOTS output, where the additional slot ranges associated with the node are also sent. Clients need to be smart enough to realize they were sent this dense output and handle it appropriately.
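
A rough sketch of what that opt-in flow could look like from a client's perspective (hypothetical: the "dense-slots" capability name and the redis-py-style execute_command client are assumptions, not an existing API):

def fetch_topology(conn):
    try:
        # Hypothetical capability; an older server would reject it.
        conn.execute_command("CLIENT", "CAPA", "dense-slots")
    except Exception:
        pass  # fall back to the classic CLUSTER SLOTS reply shape
    reply = conn.execute_command("CLUSTER", "SLOTS")
    return list(parse_cluster_slots(reply))  # parser from the earlier sketch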

I can create a document outlining client design for handling cluster topology changes and share it with you for feedback. What do you think?

I think this would be great. You might consider starting it here, https://github.com/valkey-io/valkey-rfc, then we can merge it into the valkey doc repo when it's ready.

@zuiderkwast
Contributor

Taking a step back, I wonder if it's possible to scale in a way that completely avoids creating fragmented slot ranges.

I read about consistent hashing. Translated to our terminology with cluster slots and shards, it means that each node has a single slot range. When a new shard is added, we would just split one of the ranges of another node. This means the ranges will not be of equal size, though if the number of nodes is doubled, they will be.

A bunch of databases (including Cassandra) seem to use this algorithm for sharding. See examples.

If ranges are of different sizes, or some slots are larger or hotter than others, then we could redistribute slots by just transferring slots between neighbours, so we keep the property that each shard has only a single range.
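
A rough sketch of the single-range-per-shard splitting described above (a hypothetical helper, not an existing Valkey mechanism): each shard owns exactly one contiguous range, and a new shard takes the upper half of the currently largest range.

def add_shard(ranges):
    # ranges: shard name -> (start, end), each shard owning one contiguous range
    shard, (start, end) = max(ranges.items(), key=lambda kv: kv[1][1] - kv[1][0])
    mid = (start + end) // 2
    ranges[shard] = (start, mid)                   # split the largest range in half
    ranges[f"new-{len(ranges)}"] = (mid + 1, end)  # the new shard takes the upper half
    return ranges

print(add_shard({"a": (0, 8191), "b": (8192, 16383)}))
# {'a': (0, 4095), 'b': (8192, 16383), 'new-2': (4096, 8191)}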

It's already possible to assign slots to nodes in this way. If valkey-cli resharding/rebalancing creates fragmented slot ranges, then we can adjust valkey-cli to avoid that.

@madolson
Member Author

I think the main downsides are that we have to do more work to figure out which shard a slot maps to (it's O(log N), where N is the number of ranges) and it's more expensive to do migration unless we maintain an ordered index.

If ranges are of different size, or some slots are larger or more hot than other slots, then we could redistribute slots by just transferring slots between neighbours, so we keep this property that each shard has only a single range.

This isn't a property of consistent hashing; we could do that with our slot-based implementation. The problem is that when you have three or more shards and are adding a 4th, there is no way to end up with 4 contiguous ranges by only moving slots to the new node. We would be able to get contiguous ranges by also moving slots between the existing nodes, though.
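
For context on the lookup cost mentioned above, here is a minimal sketch (illustrative range boundaries and shard names, not real configuration) of how a slot-to-shard lookup over sorted contiguous ranges stays logarithmic in the number of ranges:

import bisect

range_starts = [0, 4096, 8192, 12288]   # hypothetical range boundaries, one range per shard
shards = ["shard-a", "shard-b", "shard-c", "shard-d"]

def shard_for_slot(slot):
    # Binary search over the sorted range starts: O(log N) for N ranges.
    return shards[bisect.bisect_right(range_starts, slot) - 1]

assert shard_for_slot(5000) == "shard-b"
assert shard_for_slot(16383) == "shard-d"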

@zuiderkwast
Contributor

we have to do more work to figure out which shard a slot maps to.

Yes, but it's bounded to 16K ranges. I don't think it's a heavy operation.

This isn't a property of consistent hashing, we could do that with our slot based implementation.

Yeah, we can achieve a single range per shard with the slot-based implementation. Consistent hashing is just for comparison. If other databases use it, they must accept the downsides that the load is not even, or that you have to move things around more.

The problem is when you have three or more shards and are adding a 4th node, there is no way to get 4 consistent ranges by only moving slots to the new node. We would be able to get a consistent range by moving slots between nodes though.

Yes, we'd have to move slots between all four shards in that case, so there are more keys to move. But you say you usually double the shards when you scale up?

When scaling down, or when just adjusting load among the shards, you can slowly move slots between neighbour shards.

A few ranges per shard is also acceptable. If it gets too fragmented, we can defrag... valkey-cli --cluster defrag. Or some incremental slot-defrag while scaling, so the fragmentation is bounded in some way. 🤔

@madolson
Member Author

A few ranges per shard is also acceptable. If it gets too fragmented, we can defrag... valkey-cli --cluster defrag. Or some incremental slot-defrag while scaling, so the fragmentation is bounded in some way. 🤔

I proposed that in the top issue as well. Maybe I'll make a dedicated issue for that to see if someone wants to implement it? It seems like a nice to have regardless.

@madolson
Member Author

Yeah, we can achieve a single range per shard with the slot-based implementation. Consistent hashing is just for comparison. If other databases use it, they must accept the downsides that the load is not even, or that you have to move things around more.

I think Redis was always so indexed on speed that the overhead of consistent hashing was assumed to be bad. Maybe it's worth prototyping to evaluate whether that is true. It's maybe also worth mentioning that, as we've discussed with the main dictionary, CPUs are super fast but often just stalled on memory nowadays, so it might not impose any degradation to switch to consistent hashing. A huge drawback will be client support though 🫠.

@zuiderkwast
Contributor

If you scale from 3 shards, each holding 1/3 of the slots, to 4 shards, each holding 1/4 of the slots, and ignore fragmentation, you would add a new shard and move 1/4 of each existing shard's slots (1/12 of the total number of slots) to the new shard, i.e. you move 3/12 = 1/4 of the slots in total.

If you reshard without creating fragmentation, you would move slots like this:

  4/12             4/12            4/12
+---------------+---------------+---------------+
| A             | B             | C             |
|               |               |               |
+-----------+---+-------+-------+----+----------+
| 3         | 1 | 2     | 2     | 1  | 3        |
|-----------+---+-------+-------+----+----------+
              |             |     |
              V             V     V
+-----------+-----------+------------+----------+
| A         |    B      | D          | C        |
|           |           |            |          |
+-----------+-----------+------------+----------+
 3/12        3/12        3/12         3/12

So you move 3/12 to the new shard D, plus A moves 1/12 of the slots to B, for a total of 4/12 = 1/3 of the slots moved. Yes, it's a little more, but not too bad.
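
A quick arithmetic check of those fractions (just a sanity check, not from the thread):

from fractions import Fraction

fragmenting = 3 * Fraction(1, 12)                    # naive rebalance: each of A, B, C gives 1/12 to D
defrag_friendly = Fraction(3, 12) + Fraction(1, 12)  # D receives 3/12, plus A shifts 1/12 to B
print(fragmenting, defrag_friendly)                  # 1/4 and 1/3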
