
cluster: rebalance on node add may not balance optimally when rack aware placement is in use #6058

Closed
jcsp opened this issue Aug 16, 2022 · 9 comments
Labels
area/controller kind/bug Something isn't working

Comments

@jcsp
Contributor

jcsp commented Aug 16, 2022

Travis @travisdowns noticed that members_backend::calculate_reallocations_after_node_added is using its own custom logic to try to work out which partitions to move, without taking into account any constraints (such as rack affinity). This doesn't violate constraints, because the ultimate destination of partitions is still chosen by the partition allocator (which does respect rack affinity), but it may create situations where rebalance on node add doesn't do what one would expect.

The way I'm reading it, the add code is doing its own homebrew thing to decide where to move from, but the partition allocator is getting called to decide where to move to.

I'm trying to reason about how much impact there is in a typical case. In a situation where we have r=3 partitions and 3 racks, and we add one node to a rack, all the partitions we select to move will probably re-assign successfully.
However, we would pick the wrong number of partitions to move. If we had 3 nodes, 1 in each rack, and added one additional node to one of the racks, we would need to move 50% of partitions to balance evenly between the two nodes in the rack where we added one. But we will only try to move 25% of partitions, because we divide the total replica count by the total node count.
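
To make the arithmetic concrete, here is a minimal sketch (hypothetical, not the actual members_backend code; the partition count of 4 is just an illustrative assumption) contrasting the current sizing heuristic with a rack-aware one:

```cpp
#include <cstdio>

int main() {
    // Assumed example: N partitions at replication factor 3,
    // 3 racks with one node each, then a fourth node added to rack C.
    const int partitions = 4;
    const int replication_factor = 3;
    const int replicas_total = partitions * replication_factor;   // 12

    // Current heuristic: per-node target = total replicas / total nodes.
    // The existing rack-C node only sheds down to this target,
    // i.e. 25% of its replicas.
    const int nodes_total = 4;
    const int naive_target = replicas_total / nodes_total;        // 12 / 4 = 3
    const int naive_moves = partitions - naive_target;            // 4 - 3 = 1

    // Rack-aware target: only rack-C replicas can land on the new node, so
    // the target should be (replicas in rack C) / (nodes in rack C),
    // i.e. the existing rack-C node should shed 50% of its replicas.
    const int rack_replicas = partitions;   // one replica per partition in rack C
    const int rack_nodes = 2;
    const int rack_target = rack_replicas / rack_nodes;           // 4 / 2 = 2
    const int rack_moves = partitions - rack_target;              // 4 - 2 = 2

    std::printf("naive moves: %d, rack-aware moves: %d\n", naive_moves, rack_moves);
    return 0;
}
```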

@jcsp jcsp added kind/bug Something isn't working area/controller labels Aug 16, 2022
@mattschumpert

@jcsp this really does seem like a bug, and one that should be fixed in Q3. Given that we've had rack awareness since 22.1, it seems obvious that node add and other workflows should respect it. Can we fix this in Q3 (even at the expense of something else)?

cc @piyushredpanda, @mmedenjak

@jcsp
Contributor Author

jcsp commented Sep 15, 2022

Yep, it's definitely a bug, although perhaps not a critical one: if we get too close to e.g. disk full then the data rebalancer will kick in and rearrange things.

@mmedenjak
Contributor

Possibly related: #6355

@jcsp
Contributor Author

jcsp commented Sep 15, 2022

The distinction between this and #6355 is:

  • This ticket is about a case where on node add we don't move enough partitions to fill the new node
  • The other ticket is about maybe making rack affinity a hard constraint rather than a soft constraint in general.

@mattschumpert

OK, re-reading this again: it's indeed not absolutely critical, more of an optimization.

I would expect the balancer to kick in anyway and help fix the problem (and that it would do so correctly in a multi-AZ situation, with some respect for the racks). Is that true?

RE #6355, rack placement really should be a hard constraint (at the very least, big alarms should go off if we can't have 2- or 3-rack redundancy because we're out of disk). Rack placement is a fundamental contract we have with the application in the most important respect: protection against data loss.

@ztlpn
Contributor

ztlpn commented Sep 15, 2022

@mattschumpert Yeah, just to clarify, the problem described in this issue is not that node add doesn't respect rack awareness, but rather that placement is non-optimal when rack awareness is enabled.

An example:

Suppose we have 3 nodes in AZs A, B, and C and 4 partitions. Before the new node is added we have:

  • node 1: p1r1, p2r1, p3r1, p4r1
  • node 2: p1r2, p2r2, p3r2, p4r2
  • node 3: p1r3, p2r3, p3r3, p4r3

Now suppose we add the fourth node in AZ C. Optimally we would expect something like:

  • node 1: p1r1, p2r1, p3r1, p4r1
  • node 2: p1r2, p2r2, p3r2, p4r2
  • node 3: p1r3, p2r3
  • node 4: p3r3, p4r3

But right now we get:

  • node 1: p1r1, p2r1, p3r1, p4r1
  • node 2: p1r2, p2r2, p3r2, p4r2
  • node 3: p1r3, p2r3, p3r3
  • node 4: p4r3

Node 4 ends up with fewer partitions than we would expect.
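
For illustration, a tiny hypothetical sketch (not the real partition allocator) that reproduces the optimal rack-C split above by dividing the rack's replicas evenly across its nodes:

```cpp
#include <cstdio>
#include <map>
#include <string>
#include <vector>

int main() {
    // Replicas that lived on node 3 (the only rack-C node) before the add.
    const std::vector<std::string> rack_c_replicas = {"p1r3", "p2r3", "p3r3", "p4r3"};
    // Nodes now present in rack C.
    const std::vector<int> rack_c_nodes = {3, 4};

    // Split the rack's replicas into contiguous chunks, one chunk per node,
    // so that per-node counts differ by at most one.
    const size_t per_node =
        (rack_c_replicas.size() + rack_c_nodes.size() - 1) / rack_c_nodes.size();

    std::map<int, std::vector<std::string>> assignment;
    for (size_t i = 0; i < rack_c_replicas.size(); ++i) {
        assignment[rack_c_nodes[i / per_node]].push_back(rack_c_replicas[i]);
    }

    // Prints: node 3: p1r3 p2r3 / node 4: p3r3 p4r3
    for (const auto& [node, replicas] : assignment) {
        std::printf("node %d:", node);
        for (const auto& r : replicas) std::printf(" %s", r.c_str());
        std::printf("\n");
    }
    return 0;
}
```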

@mattschumpert

mattschumpert commented Sep 15, 2022

Would continuous balancing even it out over time (even if not immediately)?

In any case, I am more concerned about #6355 due to the fragility we could seemingly have, with '1 replica' partitions left after a single AZ failure (a relatively common event).

@ztlpn
Contributor

ztlpn commented Sep 15, 2022

Would continuous balancing even it out over time (even if not immediately)?

No, not as long as all nodes stay within the disk usage limit.

In any case, I am more concerned about #6355 due to the fragility we seemingly could have with '1 replica' partitions on a single AZ failure (a relatively common event)

Agreed.

@ztlpn
Contributor

ztlpn commented Jun 22, 2023

BTW this was fixed by #11366

@ztlpn ztlpn closed this as completed Jun 22, 2023