
cluster: rebalance on node add may not balance optimally when rack aware placement is in use #6058

Closed
jcsp opened this issue Aug 16, 2022 · 9 comments
Labels
area/controller kind/bug Something isn't working

Comments

@jcsp
Contributor

jcsp commented Aug 16, 2022

Travis @travisdowns noticed that members_backend::calculate_reallocations_after_node_added is using its own custom logic to try to work out which partitions to move, without taking into account any constraints (such as rack affinity). This doesn't violate constraints, because the ultimate destination of partitions is still chosen by the partition allocator (which does respect rack affinity), but it may create situations where rebalance on node add doesn't do what one would expect.

The way I'm reading it, the add code is doing its own homebrew thing to decide where to move from, but the partition allocator is getting called to decide where to move to.

I'm trying to reason about how much impact there is in a typical case. In a situation where we have r=3 partitions and 3 racks, and we add one node to a rack, all the partitions we select to move will probably re-assign successfully.
However, we would pick the wrong number of partitions to move. If we had 3 nodes, 1 in each rack, and added one additional node to one of the racks, we would need to move 50% of partitions to balance evenly between the two nodes in the rack where we added one. But we will only try to move 25% of partitions, because we divide the total replica count by the total node count.
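
To make the arithmetic concrete, here is a minimal sketch (hypothetical, not the actual members_backend code; the partition count of 4 is just an illustrative assumption) contrasting the current sizing heuristic with a rack-aware one:

```cpp
#include <cstdio>

int main() {
    // Assumed example: N partitions at replication factor 3,
    // 3 racks with one node each, then a fourth node added to rack C.
    const int partitions = 4;
    const int replication_factor = 3;
    const int replicas_total = partitions * replication_factor;   // 12

    // Current heuristic: per-node target = total replicas / total nodes.
    // The existing rack-C node only sheds down to this target,
    // i.e. 25% of its replicas.
    const int nodes_total = 4;
    const int naive_target = replicas_total / nodes_total;        // 12 / 4 = 3
    const int naive_moves = partitions - naive_target;            // 4 - 3 = 1

    // Rack-aware target: only rack-C replicas can land on the new node, so
    // the target should be (replicas in rack C) / (nodes in rack C),
    // i.e. the existing rack-C node should shed 50% of its replicas.
    const int rack_replicas = partitions;   // one replica per partition in rack C
    const int rack_nodes = 2;
    const int rack_target = rack_replicas / rack_nodes;           // 4 / 2 = 2
    const int rack_moves = partitions - rack_target;              // 4 - 2 = 2

    std::printf("naive moves: %d, rack-aware moves: %d\n", naive_moves, rack_moves);
    return 0;
}
```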

@jcsp jcsp added kind/bug Something isn't working area/controller labels Aug 16, 2022
@mattschumpert

@jcsp this really does seem like a bug, and one that should be fixed in Q3. Given that we've had rack awareness since 22.1, it seems obvious that node add and other workflows should respect it. Can we fix this in Q3 (even at the expense of something else)?

cc @piyushredpanda, @mmedenjak

@jcsp
Contributor Author

jcsp commented Sep 15, 2022

Yep, it's definitely a bug, although perhaps not a critical one: if we get too close to e.g. disk full then the data rebalancer will kick in and rearrange things.

@mmedenjak
Contributor

Possibly related: #6355

@jcsp
Contributor Author

jcsp commented Sep 15, 2022

The distinction between this and #6355 is:

  • This ticket is about a case where on node add we don't move enough partitions to fill the new node
  • The other ticket is about maybe making rack affinity a hard constraint rather than a soft constraint in general.

@mattschumpert

OK, re-reading this again: it's indeed not absolutely critical, more of an optimization.

I would expect the balancer to kick in anyway and help fix the problem (and that it would do so correctly in a multi-AZ situation, with some respect for the racks). Is that true?

RE #6355, rack placement really should be a hard constraint (at the very least, big alarms should go off if we can't have 2- or 3-rack redundancy because we're out of disk). Rack placement is a fundamental contract we have with the application in the most important respect: protection against data loss.

@ztlpn
Contributor

ztlpn commented Sep 15, 2022

@mattschumpert Yeah, just to clarify, the problem described in this issue is not that node add doesn't respect rack awareness, but rather that placement is non-optimal when rack awareness is enabled.

An example:

Suppose we have 3 nodes in AZs A, B, and C and 4 partitions. Before the new node is added we have:

  • node 1: p1r1, p2r1, p3r1, p4r1
  • node 2: p1r2, p2r2, p3r2, p4r2
  • node 3: p1r3, p2r3, p3r3, p4r3

Now suppose we add the fourth node in AZ C. Optimally we would expect something like:

  • node 1: p1r1, p2r1, p3r1, p4r1
  • node 2: p1r2, p2r2, p3r2, p4r2
  • node 3: p1r3, p2r3
  • node 4: p3r3, p4r3

But right now we get:

  • node 1: p1r1, p2r1, p3r1, p4r1
  • node 2: p1r2, p2r2, p3r2, p4r2
  • node 3: p1r3, p2r3, p3r3
  • node 4: p4r3

Node 4 ends up with fewer partitions than we would expect.
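
For illustration, a tiny hypothetical sketch (not the real partition allocator) that reproduces the optimal rack-C split above by dividing the rack's replicas evenly across its nodes:

```cpp
#include <cstdio>
#include <map>
#include <string>
#include <vector>

int main() {
    // Replicas that lived on node 3 (the only rack-C node) before the add.
    const std::vector<std::string> rack_c_replicas = {"p1r3", "p2r3", "p3r3", "p4r3"};
    // Nodes now present in rack C.
    const std::vector<int> rack_c_nodes = {3, 4};

    // Split the rack's replicas into contiguous chunks, one chunk per node,
    // so that per-node counts differ by at most one.
    const size_t per_node =
        (rack_c_replicas.size() + rack_c_nodes.size() - 1) / rack_c_nodes.size();

    std::map<int, std::vector<std::string>> assignment;
    for (size_t i = 0; i < rack_c_replicas.size(); ++i) {
        assignment[rack_c_nodes[i / per_node]].push_back(rack_c_replicas[i]);
    }

    // Prints: node 3: p1r3 p2r3 / node 4: p3r3 p4r3
    for (const auto& [node, replicas] : assignment) {
        std::printf("node %d:", node);
        for (const auto& r : replicas) std::printf(" %s", r.c_str());
        std::printf("\n");
    }
    return 0;
}
```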

@mattschumpert

mattschumpert commented Sep 15, 2022

Would continuous balancing even it out over time (even if not immediately)?

In any case, I am more concerned about #6355 due to the fragility we could seemingly have, with '1 replica' partitions left after a single AZ failure (a relatively common event).

@ztlpn
Contributor

ztlpn commented Sep 15, 2022

Would continuous balancing even it out over time (even if not immediately)?

No, not as long as all nodes stay within the disk usage limit.

In any case, I am more concerned about #6355 due to the fragility we seemingly could have with '1 replica' partitions on a single AZ failure (a relatively common event)

Agreed.

@ztlpn
Contributor

ztlpn commented Jun 22, 2023

BTW this was fixed by #11366

@ztlpn ztlpn closed this as completed Jun 22, 2023