cluster: rebalance on node add may not balance optimally when rack aware placement is in use #6058
Comments
@jcsp this really seems like a bug, and one that should be fixed in Q3. It seems obvious that, since we've had rack awareness since 22.1, node add and other workflows should respect it. Can we fix this in Q3 (even at the expense of something else)?
Yep, it's definitely a bug, although perhaps not a critical one: if we get too close to e.g. disk full, the data rebalancer will kick in and rearrange things.
Possibly related: #6355
The distinction between this and #6355 is that the problem here does not violate rack constraints; placement after node add is merely sub-optimal.
OK, re-reading this again: not absolutely critical indeed, but more of an optimization. I would expect the balancer to kick in anyway and help fix the problem (and that it would do so correctly in a multi-AZ situation, with some respect for the racks). Is that true? Re #6355, rack placement really should be a hard constraint (at the very least, big alarms should go off if we can't have 2- or 3-rack redundancy because we're out of disk). Rack placement is a fundamental contract we have with the application in the most important way (against data loss).
@mattschumpert Yeah, just to clarify, the problem described in this issue is not that node add doesn't respect rack awareness, but rather that placement is non-optimal when rack awareness is enabled. An example: suppose we have 3 nodes in AZs A, B, and C, and 4 partitions. Before the new node is added we have:
Now suppose we add a fourth node in AZ C. Optimally we would expect something like:
But right now we get:
Node 4 ends up with fewer partitions than we would expect.
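The tables from the original comment are not preserved above, but the arithmetic can be sketched under an assumed replication factor of 3 (one replica per AZ). The numbers below are illustrative, not taken from the issue:

```cpp
// Hedged sketch of the arithmetic behind the example above; replication
// factor 3 and the per-node counts are assumptions, not the original tables.
#include <cstdio>

int main() {
    const int partitions = 4;
    const int replication_factor = 3;                            // one replica per AZ
    const int total_replicas = partitions * replication_factor;  // 12

    // With rack-aware placement, each rack must hold one replica of every
    // partition, i.e. 4 replicas per rack, regardless of node count.
    const int replicas_per_rack = partitions;

    // After the add, rack C contains nodes 3 and 4, so a balanced split
    // inside rack C is 2 replicas per node; racks A and B keep 4 each.
    const int rack_aware_share_node4 = replicas_per_rack / 2;

    // A rack-unaware target would instead aim for total/nodes = 3 per node,
    // which node 4 cannot reach without breaking rack placement.
    const int naive_target = total_replicas / 4;

    std::printf("rack-aware fair share for node 4: %d, rack-unaware target: %d\n",
                rack_aware_share_node4, naive_target);
    return 0;
}
```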
Would continuous balancing even it out over time (even if not immediately)? In any case, I am more concerned about #6355, due to the fragility we could seemingly end up with: partitions left at 1 replica after a single-AZ failure (a relatively common event).
As long as no node violates the disk usage limit, no.
Agreed.
BTW this was fixed by #11366
Travis @travisdowns noticed that members_backend::calculate_reallocations_after_node_added uses its own custom logic to work out which partitions to move, without taking any constraints (such as rack affinity) into account. This doesn't violate constraints, because the ultimate destination of partitions is still chosen by the partition allocator (which does respect rack affinity), but it can create situations where rebalance on node add doesn't do what one would expect, as the sketch below illustrates.
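A minimal, self-contained sketch of how a rack-unaware selection step combined with a rack-aware allocation step can leave the new node under-filled. The types and logic are hypothetical and do not mirror the Redpanda source; they only reproduce the shape of the problem under the assumptions from the example above (4 partitions, replication factor 3, one original node per AZ):

```cpp
// Hypothetical illustration only: names and logic do not match the Redpanda
// source, they just model "rack-unaware selection, rack-aware allocation".
#include <cstdio>
#include <map>
#include <set>
#include <string>
#include <vector>

struct Replica { int partition; int node; };

int main() {
    // Assumed setup: 4 partitions, replication factor 3, one node per AZ,
    // so nodes 1 (rack A), 2 (rack B) and 3 (rack C) each hold every partition.
    std::map<int, std::string> rack = {{1, "A"}, {2, "B"}, {3, "C"}, {4, "C"}};
    std::vector<Replica> replicas;
    for (int p = 0; p < 4; ++p)
        for (int n : {1, 2, 3})
            replicas.push_back({p, n});

    // Step 1 (rack-unaware selection, as the reallocation logic is described
    // above): the target load is total / node count, so each old node is
    // asked to give up one replica toward node 4.
    const int target = static_cast<int>(replicas.size()) / 4; // 12 / 4 = 3
    std::map<int, int> load;
    for (const auto& r : replicas) ++load[r.node];
    std::vector<Replica> moves;
    std::set<int> donated; // pick a distinct partition from each donor node
    for (int n : {1, 2, 3}) {
        for (const auto& r : replicas) {
            if (load[n] <= target) break;
            if (r.node == n && !donated.count(r.partition)) {
                moves.push_back(r);
                donated.insert(r.partition);
                --load[n];
                break;
            }
        }
    }

    // Step 2 (rack-aware allocation): a proposed move may land on node 4 only
    // if the partition would still have one replica in each of racks A, B, C.
    int placed_on_node4 = 0;
    for (const auto& m : moves) {
        std::set<std::string> racks_after;
        for (const auto& r : replicas)
            if (r.partition == m.partition && r.node != m.node)
                racks_after.insert(rack[r.node]);
        racks_after.insert(rack[4]); // the proposed destination, rack C
        if (racks_after.size() == 3)
            ++placed_on_node4; // still one replica per rack: move accepted
        // otherwise the allocator has to place the replica somewhere else
    }

    std::printf("selected %zu moves; %d landed on node 4 "
                "(rack-aware fair share would be 2)\n",
                moves.size(), placed_on_node4);
    return 0;
}
```

Run as written, the selection step proposes one move from each old node, but only the move sourced from rack C can actually land on node 4, so node 4 ends up with 1 replica instead of its rack-aware fair share of 2.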