-
Notifications
You must be signed in to change notification settings - Fork 593
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Partition balancer: integrate partition reclaimable size #13480
Partition balancer: integrate partition reclaimable size #13480
Conversation
/ci-repeat 2 |
5296199
to
a7eb66d
Compare
/ci-repeat 2 |
/ci-repeat 2 |
The reason is twofold: 1. We have another option for controlling the concurrency of the movement batch - partition_autobalancing_concurrent_moves which is easier to understand 2. Part of initial motivation is gone - in the past we waited for the whole batch to finish before scheduling the next movements so it was important for the batch to finish in a reasonable amount of time, but now we can schedule new movement while the old ones are still in progress.
With the introduction of space management data eviction becomes lazy: partitions can grow beyond their eviction limit if there is sufficient space to do it and only if disk space pressure becomes high, this "extra" data is evicted. For partition balancing this means two things: 1) partitions sizes can vary widely between replicas and 2) when predicting how much space the partition will take on a node if it is moved there, we must take into account the possibility of reclaim (otherwise we won't allow moves that are perfectly possible if reclaim is applied). So this commit we 1) store current sizes in a per-replica map instead of a single value and 2) use non-reclaimable size when predicting the partition size on the target replica.
a7eb66d
to
01493a3
Compare
01493a3
to
dde8911
Compare
I've seen some instances of #8226 but after I added trace logging, they disappeared :/ (several 100s green runs). I guess this patch doesn't make the situation worse and we can merge it with trace logging to understand possible future instances of 8226 better. |
upgrade tests all failed because |
/ci-repeat 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code lgtm, i am worried about the test failures
all test failures are
retrying... |
ducktape was retried in job https://buildkite.com/redpanda/redpanda/builds/37703#018ad6a1-e48c-452b-ba8f-a80dab0a6c49 |
The retried failure is #9788 |
/backport v23.2.x |
Failed to create a backport PR to v23.2.x branch. I tried:
|
Backports Required
Release Notes
Improvements