Allow partition balancing in presence of moving partitions #10724

Merged
merged 15 commits into from
May 22, 2023

ztlpn
Contributor

@ztlpn ztlpn commented May 12, 2023

  • Add a unified partition interface for use in balancing/repair operators. Each partition variant permits only the actions that make sense for its state: e.g. a moving partition can only be cancelled, and a partition that is already being cancelled can only be observed.
  • Allow partition balancer planner operation in the presence of already scheduled moves.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.1.x
  • v22.3.x
  • v22.2.x

Release Notes

Improvements

  • Partition balancing now schedules new moves without waiting for the previous batch to finish.

@ztlpn
Contributor Author

ztlpn commented May 16, 2023

/ci-repeat 2
release
skip-units
dt-repeat=10
tests/rptest/tests/partition_balancer_test.py::PartitionBalancerTest

@mmaslankaprv
Member

Would it be possible to add a commit comment on how moving partitions are handled?

ztlpn added 4 commits May 18, 2023 17:56
If we are going to allow balancer operation in presence of moving
partitions, we must take into account their contribution to
assigned/released disk sizes: moving partitions will add some disk space
to new replicas and release some disk space on old ones; and cancelling
partitions will release some disk space on target replicas.
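The accounting rule in this commit message can be illustrated with a minimal Python sketch. It is not the Redpanda implementation (which is C++); `InProgressMove`, `disk_deltas`, and all field names are hypothetical, and the sketch assumes every replica of a partition occupies the same number of bytes.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class InProgressMove:
    """Hypothetical record of one in-flight partition reassignment."""
    size_bytes: int
    previous_replicas: frozenset  # node ids before the move started
    target_replicas: frozenset    # node ids the move is heading to
    cancelling: bool = False      # True if the move is being cancelled


def disk_deltas(moves):
    """Return (assigned, released) bytes per node implied by in-flight moves."""
    assigned = defaultdict(int)
    released = defaultdict(int)
    for m in moves:
        added = m.target_replicas - m.previous_replicas
        removed = m.previous_replicas - m.target_replicas
        if m.cancelling:
            # A cancelled move reverts to the previous replica set, so space
            # taken on the new target replicas will be released.
            for node in added:
                released[node] += m.size_bytes
        else:
            # A normal move adds space on new replicas and frees it on old ones.
            for node in added:
                assigned[node] += m.size_bytes
            for node in removed:
                released[node] += m.size_bytes
    return dict(assigned), dict(released)


moves = [
    InProgressMove(100, frozenset({1, 2, 3}), frozenset({2, 3, 4})),
    InProgressMove(50, frozenset({1, 2, 3}), frozenset({1, 2, 5}), cancelling=True),
]
assigned, released = disk_deltas(moves)
```

Here node 4 gains 100 bytes, node 1 releases 100 bytes, and node 5 (the target of the cancelled move) releases 50 bytes.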
Split the partition class into 3 variants, depending on which actions we
can do to it:
* reassignable_partition - partition is not in progress, we can move
  replicas
* moving_partition - partition movement is in progress, we can cancel it
* immutable_partition - we can do nothing except report failure
Our handling of moving partitions is pretty conservative - we don't
touch them, except when we are cancelling movements to unavailable nodes
(which are unlikely to ever finish, so this is required to make
progress).
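A rough Python sketch of the three-variant idea described above, where each type exposes only the actions that are valid for it. The class names mirror the commit message, but the fields, method names, and the `plan_step` helper are hypothetical, and the check for "movements to unavailable nodes" is an assumption about what that phrase means.

```python
from dataclasses import dataclass


@dataclass
class ReassignablePartition:
    """No move in progress: replicas can be moved."""
    ntp: str
    replicas: set

    def move_replica(self, src, dst):
        self.replicas = (self.replicas - {src}) | {dst}


@dataclass
class MovingPartition:
    """Move in progress: the only available action is cancellation."""
    ntp: str
    previous_replicas: set
    target_replicas: set
    cancel_requested: bool = False

    def request_cancel(self):
        self.cancel_requested = True


@dataclass
class ImmutablePartition:
    """Nothing can be done; we can only report failure."""
    ntp: str
    reason: str

    def report_failure(self):
        return f"{self.ntp}: cannot be changed ({self.reason})"


def plan_step(partition, unavailable_nodes):
    """Conservative handling: moving partitions are left alone unless the
    move targets an unavailable node (assumed: a newly added replica)."""
    if isinstance(partition, MovingPartition):
        added = partition.target_replicas - partition.previous_replicas
        if added & unavailable_nodes:
            partition.request_cancel()
            return "cancel"
        return "skip"
    if isinstance(partition, ImmutablePartition):
        return "report_failure"
    return "reassignable"


stuck = MovingPartition("kafka/topic/0", {1, 2, 3}, {2, 3, 4})
action = plan_step(stuck, unavailable_nodes={4})
```

Because `MovingPartition` has no `move_replica` method, a planner written against these types cannot accidentally reassign a partition that is already in flight.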
@ztlpn ztlpn force-pushed the pb-moving-partitions branch from daf5dac to ef07885 Compare May 18, 2023 15:03
@ztlpn ztlpn requested a review from mmaslankaprv May 18, 2023 15:23
@ztlpn
Contributor Author

ztlpn commented May 18, 2023

PartitionBalancerTest.test_fuzz_admin_ops failed suspiciously, but it looks like this is just an instance of #9315: partition movements are being scheduled, they are just slow in debug mode.

Contributor

@bharathv bharathv left a comment

Did one full pass after today's discussion. LGTM.

@ztlpn
Contributor Author

ztlpn commented May 19, 2023

/ci-repeat 2
debug
skip-units
dt-repeat=10
tests/rptest/tests/partition_balancer_test.py::PartitionBalancerTest

@ztlpn
Contributor Author

ztlpn commented May 21, 2023

/ci-repeat 2
release
skip-units
dt-repeat=25
tests/rptest/tests/partition_balancer_test.py

For some reason (possibly linked to this PR) the rate of partition
balancer test failures has recently increased quite a bit. This doesn't
look like a bug: partition movements simply can't proceed because RPC
append_entries calls are too slow. To unblock the PR, skip all partition
balancing tests in debug mode.
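As a generic illustration of skipping tests in debug mode, the sketch below wraps a test function in a decorator that turns it into a no-op for debug builds. This is not the mechanism the test suite actually uses: the decorator name `skip_in_debug_mode` and the `BUILD_TYPE` environment variable are both hypothetical.

```python
import functools
import os


def skip_in_debug_mode(test_fn):
    """Skip the wrapped test when the (hypothetical) BUILD_TYPE is 'debug'."""
    @functools.wraps(test_fn)
    def wrapper(*args, **kwargs):
        # BUILD_TYPE is an assumed variable name, defaulting to 'release'.
        if os.environ.get("BUILD_TYPE", "release") == "debug":
            return None  # test body is not executed in debug builds
        return test_fn(*args, **kwargs)
    return wrapper


@skip_in_debug_mode
def test_partition_moves():
    # Placeholder body standing in for a slow partition-movement test.
    return "ran"
```

With `BUILD_TYPE=debug` set, calling `test_partition_moves()` returns `None` without running the body; otherwise it runs normally.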
@ztlpn
Contributor Author

ztlpn commented May 22, 2023

So I ran all partition balancer tests 50 times, and they only failed once (#9788), which looks pretty good in my opinion.

To unblock this PR (and lower the number of CI failures in general), I propose to disable running partition balancer tests (and possibly other tests that move a lot of partitions) in debug mode. Some of them are already disabled anyway.

@ztlpn ztlpn requested a review from bharathv May 22, 2023 09:52
@ztlpn ztlpn merged commit e8961dc into redpanda-data:dev May 22, 2023
@ztlpn ztlpn deleted the pb-moving-partitions branch November 27, 2023 13:54
3 participants