partition_balancer: futurize ticks to avoid reactor stalls with large number of topics #11183
Conversation
Force-pushed from b8d558e to 6a62042.
/ci-repeat 2
Repeat runs had a failure in the following test that is interesting.
That seems like a bug in node_status_backend's RPC connection cache: the backend uses its own connection cache rather than the application-level cache, and it doesn't invalidate connections to dead brokers. The test restarts brokers (it picks a random action, one of which is a restart), and from that point on the transport to the restarted broker is broken and never replaced, so the balancer eventually considers those brokers unavailable. This is a serious bug; will fix in a separate patch.
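To make the cache bug concrete, here is a minimal, self-contained sketch (all names are invented stand-ins, not redpanda's actual types): a per-subsystem connection cache that keeps returning a dead transport after a broker restart, plus the missing invalidation hook that would let the next lookup reconnect.

```cpp
#include <cassert>
#include <map>
#include <memory>

// Hypothetical stand-ins for illustration only.
using node_id = int;

struct transport {
    bool broken = false; // set when the peer restarts and the stream dies
};

// A minimal cache like the one described above. The bug: entries are never
// dropped when a broker restarts, so every later RPC reuses the dead
// transport. The fix sketched here invalidates the cached entry on a
// node-down notification so the next lookup reconnects.
class connection_cache {
    std::map<node_id, std::shared_ptr<transport>> _cache;

public:
    std::shared_ptr<transport> get_or_connect(node_id id) {
        auto it = _cache.find(id);
        if (it != _cache.end()) {
            return it->second; // buggy path: returned even if broken
        }
        auto t = std::make_shared<transport>();
        _cache.emplace(id, t);
        return t;
    }

    // The missing piece: drop the transport when the broker goes away.
    void invalidate(node_id id) { _cache.erase(id); }
};
```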
Force-pushed from 6ef858f to c054e13.
Force-pushed from c86edb9 to 45686ac.
for (auto it = topics.topics_iterator_begin();
     it != topics.topics_iterator_end();
     ++it) {
    for (const auto& a : it->second.get_assignments()) {
I think the topic table revision check on operator dereference is insufficient here, because it will happen only once at the beginning of the loop, and we need to check after each yield (due to create_partition_cmd).
This adaptor guards against iterator invalidations across scheduling points by taking a user-provided revision function as input. The expectation is that the revision function returns a new revision any time the underlying container changes in a way that can cause iterator invalidation.
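A minimal sketch of such an adaptor, with invented names (the real redpanda type is not shown here): the wrapper snapshots the revision at construction and re-checks it on every dereference, so a mutation across a scheduling point is detected instead of silently using a possibly-invalidated iterator.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <stdexcept>
#include <utility>

// Illustrative revision-guarded iterator adaptor. The revision function must
// return a new value whenever the underlying container changes in a way that
// can invalidate iterators.
template <typename Iter>
class revision_checked_iterator {
    Iter _it;
    std::function<long()> _revision;
    long _snapshot;

public:
    revision_checked_iterator(Iter it, std::function<long()> rev)
      : _it(it)
      , _revision(std::move(rev))
      , _snapshot(_revision()) {}

    auto& operator*() {
        // Checked on every dereference, so the check also fires after each
        // yield inside the loop body, not just once at the start.
        if (_revision() != _snapshot) {
            throw std::runtime_error("concurrent modification detected");
        }
        return *_it;
    }

    revision_checked_iterator& operator++() {
        ++_it;
        return *this;
    }

    bool operator!=(const Iter& other) const { return _it != other; }
};
```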
I had to rebase / force-push due to conflicts, waiting for CI.
Mostly mechanical change that yields control to avoid stalls.
... to make the planner tick abortable.
Aborts the balancer tick on node state changes.
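The three commits above boil down to one pattern; here is a plain-C++ analogue of it (the real code uses seastar futures and yields to the reactor, e.g. via a maybe-yield primitive, and checks an abort source; below a callback stands in for the yield and an atomic flag for the abort source, and all names are illustrative, not redpanda's actual API).

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Thrown when the tick is cancelled mid-flight (e.g. on a node state change).
struct tick_aborted {};

// Walk the topic set, yielding on every iteration so a large topic count
// cannot stall the reactor, and aborting promptly when requested instead of
// running the tick to completion on stale state.
template <typename Yield>
std::size_t planner_tick(
  const std::vector<int>& topics,
  const std::atomic<bool>& abort_requested,
  Yield yield_to_reactor) {
    std::size_t processed = 0;
    for (int t : topics) {
        (void)t; // per-topic planning work would go here
        yield_to_reactor();
        if (abort_requested.load()) {
            throw tick_aborted{};
        }
        ++processed;
    }
    return processed;
}
```

The design point is that both the yield and the abort check sit inside the per-topic loop, which is exactly why the revision-checked iteration above is needed: every yield is a point where the topic table may change underneath the tick.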
Failures: (one new, seems unrelated)
Fixes https://github.com/redpanda-data/core-internal/issues/550
Backports Required
Release Notes