-
Notifications
You must be signed in to change notification settings - Fork 593
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cloud_storage: try to quiesce uploaders before leadership transfer #8560
cloud_storage: try to quiesce uploaders before leadership transfer #8560
Conversation
@@ -613,6 +616,23 @@ partition::transfer_leadership(std::optional<model::node_id> target) { | |||
} else if (_tm_stm) { | |||
stm_prepare_lock = co_await _tm_stm->prepare_transfer_leadership(); | |||
} | |||
|
|||
std::optional<ss::deferred_action<std::function<void()>>> complete_archiver; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why the sequence is inverted? it should be possible to call that function after do_transfer_leadership
is it to guarantee that it's called in case if prepare_transfer_leadershipt
or do_transfer_leadership
throws?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it to guarantee that it's called in case if prepare_transfer_leadershipt or do_transfer_leadership throws?
Yes, exactly. complete_transfer_leadership is safe to call any time, so we set up the deferred_action
before trying to do anything, in case something throws.
8f19de2
to
e073a3c
Compare
Force push: updated the logic in ntp_archiver to only hold _uploads_active units while actually doing I/O, and drop it before going into a sleep backoff. Without this, we would see apparent failures to finish graceful leadership transfer, when really it was just an idle backoff that prevented the upload loop relinquishing _uploads_active. |
Previously these classes had transfer_leadership functions that wrapped the inner consensus::do_transfer_leadership. To make it easier to add functionality in partition::transfer leadership, change these to only do the preparatory steps they need, and then return their lock units for the caller to hold across the actual leadership transfer.
This is basically just a pause/resume feature, with a timeout on how long pause waits for completion.
Time limit on waiting for uploads to complete before a leadership transfer. If this is null, leadership transfers will proceed without waiting.
This substantially reduces the probability of leaving orphaned objects in the object store when partitions change leadership under load, e.g. during upgrades or leader balancing. This fixes a test failure that indirectly detects orphan objects by checking that topic deletion clears all objects. Fixes redpanda-data#8496
e073a3c
to
86cbfee
Compare
Tests were fully green: force pushed to update the "Timed out waiting" message from error to warn, which is it's correct severity (error was just to shake out cases where it was happening unexpectedly during testing) |
} | ||
|
||
void ntp_archiver::complete_transfer_leadership() { | ||
_paused = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: might be nice to add some debug logs here and at L1600, to rule out any potential confusion if uploads appear stuck
Failure is a |
It looks like the behavior in redpanda-data#8560 isn't always having the desired effect, but hard to see why.
It looks like the behavior in redpanda-data#8560 isn't always having the desired effect, but hard to see why.
This substantially reduces the probability of leaving orphaned
objects in the object store when partitions change leadership
under load, e.g. during upgrades or leader balancing.
This fixes a test failure that indirectly detects orphan
objects by checking that topic deletion clears all objects.
Fixes #8496
Backports Required
UX Changes
None
Release Notes
Improvements
cloud_storage_graceful_transfer_timeout_ms
is added, with a default of 5000ms. Setting this property to null disables the new behavior of waiting for uploads to complete before transferring leadership.