Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cloud_storage: try to quiesce uploaders before leadership transfer #8560

Merged
merged 5 commits into from
Feb 16, 2023

Conversation

jcsp
Copy link
Contributor

@jcsp jcsp commented Feb 1, 2023

This substantially reduces the probability of leaving orphaned
objects in the object store when partitions change leadership
under load, e.g. during upgrades or leader balancing.

This fixes a test failure that indirectly detects orphan
objects by checking that topic deletion clears all objects.

Fixes #8496

Backports Required

  • none - not a bug fix
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v22.3.x
  • v22.2.x
  • v22.1.x

UX Changes

None

Release Notes

Improvements

  • Partition leadership transfers now wait for tiered storage uploads to finish, resulting in a lower probability of orphan objects in the object storage bucket. These objects are not a data integrity issue, but could result in a small amount of extra space used.
  • Cluster configuration property cloud_storage_graceful_transfer_timeout_ms is added, with a default of 5000ms. Setting this property to null disables the new behavior of waiting for uploads to complete before transferring leadership.

@jcsp jcsp requested review from Lazin and andrwng February 1, 2023 19:27
src/v/cluster/partition.cc Outdated Show resolved Hide resolved
src/v/archival/ntp_archiver_service.cc Show resolved Hide resolved
src/v/archival/ntp_archiver_service.cc Show resolved Hide resolved
@@ -613,6 +616,23 @@ partition::transfer_leadership(std::optional<model::node_id> target) {
} else if (_tm_stm) {
stm_prepare_lock = co_await _tm_stm->prepare_transfer_leadership();
}

std::optional<ss::deferred_action<std::function<void()>>> complete_archiver;
Copy link
Contributor

@Lazin Lazin Feb 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the sequence is inverted? it should be possible to call that function after do_transfer_leadership
is it to guarantee that it's called in case if prepare_transfer_leadershipt or do_transfer_leadership throws?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it to guarantee that it's called in case if prepare_transfer_leadershipt or do_transfer_leadership throws?

Yes, exactly. complete_transfer_leadership is safe to call any time, so we set up the deferred_action before trying to do anything, in case something throws.

@jcsp jcsp force-pushed the cloud-storage-leadership-transfer branch from 8f19de2 to e073a3c Compare February 15, 2023 15:20
@jcsp
Copy link
Contributor Author

jcsp commented Feb 15, 2023

Force push: updated the logic in ntp_archiver to only hold _uploads_active units while actually doing I/O, and drop it before going into a sleep backoff. Without this, we would see apparent failures to finish graceful leadership transfer, when really it was just an idle backoff that prevented the upload loop relinquishing _uploads_active.

jcsp added 5 commits February 15, 2023 17:57
Previously these classes had transfer_leadership functions that
wrapped the inner consensus::do_transfer_leadership.  To make it
easier to add functionality in partition::transfer leadership, change
these to only do the preparatory steps they need, and then return
their lock units for the caller to hold across the actual
leadership transfer.
This is basically just a pause/resume feature, with
a timeout on how long pause waits for completion.
Time limit on waiting for uploads to complete before a leadership
transfer.  If this is null, leadership transfers will proceed without
waiting.
This substantially reduces the probability of leaving orphaned
objects in the object store when partitions change leadership
under load, e.g. during upgrades or leader balancing.

This fixes a test failure that indirectly detects orphan
objects by checking that topic deletion clears all objects.

Fixes redpanda-data#8496
@jcsp jcsp force-pushed the cloud-storage-leadership-transfer branch from e073a3c to 86cbfee Compare February 15, 2023 17:57
@jcsp
Copy link
Contributor Author

jcsp commented Feb 15, 2023

Tests were fully green: force pushed to update the "Timed out waiting" message from error to warn, which is it's correct severity (error was just to shake out cases where it was happening unexpectedly during testing)

@jcsp jcsp requested review from andrwng and Lazin February 15, 2023 17:59
}

void ntp_archiver::complete_transfer_leadership() {
_paused = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: might be nice to add some debug logs here and at L1600, to rule out any potential confusion if uploads appear stuck

@jcsp
Copy link
Contributor Author

jcsp commented Feb 16, 2023

@jcsp jcsp merged commit 8ae0e3a into redpanda-data:dev Feb 16, 2023
@jcsp jcsp deleted the cloud-storage-leadership-transfer branch February 16, 2023 11:39
jcsp added a commit to jcsp/redpanda that referenced this pull request Feb 16, 2023
It looks like the behavior in
redpanda-data#8560 isn't
always having the desired effect, but hard to see why.
jcsp added a commit to jcsp/redpanda that referenced this pull request Feb 17, 2023
It looks like the behavior in
redpanda-data#8560 isn't
always having the desired effect, but hard to see why.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants