cloud_storage: try to quiesce uploaders before leadership transfer #8560

jcsp · 2023-02-01T19:27:10Z

This substantially reduces the probability of leaving orphaned
objects in the object store when partitions change leadership
under load, e.g. during upgrades or leader balancing.

This fixes a test failure that indirectly detects orphan
objects by checking that topic deletion clears all objects.

Fixes #8496

Backports Required

UX Changes

None

Release Notes

Improvements

Partition leadership transfers now wait for tiered storage uploads to finish, resulting in a lower probability of orphan objects in the object storage bucket. These objects are not a data integrity issue, but could result in a small amount of extra space used.
Cluster configuration property cloud_storage_graceful_transfer_timeout_ms is added, with a default of 5000ms. Setting this property to null disables the new behavior of waiting for uploads to complete before transferring leadership.
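
A minimal sketch of how a nullable timeout like this could gate the wait, assuming the property is modelled as a `std::optional` of milliseconds. The names `transfer_decision` and `decide_transfer` are hypothetical and are not taken from the Redpanda source.

```cpp
#include <cassert>
#include <chrono>
#include <optional>

using std::chrono::milliseconds;

// Hypothetical illustration only: not the Redpanda implementation.
struct transfer_decision {
    bool wait_for_uploads; // whether to quiesce uploads before transferring
    milliseconds max_wait; // upper bound on the wait (zero when not waiting)
};

// std::nullopt stands in for setting the property to null.
transfer_decision decide_transfer(std::optional<milliseconds> graceful_timeout) {
    if (!graceful_timeout.has_value()) {
        // Property is null: keep the old behavior and transfer immediately.
        return {false, milliseconds{0}};
    }
    return {true, *graceful_timeout};
}
```

Calling `decide_transfer(milliseconds{5000})` models the default, and `decide_transfer(std::nullopt)` models the disabled case.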

src/v/cluster/partition.cc

src/v/archival/ntp_archiver_service.cc

Lazin · 2023-02-02T11:03:50Z

src/v/cluster/partition.cc

@@ -613,6 +616,23 @@ partition::transfer_leadership(std::optional<model::node_id> target) {
    } else if (_tm_stm) {
        stm_prepare_lock = co_await _tm_stm->prepare_transfer_leadership();
    }
+
+    std::optional<ss::deferred_action<std::function<void()>>> complete_archiver;


Why is the sequence inverted? It should be possible to call that function after do_transfer_leadership.
Is it to guarantee that it's called in case prepare_transfer_leadership or do_transfer_leadership throws?

Is it to guarantee that it's called in case prepare_transfer_leadership or do_transfer_leadership throws?

Yes, exactly. complete_transfer_leadership is safe to call any time, so we set up the deferred_action before trying to do anything, in case something throws.
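
The ordering described in this reply can be sketched in standard C++. This is an illustration under assumptions, not the PR's code: `deferred` is a simplified stand-in for `ss::deferred_action`, and `run_transfer` with its string log is hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Simplified scope guard (stand-in for ss::deferred_action): runs the
// callback when it goes out of scope, including during stack unwinding.
class deferred {
public:
    explicit deferred(std::function<void()> fn) : _fn(std::move(fn)) {}
    deferred(const deferred&) = delete;
    deferred& operator=(const deferred&) = delete;
    ~deferred() { if (_fn) { _fn(); } }
private:
    std::function<void()> _fn;
};

// Hypothetical transfer routine that records the order of its steps.
std::vector<std::string> run_transfer(bool transfer_throws) {
    std::vector<std::string> log;
    try {
        // Arm the completion hook before doing anything that may throw.
        deferred complete([&log] { log.push_back("complete"); });
        log.push_back("prepare");
        if (transfer_throws) {
            throw std::runtime_error("transfer failed");
        }
        log.push_back("transfer");
    } catch (const std::exception&) {
        log.push_back("caught");
    }
    return log;
}
```

In this sketch the "complete" step is recorded on both paths: after "transfer" on success, and during unwinding (before the handler runs) when the transfer throws.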

jcsp · 2023-02-15T15:26:08Z

Force push: updated the logic in ntp_archiver to only hold _uploads_active units while actually doing I/O, and drop them before going into a sleep backoff. Without this, we would see apparent failures to finish graceful leadership transfer, when really it was just an idle backoff that prevented the upload loop from relinquishing _uploads_active.
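
The effect of this change can be illustrated with a small single-threaded model. It is a sketch under assumptions, not the ntp_archiver code: time advances in discrete ticks, and `upload_loop_model` and `pause_with_timeout` are hypothetical names.

```cpp
#include <cassert>

// Hypothetical model of an upload loop that alternates I/O and idle backoff.
class upload_loop_model {
    enum class phase { io, backoff, parked };

public:
    upload_loop_model(bool hold_units_during_backoff, int io_ticks, int backoff_ticks)
      : _hold_during_backoff(hold_units_during_backoff)
      , _io_ticks(io_ticks)
      , _backoff_ticks(backoff_ticks)
      , _remaining(io_ticks) {}

    void pause() { _paused = true; }
    void resume() { _paused = false; }

    // True while the loop holds its "uploads active" units.
    bool units_held() const {
        if (_phase == phase::io) { return true; }
        if (_phase == phase::backoff) { return _hold_during_backoff; }
        return false; // parked: nothing is held
    }

    // Advance simulated time by one tick.
    void tick() {
        if (_phase == phase::parked) {
            if (!_paused) { _phase = phase::io; _remaining = _io_ticks; }
            return;
        }
        if (--_remaining > 0) { return; }
        if (_phase == phase::io) {
            _phase = phase::backoff; // I/O finished, go idle
            _remaining = _backoff_ticks;
        } else if (_paused) {
            _phase = phase::parked; // pause is only noticed after the backoff
        } else {
            _phase = phase::io;
            _remaining = _io_ticks;
        }
    }

private:
    bool _hold_during_backoff; // true models the behavior before the force push
    int _io_ticks;
    int _backoff_ticks;
    int _remaining;
    phase _phase = phase::io;
    bool _paused = false;
};

// Pause the loop and wait up to timeout_ticks for it to release its units.
bool pause_with_timeout(upload_loop_model& loop, int timeout_ticks) {
    loop.pause();
    for (int waited = 0; waited < timeout_ticks; ++waited) {
        if (!loop.units_held()) { return true; }
        loop.tick();
    }
    return !loop.units_held();
}
```

With a 2-tick upload, a 10-tick backoff and a 5-tick timeout, the model that releases its units before backing off pauses in time, while the model that holds them across the backoff times out even though no I/O is in progress.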

Previously these classes had transfer_leadership functions that wrapped the inner consensus::do_transfer_leadership. To make it easier to add functionality in partition::transfer_leadership, change these to only do the preparatory steps they need, and then return their lock units for the caller to hold across the actual leadership transfer.


jcsp · 2023-02-15T17:58:52Z

Tests were fully green: force pushed to update the "Timed out waiting" message from error to warn, which is its correct severity (error was just to shake out cases where it was happening unexpectedly during testing).

andrwng · 2023-02-15T18:51:51Z

src/v/archival/ntp_archiver_service.cc

+}
+
+void ntp_archiver::complete_transfer_leadership() {
+    _paused = false;


nit: might be nice to add some debug logs here and at L1600, to rule out any potential confusion if uploads appear stuck

jcsp · 2023-02-16T11:39:14Z

Failure is a test_node_operations case:

CI Failure (Assertion `_state && !_state->available()' failed) in `RandomNodeOperationsTest.test_node_operations` #8919


jcsp requested review from Lazin and andrwng February 1, 2023 19:27

github-actions bot added the area/redpanda label Feb 1, 2023

andrwng reviewed Feb 1, 2023


Lazin reviewed Feb 2, 2023


jcsp force-pushed the cloud-storage-leadership-transfer branch from 8f19de2 to e073a3c Compare February 15, 2023 15:20

jcsp added 5 commits February 15, 2023 17:57

cluster: refactor partition::transfer_leadership to .cc

b9fe43f

archival: add leadership transfer hooks to ntp_archiver

9d611ea

This is basically just a pause/resume feature, with a timeout on how long pause waits for completion.

config: add cloud_storage_graceful_transfer_timeout

b5d9395

Time limit on waiting for uploads to complete before a leadership transfer. If this is null, leadership transfers will proceed without waiting.

jcsp force-pushed the cloud-storage-leadership-transfer branch from e073a3c to 86cbfee Compare February 15, 2023 17:57

jcsp requested review from andrwng and Lazin February 15, 2023 17:59

andrwng approved these changes Feb 15, 2023


jcsp merged commit 8ae0e3a into redpanda-data:dev Feb 16, 2023

jcsp deleted the cloud-storage-leadership-transfer branch February 16, 2023 11:39

jcsp added a commit to jcsp/redpanda that referenced this pull request Feb 16, 2023

cluster: debug logging around archival leadership transfer

875220d

It looks like the behavior in redpanda-data#8560 isn't always having the desired effect, but hard to see why.

jcsp mentioned this pull request Feb 16, 2023

cluster: fix leader balancer using low level leadership transfer interface #8941

Merged

7 tasks

jcsp added a commit to jcsp/redpanda that referenced this pull request Feb 17, 2023

cluster: debug logging around archival leadership transfer

97f70a7

It looks like the behavior in redpanda-data#8560 isn't always having the desired effect, but hard to see why.

jcsp mentioned this pull request Feb 23, 2023

cloud_storage: bucket scrub #9072

Open
