Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cloud_storage: Fix offset for leader epoch request for read-replica #14403

Merged
merged 2 commits into from
Oct 25, 2023

Conversation

Lazin
Copy link
Contributor

@Lazin Lazin commented Oct 24, 2023

Read-replica uses local Raft group as a source of information for OffsetForLeaderEpoch request. This PR fixes this problem and adds a reproducer in form of ducktape test.

Fixes #14402

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.2.x
  • v23.1.x
  • v22.3.x

Release Notes

Bug Fixes

  • Fix OffsetForLeaderEpoch for read-replicas

@Lazin Lazin requested review from andrwng and VladLazar October 24, 2023 18:01
@Lazin Lazin self-assigned this Oct 24, 2023
@Lazin Lazin force-pushed the pr/rrr-offset-for-leader-epoch branch from 8203a82 to 7a8d871 Compare October 24, 2023 18:09
_partition->is_remote_fetch_enabled()
&& _partition->cloud_data_available()) {
is_read_replica
|| (_partition->is_remote_fetch_enabled() && _partition->cloud_data_available())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in the case is_read_replica && !cloud_data_available(), we should explicitly return nullopt. At least it seems like if the cloud partition is empty, we could hit some asserts:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In test if the request is performed before we sync it returns -1, but I added the fix just in case

@piyushredpanda piyushredpanda added this to the v23.2.13 milestone Oct 24, 2023
RRR partition can return incorrect offset for epoch because it uses
information from local Raft group. This commit fixes the issue by always
using data from the cloud storage if the topic is a read-replica.
@Lazin Lazin force-pushed the pr/rrr-offset-for-leader-epoch branch from 7a8d871 to 523bb54 Compare October 25, 2023 08:08
Copy link
Contributor

@VladLazar VladLazar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me. I'd maybe do a ci-repeat of the new test only to surface any potential flakiness.

Comment on lines +91 to +94
wait_until(has_leader,
timeout_sec=60,
backoff_sec=10,
err_msg="Failed to create a read-replica, no leadership")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could use Admin.await_stable_leader

@Lazin
Copy link
Contributor Author

Lazin commented Oct 25, 2023

I was running the test locally with --repeat=100 and got no failures.

@Lazin Lazin merged commit 6de829e into redpanda-data:dev Oct 25, 2023
9 checks passed
@vbotbuildovich
Copy link
Collaborator

/backport v23.2.x

@vbotbuildovich
Copy link
Collaborator

/backport v23.1.x

@vbotbuildovich
Copy link
Collaborator

Failed to create a backport PR to v23.1.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-14403-v23.1.x-454 remotes/upstream/v23.1.x
git cherry-pick -x 209f170e9c935552b3114f490ec0432f938e8ce9 523bb5435c1d1c3ccc3aa2ca0b2aa2296c451b09

Workflow run logs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix offset for leader epoch handling in RRR
5 participants