Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CORE-5749 retry on dns failure #22327

Conversation

michael-redpanda
Copy link
Contributor

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.2.x
  • v24.1.x
  • v23.3.x

Release Notes

Bug Fixes

  • Kafka client used by audit system will now properly handle DNS lookup failure errors

During a rolling upgrade in cloud, it was observed that RP's kafka
client would attempt to connect to the 'reserve' node after it was
decomissioned.  This was because the error code (C-Ares ENOTFOUND) was
not treated as a retriable error.  This change checks for the above
error code when attempting to connect to a broker and if it is
encountered, treats it as a retriable error.

Signed-off-by: Michael Boquard <michael@redpanda.com>
Signed-off-by: Michael Boquard <michael@redpanda.com>
@michael-redpanda michael-redpanda requested a review from a team July 25, 2024 15:46
@michael-redpanda michael-redpanda self-assigned this Jul 25, 2024
@michael-redpanda michael-redpanda requested review from pgellert and removed request for a team July 25, 2024 15:46
@michael-redpanda michael-redpanda changed the title Core 5749 retry on dns failure CORE-5749 retry on dns failure Jul 25, 2024
@michael-redpanda michael-redpanda requested a review from BenPope July 25, 2024 15:46
Comment on lines +46 to +47
e.code().category() == std::system_category()
|| e.code().category() == std::generic_category()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i've been reading the cppreference pages, and i still don't quite understand what the difference is between these... oh well!

Copy link
Member

@BenPope BenPope Jul 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "correct" way is not to inspect the category, but to compare with the error enum:

    if (e.code().category() == ss::tls::error_category()) {
        return absl::c_any_of(
          ss_tls_reconnect_errors, [v](int ec) { return v == ec; });
    } else if (
      e.code() == std::errc::connection_refused
      || e.code() == std::errc::host_unreachable
      || e.code() == std::errc::timed_out
      ...

operator== provides a mapping between std::system_category and std::generic_category, I can't remember which is which, but on the platforms we care about , one is a subset of the other, and that subset has the same numbers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool thx

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Jul 25, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/52044#0190eaee-0738-4c73-b8e7-0cd4b5123a4b:

"rptest.tests.availability_test.AvailabilityTests.test_recovery_after_catastrophic_failure"

new failures in https://buildkite.com/redpanda/redpanda/builds/52044#0190eaed-614b-4046-a895-eb78caf7c03d:

"rptest.tests.availability_test.AvailabilityTests.test_recovery_after_catastrophic_failure"

new failures in https://buildkite.com/redpanda/redpanda/builds/52044#0190eb44-8f4b-45f1-8718-935118a82900:

"rptest.tests.availability_test.AvailabilityTests.test_recovery_after_catastrophic_failure"

@vbotbuildovich
Copy link
Collaborator

@michael-redpanda
Copy link
Contributor Author

michael-redpanda commented Jul 26, 2024

CI Failure:

@michael-redpanda michael-redpanda merged commit 7b0def8 into redpanda-data:dev Jul 26, 2024
18 of 21 checks passed
@vbotbuildovich
Copy link
Collaborator

/backport v24.2.x

@vbotbuildovich
Copy link
Collaborator

/backport v24.1.x

@vbotbuildovich
Copy link
Collaborator

/backport v23.3.x

@vbotbuildovich
Copy link
Collaborator

Failed to create a backport PR to v24.1.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-22327-v24.1.x-323 remotes/upstream/v24.1.x
git cherry-pick -x d79ab902239e382646bf8a81ccb14af56b71d524 da09267f75100b3f00d44df8fd9f130818982886

Workflow run logs.

@vbotbuildovich
Copy link
Collaborator

Failed to create a backport PR to v23.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-22327-v23.3.x-626 remotes/upstream/v23.3.x
git cherry-pick -x d79ab902239e382646bf8a81ccb14af56b71d524 da09267f75100b3f00d44df8fd9f130818982886

Workflow run logs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants