Reject PP/SR HTTP requests if cross shard semaphore has been exhausted #15977
Conversation
src/v/config/configuration.cc
Outdated
, pp_sr_smp_max_non_local_requests(
    *this,
    "pp_sr_smp_max_non_local_requests",
    "Maximum number of x-core requests pending in Panda Proxy and Schema "
    "Registry seastar::smp group. (for more details look at "
    "`seastar::smp_service_group` documentation)",
    {.needs_restart = needs_restart::yes, .visibility = visibility::user},
    std::nullopt)
I don't believe we backport things if there's a new cluster config (right?). If we do want this backported, I can drop the commits that add this and submit the change in a separate PR.
Force-pushed from 47f1acc to 3359a67
Force-pushed from 3359a67 to b58bba6
Do we need the SMP resource group anymore?
I'd like @BenPope's take on this question; I think it's a valid one.
I'm not sure what the suggested alternative is?
This doesn't seem to limit the number of in-flight requests; so does it fix #14116?
I.e., would it be better to limit the number of in-flight requests by applying backpressure rather than rejecting valid requests?
The issue that we had in relation to the incident in #14116 was that new requests were just hanging out in the semaphore's wait list, but we had no idea that was happening. By checking whether the semaphore has been exhausted before initiating the cross-shard call, we can reject the new request.
I'm not sure how the behavior toward the client would be any different. How would you limit the number of in-flight requests without rejecting new ones?
It's not just new requests, though; it's requests that may be half-processed.
By not processing them until the semaphore can be obtained (or they time out), backpressure would be applied naturally. Unfortunately, we don't yet stream the body, so it's not ideal for something like a produce right now.
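For concreteness, here is a minimal sketch of the two strategies being debated, using std::counting_semaphore as a stand-in for the actual Redpanda semaphore type; the capacity, status codes, and function names are illustrative, not the PR's real code:

#include <chrono>
#include <optional>
#include <semaphore>

// Stand-in for the cross-shard in-flight limit (capacity of 10 is arbitrary).
std::counting_semaphore<> inflight{10};

// Reject strategy: fail fast with an error status when the limit is
// exhausted, so the client learns immediately instead of the request hanging.
std::optional<int> submit_or_reject() {
    if (!inflight.try_acquire()) {
        return 429; // or 500; the exact status code is part of this PR's discussion
    }
    // ... perform the cross-shard call ...
    inflight.release();
    return std::nullopt; // accepted
}

// Backpressure strategy: wait (with a timeout) for a unit to free up. The
// request is delayed rather than rejected, but as noted above it can be left
// half-processed while parked on the semaphore.
bool submit_with_backpressure() {
    using namespace std::chrono_literals;
    if (!inflight.try_acquire_for(5s)) {
        return false; // timed out
    }
    // ... perform the cross-shard call ...
    inflight.release();
    return true;
}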
If inflight semaphore is exhausted, then return a 429 error. Signed-off-by: Michael Boquard <michael@redpanda.com>
Force-pushed from b58bba6 to 5a4d0ed
Within the smp resource config there is an option:
Correct me if I'm wrong, but doesn't this configurable option solve the same issue that Mike is attempting to solve with the adjustable semaphore? EDIT: I guess I'll try to answer my own question; maybe the logical difference here is that the smp group enforces a max number of futures, whereas this PR enforces a max number of "requests".
I suspect that the problem is that cross-core requests are being made several times without "unwinding" first; a request can get stuck half-way waiting on the semaphore. Timing it out isn't ideal, as only some of the work may have been applied. Limiting requests much lower than the nonlocal limit should resolve this problem.
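For context, the Seastar knob presumably being referred to above is smp_service_group_config::max_nonlocal_requests, which caps the number of queued cross-shard futures per group. A sketch of creating and using such a group (the limit and target shard are illustrative):

#include <seastar/core/smp.hh>

// Create an smp_service_group whose queues admit at most 5000 pending
// non-local (cross-shard) requests.
seastar::future<seastar::smp_service_group> make_group() {
    seastar::smp_service_group_config cfg;
    cfg.max_nonlocal_requests = 5000;
    return seastar::create_smp_service_group(cfg);
}

// Calls made through the group count against that limit; once it is reached,
// further submit_to() calls queue up on the group's internal semaphore rather
// than being rejected, which is exactly the hang this PR is addressing.
seastar::future<int> call_other_shard(seastar::smp_service_group ssg) {
    return seastar::smp::submit_to(
      1, seastar::smp_submit_to_options(ssg), [] { return 42; });
}

The difference noted in the EDIT above holds: the smp group limits individual cross-shard futures, while the PR's semaphore limits whole HTTP requests before they ever initiate a cross-shard call.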
Looks good.
_inflight_config_binding.watch(
  [this]() { _inflight_sem.set_capacity(_inflight_config_binding()); });
_inflight_sem could be destructed at this point (I don't see another mechanism that unsubscribes the watch, or otherwise sequences the destruction order).
Isn't the destruction order dictated by the class? _inflight_config_binding is constructed after _inflight_sem, and therefore should be destructed before the semaphore?
BTW, I find relying on destructor order incredibly sneaky; it usually deserves a comment somewhere.
Isn't the destruction order dictated by the class? _inflight_config_binding is constructed after _inflight_sem, and therefore should be destructed before the semaphore?
You're correct.
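The C++ rule at work here: non-static data members are destroyed in reverse order of declaration, so a binding declared after the semaphore is torn down first, unsubscribing the watch before the semaphore dies. A standalone illustration (the class and member names below are hypothetical):

#include <iostream>

struct tracer {
    const char* name;
    explicit tracer(const char* n) : name(n) {}
    ~tracer() { std::cout << "destroying " << name << '\n'; }
};

struct handler {
    // Declaration order matters: members are destroyed bottom-up, so the
    // binding (which references the semaphore from its callback) goes first.
    tracer _inflight_sem{"_inflight_sem"};
    tracer _inflight_config_binding{"_inflight_config_binding"};
};

int main() {
    handler h;
    // Prints:
    //   destroying _inflight_config_binding
    //   destroying _inflight_sem
}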
lgtm
Why is this considered "not a bug fix" and not backported?
Fixes: #14116
Cross shard calls in SR/PP utilize an SMP resource group to limit the number of cross shard calls that are made. However, if the semaphore has been fully consumed, the call simply hangs. Customers have experienced issues where, under heavy load, SR appears to hang and not respond. Now, instead, once the semaphore has been completely exhausted, new HTTP requests that attempt cross shard calls will be rejected with a 500 error.
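Putting the pieces from the review together, a rough sketch of the fix's shape; the stand-in semaphore type and handler glue below are illustrative, not the PR's actual code:

#include <cstdint>

// Minimal stand-in for an adjustable semaphore whose capacity can be resized
// at runtime by a cluster-config watch (compare the set_capacity() call in
// the snippet quoted in the review above).
class adjustable_semaphore_sketch {
    int64_t _capacity;
    int64_t _in_use{0};
public:
    explicit adjustable_semaphore_sketch(int64_t cap) : _capacity(cap) {}
    void set_capacity(int64_t cap) { _capacity = cap; }
    bool try_acquire() {
        if (_in_use >= _capacity) {
            return false; // exhausted: caller should reject the request
        }
        ++_in_use;
        return true;
    }
    void release() { --_in_use; }
};

// Handler-side shape of the fix: check the semaphore before initiating the
// cross-shard call and reject outright when it is exhausted, instead of
// letting the request hang in the smp group's wait list.
int handle_request(adjustable_semaphore_sketch& inflight_sem) {
    if (!inflight_sem.try_acquire()) {
        return 500; // a later commit in this PR changes this to a 429
    }
    // ... submit_to() the owning shard and serve the request ...
    inflight_sem.release();
    return 200;
}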
Backports Required
Release Notes
Improvements