-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
overload: tcp connection refusal overload action #13311
overload: tcp connection refusal overload action #13311
Conversation
This will allow creating an overload action that rejects (accepts the connection and then closes it) some fraction of connections in response to overload conditions. Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
I still need to add documentation, but I wanted to get the code parts in shape first. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great overall - thanks for picking this up!
@@ -148,6 +148,14 @@ void ConnectionHandlerImpl::enableListeners() { | |||
} | |||
} | |||
|
|||
void ConnectionHandlerImpl::setListenerRejectFraction(float reject_fraction) { | |||
disable_listeners_ = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems strange to change this boolean without resuming listening.
Also if new listeners are added won't they be added listening but without a rejection fraction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this was a bad copy-paste on my part. Fixed in the latest commit, and added tests.
return true; | ||
} | ||
|
||
return random_generator.random() < |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe comment why this is overflow-safe?
* of the level of saturation. | ||
*/ | ||
bool isRandomizedActive(Random::RandomGenerator& random_generator) const { | ||
if (isInactive()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this just for performance, or do the edges not work well with the casts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was stolen from #13120, but I assume it's for performance - don't grab a random number if it's not necessary. That's why this PR is still a draft.
Signed-off-by: Alex Konradi <akonradi@google.com>
Address feedback around adding new listeners when the reject fraction is already set, and add tests. Signed-off-by: Alex Konradi <akonradi@google.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the structure looks good overall, so want to ping when you have docs and relnotes and such? It wouldn't hurt to have an integration test too. Even if we can only usefully test the case where we're rejecting 100%, we could test accepting, config reload, and rejecting.
@@ -59,7 +59,7 @@ void TcpListenerImpl::onSocketEvent(short flags) { | |||
break; | |||
} | |||
|
|||
if (rejectCxOverGlobalLimit()) { | |||
if (rejectCxOverGlobalLimit() || reject_fraction_.isRandomizedActive(random_)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once #13305 lands (which I think will before this) I thin it'd be useful to tag why the connection was closed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we can tag anything here since this happens before an Envoy connection object is created, so there's nothing to annotate.
@@ -59,7 +59,7 @@ void TcpListenerImpl::onSocketEvent(short flags) { | |||
break; | |||
} | |||
|
|||
if (rejectCxOverGlobalLimit()) { | |||
if (rejectCxOverGlobalLimit() || reject_fraction_.isRandomizedActive(random_)) { | |||
// The global connection limit has been reached. | |||
io_handle->close(); | |||
cb_.onReject(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this increments stats_.downstream_global_cx_overflow_.inc()
Think we should have a separate limit for overflow?
If not maybe at least update the docs to note the stat covers both overflow cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a separate stat.
…-refusal Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
@alyssawilk I've updated the release notes and documentation, PTAL |
Signed-off-by: Alex Konradi <akonradi@google.com>
Looks great thanks. Throwing over to Matt for !googler review but in the meantime if you're up for updating the TODOs in PR description that'd be helpful |
Thanks, updated TODOs (that I had completely forgotten about) |
Windows CI failure looks real. Can you take a look? /wait |
Relax the ordering constraints for events that occur asynchronously on different ends of the connection while maintaining requirements for relative ordering on each end. Signed-off-by: Alex Konradi <akonradi@google.com>
Relaxed some of the ordering requirements in the test for things that shouldn't necessarily be synchronized. Looks like the tests are passing now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice! LGTM with small comments. Thank you.
/wait
@@ -118,5 +124,9 @@ void TcpListenerImpl::enable() { file_event_->setEnabled(Event::FileReadyType::R | |||
|
|||
void TcpListenerImpl::disable() { file_event_->setEnabled(0); } | |||
|
|||
void TcpListenerImpl::setRejectFraction(const float reject_fraction) { | |||
reject_fraction_ = reject_fraction; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you ASSERT 0 <= reject_fraction <= 1? I think the Bernoulli function correctly handles out of bounds but might as well try to check for obvious config/code bugs. (Bonus points would be to introduce a struct wrapper for the float which would check this at point of assignment, etc. but up to you.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
stats_.downstream_global_cx_overflow_.inc(); | ||
break; | ||
case RejectCause::OverloadAction: | ||
stats_.downstream_cx_overload_reject_.inc(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please verify this stat increment in one of your tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a unit test.
Signed-off-by: Alex Konradi <akonradi@google.com>
…-refusal Signed-off-by: Alex Konradi <akonradi@google.com>
Needs main merge. /wait |
…-refusal Signed-off-by: Alex Konradi <akonradi@google.com>
Yep, I managed to add new tests in the same place both in this PR and in 13515 🤦 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
* master: (22 commits) http: using CONNECT_ERROR for HTTP/2 (envoyproxy#13519) listener: respect address.pipe.mode (it didn't work) (envoyproxy#13493) examples: Fix more deprecations/warnings in configs (envoyproxy#13529) overload: tcp connection refusal overload action (envoyproxy#13311) tcp: towards pluggable upstreams (envoyproxy#13331) conn_pool: fixing comments (envoyproxy#13520) Prevent SEGFAULT when disabling listener (envoyproxy#13515) Convert overload manager config literals to YAML (envoyproxy#13518) Fix runtime feature variable name (envoyproxy#13533) dependencies: refactor repository location schema utils, cleanups. (envoyproxy#13452) router: fix an invalid ASSERT when encoding metadata frames in the router. (envoyproxy#13511) http2: Proactively disconnect connections flooded when resetting stream (envoyproxy#13482) ci use azp to sync filter example (envoyproxy#13501) mongo_proxy: support configurable command list for metrics (envoyproxy#13494) http local rate limit: note token bucket is shared (envoyproxy#13525) wasm/extensions: Wasm extension policy. (envoyproxy#13526) http: removing envoy.reloadable_features.http1_flood_protection (envoyproxy#13508) build: update ppc64le CI build status shield (envoyproxy#13521) dependencies: enforce dependency shepherd sign-off via RepoKitteh. (envoyproxy#13522) Add no_traffic_healthy_interval (envoyproxy#13336) ... Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Commit Message: Add an overload action to reject (accept and then close) TCP connections
Additional Description:
Add an action
"envoy.overload_actions.reject_incoming_connections"
that will rejectincoming connections when Envoy is overloaded. It will randomly choose whether to
reject connections based on the overload factor, with expected behavior at the ends
of the range (reject no connections at 0, reject all at saturation).
Risk Level: low
Testing: ran unit tests
Docs Changes: documented new overload action
Release Notes: documented new overload action