-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add rps_threshold & max_rejection_probability #16742
Conversation
Signed-off-by: gaoweiwen <whereisgww@outlook.com>
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to |
I'm sorry that this PR came a little late since the last discussion in the related issue(#16392) , because some other things have filled up my time recently. Anyway, it's ready to be reviewed. |
Signed-off-by: gaoweiwen <whereisgww@outlook.com>
Assigning over to @tonya11en for first pass. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Included some high-level comments on the approach to the traffic threshold.
// If the number of requests in the last second is less than this threshold, the request | ||
// will not be rejected, even if the success rate is lower than sr_threshold. | ||
// Defaults to 0. | ||
config.core.v3.RuntimeUInt32 rps_threshold = 6; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this should only consider the last 1 second. It would make more sense to look at something like the minimum number of samples in the sampling window. It would still accomplish the same thing, but give better behavior for bursty request patterns.
Would that work for your particular use-case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be more clear, you can still represent what I'm suggesting as an RPS threshold. It would just be an "average RPS" threshold across the entirety of the sampling window.
P_{reject} = \left\{ | ||
\begin{array}{cl} | ||
0 & \ (rps < rps\_threshold) \\ | ||
min({(\frac{n_{total} - s}{n_{total} + 1})}^\frac{1}{aggression}\ ,\ max\_reject\_probability) & \ (rps \geq rps\_threshold) | ||
\end{array} | ||
\right. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you include a screenshot of this rendered in the docs?
I also don't think it's necessary to change the function here as you're introducing circumstances under which this function applies, not changes to the behavior of the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your comment reminds me that this does make the formula more complicated. It may be more appropriate to explain these two configuration items after the original formula.
/wait |
Signed-off-by: gaoweiwen <whereisgww@outlook.com>
Changed LastSampleCounts to averageRps and updated the documentation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this. Tests are very thorough!
The approach makes sense, so there's just a few minor comments.
@@ -41,6 +41,11 @@ where, | |||
rejection probability will be higher for higher success rates. See `Aggression`_ for a more | |||
detailed explanation. | |||
|
|||
In addition, there are two more configuration item: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition, there are two more configuration item: | |
Note that there are additional parameters that affect the rejection probability: |
@@ -68,6 +87,10 @@ Http::FilterHeadersStatus AdmissionControlFilter::decodeHeaders(Http::RequestHea | |||
return Http::FilterHeadersStatus::Continue; | |||
} | |||
|
|||
if (config_->getController().averageRps() < config_->rpsThreshold()) { | |||
return Http::FilterHeadersStatus::Continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider throwing a debug log in here.
@@ -17,6 +17,16 @@ ThreadLocalControllerImpl::ThreadLocalControllerImpl(TimeSource& time_source, | |||
std::chrono::seconds sampling_window) | |||
: time_source_(time_source), sampling_window_(sampling_window) {} | |||
|
|||
uint32_t ThreadLocalControllerImpl::averageRps() const { | |||
// In the very beginning of sampling, the average rps might be very big, so just skipping from it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite understand what you mean here. Why do you want to do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first few requests in the first second may have a short interval, resulting in a big average rps. So skip the calculation of the first second.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, because you're allowing fractional seconds due to the age being measured in microseconds.
You can just set the min sample age to 1sec. Cast to std::chrono::seconds instead of a double and do the integer division. There's no reason for the extra accuracy if the config is specified in uint32.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds reasonable.
// Validate max rejection probability | ||
EXPECT_CALL(runtime_.snapshot_, getDouble("foo.aggression", 1.0)).WillRepeatedly(Return(1.0)); | ||
EXPECT_CALL(runtime_.snapshot_, getDouble("foo.threshold", 100.0)).WillRepeatedly(Return(100.0)); | ||
EXPECT_CALL(runtime_.snapshot_, getDouble("foo.max_rejection_probability", 80.0)) | ||
.WillRepeatedly(Return(10.0)); | ||
|
||
verifyProbabilities(100, 0.0); | ||
verifyProbabilities(95, 0.05); | ||
verifyProbabilities(80, 0.1); | ||
verifyProbabilities(0, 0.1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make this its own test case.
// Validate historical_data_.size() < 2 | ||
tlc_.recordSuccess(); | ||
tlc_.recordFailure(); | ||
EXPECT_EQ(0, tlc_.averageRps()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I commented on this above, but just reiterating I'm a bit confused as to the purpose.
@@ -100,6 +100,34 @@ TEST_F(ThreadLocalControllerTest, VerifyMemoryUsage) { | |||
EXPECT_EQ(RequestData(3, 3), tlc_.requestCounts()); | |||
} | |||
|
|||
// Test for function: averageRps |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Terminate the comments with periods pls.
/wait |
Signed-off-by: gaoweiwen <whereisgww@outlook.com>
/retest |
Retrying Azure Pipelines: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left 1 more comment. Looking good otherwise.
/wait
Signed-off-by: gaoweiwen <whereisgww@outlook.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, you'll also need to update the release notes. See:
https://github.com/envoyproxy/envoy/blob/main/docs/root/version_history/current.rst
Mention the new features and a change in the default behavior (max rejection probability). We should be good for a maintainer to take a look once that's done.
using namespace std::chrono; | ||
auto count_of_seconds = duration_cast<seconds>(ageOfOldestSample()).count(); | ||
|
||
return count_of_seconds == 0 ? 0 : global_data_.requests / count_of_seconds; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can just do something like below. There's just less going on that way IMO.
using namespace std::chrono; | |
auto count_of_seconds = duration_cast<seconds>(ageOfOldestSample()).count(); | |
return count_of_seconds == 0 ? 0 : global_data_.requests / count_of_seconds; | |
using std::chrono::seconds; | |
seconds secs = std::max(seconds(1), ageOfOldestSample()); | |
return global_data_.requests / secs.count(); |
/wait |
Signed-off-by: gaoweiwen <whereisgww@outlook.com>
Signed-off-by: gaoweiwen <whereisgww@outlook.com>
Signed-off-by: gaoweiwen <whereisgww@outlook.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM. I think it's ready for a maintainer pass @mattklein123
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One small doc question/comment, thanks.
/wait
* Requests will never be rejeted from this filter if the RPS is lower than 5. | ||
* Rejection probability will never exceed 80% even if the failure rate is 100%. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is true and these values are clamped, can you make this more clear in the API docs? If it's not true can you make it more clear here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't true all the time-- it's just explaining how to read the configuration above. Github cut out this line just above the additions:
The above configuration can be understood as follows:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah OK sorry, I missed that, thanks.
/retest |
Retrying Azure Pipelines: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
I think this may be causing CI flakes could you take a look? |
…oyproxy#16742) Signed-off-by: gaoweiwen <whereisgww@outlook.com>
Signed-off-by: gaoweiwen whereisgww@outlook.com
Commit Message:
Add two configuration items to the admission_control filter.
rps_threshold
indicates the minimum RPS threshold.max_rejection_probability
indicates the maximum throttling percentage.details in issue: #16392
Additional Description:
Risk Level: Medium
Testing: Unit test and integration test
Docs Changes: admission_control_filter.rst
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Deprecated:]
[Optional API Considerations:]