Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add rps_threshold & max_rejection_probability #16742

Merged
merged 8 commits into from
Jun 11, 2021

Conversation

WeavingGao
Copy link
Contributor

Signed-off-by: gaoweiwen whereisgww@outlook.com

Commit Message:
Add two configuration items to the admission_control filter.

  1. rps_threshold indicates the minimum RPS threshold.
  2. max_rejection_probability indicates the maximum throttling percentage.

details in issue: #16392

Additional Description:
Risk Level: Medium
Testing: Unit test and integration test
Docs Changes: admission_control_filter.rst
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Deprecated:]
[Optional API Considerations:]

Signed-off-by: gaoweiwen <whereisgww@outlook.com>
@repokitteh-read-only
Copy link

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/envoy/.
envoyproxy/api-shepherds assignee is @mattklein123
CC @envoyproxy/api-watchers: FYI only for changes made to api/envoy/.

🐱

Caused by: #16742 was opened by WeavingGao.

see: more, trace.

@WeavingGao
Copy link
Contributor Author

I'm sorry that this PR came a little late since the last discussion in the related issue(#16392) , because some other things have filled up my time recently. Anyway, it's ready to be reviewed.
cc @tonya11en @mattklein123

Signed-off-by: gaoweiwen <whereisgww@outlook.com>
@mattklein123
Copy link
Member

Assigning over to @tonya11en for first pass.

Copy link
Member

@tonya11en tonya11en left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Included some high-level comments on the approach to the traffic threshold.

Comment on lines 95 to 98
// If the number of requests in the last second is less than this threshold, the request
// will not be rejected, even if the success rate is lower than sr_threshold.
// Defaults to 0.
config.core.v3.RuntimeUInt32 rps_threshold = 6;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this should only consider the last 1 second. It would make more sense to look at something like the minimum number of samples in the sampling window. It would still accomplish the same thing, but give better behavior for bursty request patterns.

Would that work for your particular use-case?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be more clear, you can still represent what I'm suggesting as an RPS threshold. It would just be an "average RPS" threshold across the entirety of the sampling window.

Comment on lines 27 to 32
P_{reject} = \left\{
\begin{array}{cl}
0 & \ (rps < rps\_threshold) \\
min({(\frac{n_{total} - s}{n_{total} + 1})}^\frac{1}{aggression}\ ,\ max\_reject\_probability) & \ (rps \geq rps\_threshold)
\end{array}
\right.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you include a screenshot of this rendered in the docs?

I also don't think it's necessary to change the function here as you're introducing circumstances under which this function applies, not changes to the behavior of the function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

screenshot here

Your comment reminds me that this does make the formula more complicated. It may be more appropriate to explain these two configuration items after the original formula.

@tonya11en
Copy link
Member

/wait

Signed-off-by: gaoweiwen <whereisgww@outlook.com>
@WeavingGao
Copy link
Contributor Author

Changed LastSampleCounts to averageRps and updated the documentation.

Copy link
Member

@tonya11en tonya11en left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this. Tests are very thorough!

The approach makes sense, so there's just a few minor comments.

@@ -41,6 +41,11 @@ where,
rejection probability will be higher for higher success rates. See `Aggression`_ for a more
detailed explanation.

In addition, there are two more configuration item:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
In addition, there are two more configuration item:
Note that there are additional parameters that affect the rejection probability:

@@ -68,6 +87,10 @@ Http::FilterHeadersStatus AdmissionControlFilter::decodeHeaders(Http::RequestHea
return Http::FilterHeadersStatus::Continue;
}

if (config_->getController().averageRps() < config_->rpsThreshold()) {
return Http::FilterHeadersStatus::Continue;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider throwing a debug log in here.

@@ -17,6 +17,16 @@ ThreadLocalControllerImpl::ThreadLocalControllerImpl(TimeSource& time_source,
std::chrono::seconds sampling_window)
: time_source_(time_source), sampling_window_(sampling_window) {}

uint32_t ThreadLocalControllerImpl::averageRps() const {
// In the very beginning of sampling, the average rps might be very big, so just skipping from it.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand what you mean here. Why do you want to do this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first few requests in the first second may have a short interval, resulting in a big average rps. So skip the calculation of the first second.

Copy link
Member

@tonya11en tonya11en Jun 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, because you're allowing fractional seconds due to the age being measured in microseconds.

You can just set the min sample age to 1sec. Cast to std::chrono::seconds instead of a double and do the integer division. There's no reason for the extra accuracy if the config is specified in uint32.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable.

Comment on lines +375 to +384
// Validate max rejection probability
EXPECT_CALL(runtime_.snapshot_, getDouble("foo.aggression", 1.0)).WillRepeatedly(Return(1.0));
EXPECT_CALL(runtime_.snapshot_, getDouble("foo.threshold", 100.0)).WillRepeatedly(Return(100.0));
EXPECT_CALL(runtime_.snapshot_, getDouble("foo.max_rejection_probability", 80.0))
.WillRepeatedly(Return(10.0));

verifyProbabilities(100, 0.0);
verifyProbabilities(95, 0.05);
verifyProbabilities(80, 0.1);
verifyProbabilities(0, 0.1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make this its own test case.

Comment on lines 109 to 112
// Validate historical_data_.size() < 2
tlc_.recordSuccess();
tlc_.recordFailure();
EXPECT_EQ(0, tlc_.averageRps());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I commented on this above, but just reiterating I'm a bit confused as to the purpose.

@@ -100,6 +100,34 @@ TEST_F(ThreadLocalControllerTest, VerifyMemoryUsage) {
EXPECT_EQ(RequestData(3, 3), tlc_.requestCounts());
}

// Test for function: averageRps
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Terminate the comments with periods pls.

@tonya11en
Copy link
Member

/wait

Signed-off-by: gaoweiwen <whereisgww@outlook.com>
@WeavingGao
Copy link
Contributor Author

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Check envoy-presubmit isn't fully completed, but will still attempt retrying.
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #16742 (comment) was created by @WeavingGao.

see: more, trace.

Copy link
Member

@tonya11en tonya11en left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left 1 more comment. Looking good otherwise.

/wait

Signed-off-by: gaoweiwen <whereisgww@outlook.com>
Copy link
Member

@tonya11en tonya11en left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, you'll also need to update the release notes. See:
https://github.com/envoyproxy/envoy/blob/main/docs/root/version_history/current.rst

Mention the new features and a change in the default behavior (max rejection probability). We should be good for a maintainer to take a look once that's done.

Comment on lines 24 to 27
using namespace std::chrono;
auto count_of_seconds = duration_cast<seconds>(ageOfOldestSample()).count();

return count_of_seconds == 0 ? 0 : global_data_.requests / count_of_seconds;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just do something like below. There's just less going on that way IMO.

Suggested change
using namespace std::chrono;
auto count_of_seconds = duration_cast<seconds>(ageOfOldestSample()).count();
return count_of_seconds == 0 ? 0 : global_data_.requests / count_of_seconds;
using std::chrono::seconds;
seconds secs = std::max(seconds(1), ageOfOldestSample());
return global_data_.requests / secs.count();

@tonya11en
Copy link
Member

/wait

Signed-off-by: gaoweiwen <whereisgww@outlook.com>
Signed-off-by: gaoweiwen <whereisgww@outlook.com>
Signed-off-by: gaoweiwen <whereisgww@outlook.com>
Copy link
Member

@tonya11en tonya11en left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM. I think it's ready for a maintainer pass @mattklein123

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small doc question/comment, thanks.

/wait

Comment on lines +124 to +125
* Requests will never be rejeted from this filter if the RPS is lower than 5.
* Rejection probability will never exceed 80% even if the failure rate is 100%.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is true and these values are clamped, can you make this more clear in the API docs? If it's not true can you make it more clear here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't true all the time-- it's just explaining how to read the configuration above. Github cut out this line just above the additions:

The above configuration can be understood as follows:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah OK sorry, I missed that, thanks.

@WeavingGao
Copy link
Contributor Author

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #16742 (comment) was created by @WeavingGao.

see: more, trace.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@mattklein123 mattklein123 merged commit ce1882c into envoyproxy:main Jun 11, 2021
@alyssawilk
Copy link
Contributor

alyssawilk commented Jun 22, 2021

I think this may be causing CI flakes
#17067

could you take a look?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants