-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fault injection: add support for setting gRPC status #10841
Conversation
Bringing fork up to date
Sync to base
Signed-off-by: Vijay Rajput <vrajput@lyft.com>
Signed-off-by: Vijay Rajput <vrajput@lyft.com>
/lgtm api |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks generally looks great! Flushing out a few comments to get started. Thank you!
/wait
@@ -41,6 +42,9 @@ message FaultAbort { | |||
// HTTP status code to use to abort the HTTP request. | |||
uint32 http_status = 2 [(validate.rules).uint32 = {lt: 600 gte: 200}]; | |||
|
|||
// gRPC status code to use to abort the gRPC request. | |||
uint32 grpc_status = 5 [(validate.rules).uint32 = {lte: 16}]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should drop the "lte" restriction here so that we can test unknown gRPC codes above those well defined. This is a really common source of bugs (including in envoy itself) so it will be nice to be able to test this. I might clarify in the docs that we allow testing any numeric code, even those that are not well defined. (docs need updating also)
@@ -42,6 +42,12 @@ x-envoy-fault-abort-request | |||
In order for the header to work, :ref:`header_abort | |||
<envoy_api_field_config.filter.http.fault.v2.FaultAbort.header_abort>` needs to be set. | |||
|
|||
x-envoy-fault-abort-grpc-request |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we have a % header here also? cc @Augustyniak
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think that we should add a support for x-envoy-fault-abort-grpc-request-percentage
HTTP header (whose implementation would be based of how x-envoy-fault-abort-request-percentage
HTTP header is defined).
Together with @buildbreaker @murki we discussed potential scenarios where having x-envoy-fault-abort-grpc-request-percentage
HTTP header would be helpful to test client-side implementation of some of the GRPC requests that Lyft mobile applications perform. This lets me believe that the addition of such header will be useful for engineers from other companies too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I missed one thing when I posted my initial comment here.
Initially, I thought that GRPC aborts would be applied on top of http aborts meaning something like:
- Requests comes in
- Check whether to abort the request with http fault - look at
x-envoy-fault-abort-request-percentage
andx-envoy-fault-abort-request
HTTP headers. - If no http fault was applied (because percentage was <100% and we decided not to apply HTTP fault) check whether grpc faults should be applied - look at
x-envoy-fault-abort-grpc-request-percentage
andx-envoy-fault-abort-grpc-request
HTTP headers.
and for this reason I thought that we needed the addition of x-envoy-fault-abort-grpc-request-percentage
HTTP header.
Now that I know that the implementation looks more like apply "either HTTP or GRPC abort" (only one of them) I don't see any benefits of us having x-envoy-fault-abort-grpc-request-percentage
HTTP header. Don't get me wrong, as far as I can tell current implementation works correctly and it unlocks GRPC abort tests for mobile people at Lyft but the addition of x-envoy-fault-abort-grpc-request-percentage
HTTP header itself doesn't add any value in here - we could continue to use x-envoy-fault-abort-request-percentage
HTTP request header and functionality-wise we wouldn't lose anything. We could keep it but I think that removal makes a little more sense since it would keep the implementation simpler.
I really feel bad asking you to do this @vijayrajput1 but could we get rid of x-envoy-fault-abort-grpc-request-percentage
HTTP header and use x-envoy-fault-abort-request-percentage
for both HTTP and GRPC aborts? My deep apologizes here - I know that the addition of this header costed you a significant amount of work in the first place so it really feels bad to ask you to revert some of your changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Augustyniak , I actually had this (using one header) approach in mind earlier but then I kinda convinced myself having a matching header would make things more clear. Meaning use x-envoy-fault-abort-grpc-request-percentage
for x-envoy-fault-abort-grpc-request
vs use x-envoy-fault-abort-request-percentage
for x-envoy-fault-abort-grpc-request
. I should have raised this point earlier. I'm okay, with either of those approaches - my take is if having a separate header x-envoy-fault-abort-grpc-request-percentage
isn't making things anymore clear than we should use only one header i.e x-envoy-fault-abort-request-percentage
- @mattklein123 any preference ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah agreed on using 1 header. Sorry, I didn't look very deeply at this before punting review over to @Augustyniak. Apologies also!
@@ -70,31 +75,44 @@ class FaultAbortConfig { | |||
|
|||
// Return the HTTP status code to use. Optionally passed HTTP headers that may contain the | |||
// HTTP status code depending on the provider implementation. | |||
virtual absl::optional<Http::Code> | |||
statusCode(const Http::RequestHeaderMap* request_headers) const PURE; | |||
virtual absl::optional<Http::Code> httpStatusCode(const Http::HeaderEntry* header) const PURE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not continue to pass the full header map here and below? It seems like it would be more clear if the header was referenced closer to where it is used?
@@ -42,6 +42,12 @@ x-envoy-fault-abort-request | |||
In order for the header to work, :ref:`header_abort | |||
<envoy_api_field_config.filter.http.fault.v2.FaultAbort.header_abort>` needs to be set. | |||
|
|||
x-envoy-fault-abort-grpc-request |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think that we should add a support for x-envoy-fault-abort-grpc-request-percentage
HTTP header (whose implementation would be based of how x-envoy-fault-abort-request-percentage
HTTP header is defined).
Together with @buildbreaker @murki we discussed potential scenarios where having x-envoy-fault-abort-grpc-request-percentage
HTTP header would be helpful to test client-side implementation of some of the GRPC requests that Lyft mobile applications perform. This lets me believe that the addition of such header will be useful for engineers from other companies too.
|
||
// If gRPC status code is set, then HTTP will be set to 200 | ||
return grpc_status.has_value() ? std::make_pair(Http::Code::OK, grpc_status) | ||
: std::make_pair(abortHttpStatus(request_headers), grpc_status); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we pass absl::nullptr
instead of grpc_status
here to make it clearer that we don't return GRPC status code when we return an HTTP status code from abortHttpStatus(request_headers)
?
auto grpc_status = fault_settings_->requestAbort()->grpcStatusCode( | ||
request_headers.get(Filters::Common::Fault::HeaderNames::get().AbortGrpcRequest)); | ||
|
||
auto default_grpc_status_code = grpc_status.has_value() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we return an absl::nullptr
from here if grpc_status
doesn't have a value? We do a similar thing in the implementation of abortStatus
method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
intention here is to support setting gRPC status even through just runtime config. For example, even if abort fault filter configuration and header doesn't have grpc_status but if the abort_grpc_status_runtime is set we still want to set the grpc_status. Does this make sense ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use case you want to support definitely makes sense. 👍
I'm just a little bit worried that GRPC aborts will end up using proto config in a way that's different from the way other faults use their protos. "Fixed" fault providers - including HTTP status code, response rate limit and delay request - providers, all require proto config values to be present. For the user of the Envoy, it may be a little bit confusing if for GRPC aborts we follow a different convention.
That being said, if runtime-only config is the thing you want to support I'm going to leave it up to more Envoy experienced folks to decide whether it's ok or no to have this fault provider not require full proto config (no GRPC status code required in proto config).
//EDIT
It may fill a little bit off to have a GRPC abort percentage defined but no GRPC abort status code defined in proto config even if fixed GRPC aborts are used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @Augustyniak that we should be consistent here. I'm going to defer to @Augustyniak for initial review of this PR and I will do a pass after. :)
and address review feedback. Signed-off-by: Vijay Rajput <vrajput@lyft.com>
Signed-off-by: Vijay Rajput <vrajput@lyft.com>
Signed-off-by: Vijay Rajput <vrajput@lyft.com>
Signed-off-by: Vijay Rajput <vrajput@lyft.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for addressing previous comments! It looks good but I left a few questions regarding cases for when we have http and grpc abort faults defined for a given request.
envoy::type::v3::FractionalPercent FaultAbortConfig::HeaderAbortProvider::percentage( | ||
const Http::RequestHeaderMap* request_headers) const { | ||
// if the abort fault is for grpc status, then use grpc fault request percentage. | ||
if (request_headers != nullptr && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't need to check for whether request_headers
are nullptr here - it should never happen. Per my other discussion with @mattklein123 here I believe that we don't want to do this kind of defensive programming if we know that a given condition (existing request_headers in this case) is always fulfilled.
@@ -13,6 +13,10 @@ namespace Fault { | |||
|
|||
envoy::type::v3::FractionalPercent | |||
HeaderPercentageProvider::percentage(const Http::RequestHeaderMap* request_headers) const { | |||
if (request_headers == nullptr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly to my other comment - request_headers
should be always available here (!= nullptr) and for this reason we don't need to have this check here. I had a discussion about this with Matt here #10724 (comment).
const Http::RequestHeaderMap* request_headers) const { | ||
absl::optional<Http::Code> ret; | ||
const auto header = request_headers->get(HeaderNames::get().AbortRequest); | ||
if (request_headers == nullptr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above - no need to check for nullptr here.
@@ -64,6 +79,37 @@ absl::optional<Http::Code> FaultAbortConfig::HeaderAbortProvider::statusCode( | |||
return ret; | |||
} | |||
|
|||
absl::optional<Grpc::Status::GrpcStatus> FaultAbortConfig::HeaderAbortProvider::grpcStatusCode( | |||
const Http::RequestHeaderMap* request_headers) const { | |||
if (request_headers == nullptr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to have this check here - the same reasoning as in one of the above comments.
return static_cast<Grpc::Status::GrpcStatus>(code); | ||
} | ||
|
||
envoy::type::v3::FractionalPercent FaultAbortConfig::HeaderAbortProvider::percentage( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure whether with this implementation abort faults always work the way we want them to work. Let me give you an example that I was able to come up with and you may let me know whether my concerns are valid (in this case we should probably fix the issue) or whether I'm missing something here.
Example:
Let's say that I send a request with the following HTTP headers:
x-envoy-fault-abort-grpc-request-percentage
:20
x-envoy-fault-abort-request
:500
and a proto config telling Envoy to look for faults in HTTP headers.
Behavior:
There is 20% change that we apply 500 HTTP status code abort fault.
Expected behavior:
Per our documentation, in this case we should ingest HTTP status code abort fault in % of requests that's controlled by percentage field from our config file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure whether you will need to do this but you may need to create separate fault providers for HTTP and GRPC aborts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the current implementation, x-envoy-fault-abort-grpc-request-percentage
value will be used only when x-envoy-fault-abort-grpc-request
header is passed in. For the above example, we will end up using the percentage value from config file as mentioned in the documentation.
Here is the control logic for it, let me know if the comment and code around it is not clear, I can reword it. I'll add a unit test to capture this scenario.
// if the abort fault is for grpc status, then use grpc fault request percentage.
// if the abort fault is for grpc status, then use grpc fault request percentage. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see 👍 Thanks for pointing this out - your are right, it works just fine as it is now. The only thing that I could be a little bit picky about here is the fact that we read the value x-envoy-fault-abort-grpc-request
HTTP header twice and apparently (see this convo for more details) reading HTTP headers is not cheap performance wise.
Maybe we could read x-envoy-fault-abort-request
and x-envoy-fault-grpc-request
headers first and pass it to percentage
method for it to determine which percentage should be used? Better yet, if we read x-envoy-fault-abort-request
first and it's present we don't even need to check the value of x-envoy-fault-abort-grpc-request
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this particular case it's not a blocker tho - just something to think about (maybe we can address it if it's easily fixable, if not I think that we can keep it the way it is)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Augustyniak, I couldn’t find a clean and easy way to avoid this check, so leaving it as is, but I’ve changed the precedence of http_status and grpc_status per recommendation.
@@ -18,21 +18,49 @@ TEST(FaultConfigTest, FaultAbortHeaderConfig) { | |||
proto_config.mutable_header_abort(); | |||
FaultAbortConfig config(proto_config); | |||
|
|||
// No header. | |||
EXPECT_EQ(absl::nullopt, config.httpStatusCode(nullptr)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This cannot happen in prod. See my previous comments related to this case.
auto grpc_status = abortGrpcStatus(request_headers); | ||
|
||
// If gRPC status code is set, then HTTP will be set to Http::Code::OK (200). | ||
return grpc_status.has_value() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way it's implemented now we prioritize grpc faults over http faults. Not sure whether it's a requirement for your implementation but to me http faults should take a precedence over grpc faults since grpc faults are kind of build on top of http requests -> so say it differently, my intuition tell me that grpc faults can happen only if http request succeeds.
Let's say that I send
x-envoy-fault-abort-request: 500
and x-envoy-fault-abort-grpc-request: 14
HTTP headers as part of my request. In this case, we end up with a request that finishes with a 200 HTTP status code and 14 grpc status code. I wouldn't not expect to receive 200 HTTP status code if I send x-envoy-fault-abort-request: 500
HTTP header.
What would you say to the idea of changing the order here? Check for the HTTP status code abort header first and look into grpc status code abort header only if the http status code abort header is not present.
Going back to my example, if we change the order and send x-envoy-fault-abort-request: 500
and x-envoy-fault-abort-grpc-request: 14
HTTP headers. We end up with a request that fails with 500 HTTP status code, it feels OK that we didn't receive 14 grpc status code since the underlying HTTP request failed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed the precedence order favoring http_status over grpc_status headers. And more importantly, updated the documentation on what to expect during such conflicts.
Http::TestRequestHeaderMapImpl good_headers{{"x-envoy-fault-abort-grpc-request", "5"}}; | ||
EXPECT_EQ(Grpc::Status::NotFound, config.grpcStatusCode(&good_headers)); | ||
|
||
// Valid header - unknown gRPC code (>0). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this comment doesn't seem right
Thank you @Augustyniak for the good feedback - appreciated! 🙏 Regarding, http_status vs grpc_status precedence when both headers are present - we can surely reverse the order to use grpc_status value only when http_status is not present and/or we can explicitly call out in the documentation to use only one of those header to make it similar to how its enforced in proto's via |
1. If both x-envoy-fault-abort-grpc-request and x-envoy-fault-abort-request headers are set, prefer x-envoy-fault-abort-request. 2. Remove nullptr checks per feedback. Signed-off-by: Vijay Rajput <vrajput@lyft.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left one comment, other than this I don't have any other comments in this PR. It looks great! Thank you for being so easy to work with. I'm going to approve once you respond to/address the comment I left.
Signed-off-by: Vijay Rajput <vrajput@lyft.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me now! One tiny style related comment.
Thank you for your active collaboration here - I think that it helped to make this PR better.
const Http::RequestHeaderMap* request_headers) const { | ||
absl::optional<Http::Code> ret; | ||
const auto header = request_headers->get(HeaderNames::get().AbortRequest); | ||
absl::optional<Http::Code> ret = absl::nullopt; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that we need = absl::nullopt
part here - I think that we should get rid of it if it's not needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep its not required, I put that in there to make it apparent. I had to read absl::optional
default details to find out this, so I thought making it apparent might help readability. I can remove it if its something basic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's fine either way. I wouldn't worry about it unless we need another push.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work, thanks!
Signed-off-by: Spencer Lewis <slewis@squareup.com> * master: fault injection: add support for setting gRPC status (envoyproxy#10841) tests: tag tests that fail on Windows with fails_on_windows (envoyproxy#10940) Fix typo on Postgres Proxy documentation. (envoyproxy#10930) fuzz: improve header/data stop/continue modeling in HCM fuzzer. (envoyproxy#10931) gzip filter: allow setting zlib compressor's chunk size (envoyproxy#10508) http: replace vector/reserve with InlinedVector in codec helper (envoyproxy#10941) stats: add utilities to create stats from a vector of tokens, mixing dynamic and symbolic elements. (envoyproxy#10735) hcm: avoid invoking 100-continue handling on decode filter. (envoyproxy#10929) prometheus stats: Correctly group lines of the same metric name. (envoyproxy#10833) status: Fix ASAN error in Status payload handling (envoyproxy#10906) path: Fix merge slash for paths ending with slash and present query args (envoyproxy#10922) compressor filter: add benchmark (envoyproxy#10464) xray: expected_span_name is not captured by the lambda with MSVC (envoyproxy#10934) ci: update before purge in cleanup (envoyproxy#10938) tracer: Improve test coverage for x-ray (envoyproxy#10890) Revert "init: order dynamic resource initialization to make RTDS always be first (envoyproxy#10362)" (envoyproxy#10919)
Description: Adds an option to specify gRPC status via config and header, to be used in fault injection response. More details #10323
Risk Level: Low (new feature)
Testing: Added unit and integration tests
Docs Changes: Updated
Release Notes: Updated
Fixes #10323
cc @wgallagher @mattklein123