Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fault injection: add support for setting gRPC status #10841

Merged
merged 10 commits into from
Apr 27, 2020

Conversation

vijayrajput1
Copy link
Contributor

@vijayrajput1 vijayrajput1 commented Apr 19, 2020

Description: Adds an option to specify gRPC status via config and header, to be used in fault injection response. More details #10323

Risk Level: Low (new feature)
Testing: Added unit and integration tests
Docs Changes: Updated
Release Notes: Updated
Fixes #10323

cc @wgallagher @mattklein123

vijayrajput1 and others added 3 commits April 9, 2020 09:22
@repokitteh-read-only
Copy link

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/.

🐱

Caused by: #10841 was opened by vijayrajput1.

see: more, trace.

Signed-off-by: Vijay Rajput <vrajput@lyft.com>
@vijayrajput1 vijayrajput1 marked this pull request as ready for review April 19, 2020 13:53
@htuch
Copy link
Member

htuch commented Apr 20, 2020

/lgtm api

@mattklein123 mattklein123 self-assigned this Apr 20, 2020
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks generally looks great! Flushing out a few comments to get started. Thank you!

/wait

@@ -41,6 +42,9 @@ message FaultAbort {
// HTTP status code to use to abort the HTTP request.
uint32 http_status = 2 [(validate.rules).uint32 = {lt: 600 gte: 200}];

// gRPC status code to use to abort the gRPC request.
uint32 grpc_status = 5 [(validate.rules).uint32 = {lte: 16}];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should drop the "lte" restriction here so that we can test unknown gRPC codes above those well defined. This is a really common source of bugs (including in envoy itself) so it will be nice to be able to test this. I might clarify in the docs that we allow testing any numeric code, even those that are not well defined. (docs need updating also)

@@ -42,6 +42,12 @@ x-envoy-fault-abort-request
In order for the header to work, :ref:`header_abort
<envoy_api_field_config.filter.http.fault.v2.FaultAbort.header_abort>` needs to be set.

x-envoy-fault-abort-grpc-request
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have a % header here also? cc @Augustyniak

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think that we should add a support for x-envoy-fault-abort-grpc-request-percentage HTTP header (whose implementation would be based of how x-envoy-fault-abort-request-percentage HTTP header is defined).

Together with @buildbreaker @murki we discussed potential scenarios where having x-envoy-fault-abort-grpc-request-percentage HTTP header would be helpful to test client-side implementation of some of the GRPC requests that Lyft mobile applications perform. This lets me believe that the addition of such header will be useful for engineers from other companies too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I missed one thing when I posted my initial comment here.

Initially, I thought that GRPC aborts would be applied on top of http aborts meaning something like:

  1. Requests comes in
  2. Check whether to abort the request with http fault - look at x-envoy-fault-abort-request-percentage and x-envoy-fault-abort-request HTTP headers.
  3. If no http fault was applied (because percentage was <100% and we decided not to apply HTTP fault) check whether grpc faults should be applied - look at x-envoy-fault-abort-grpc-request-percentage and x-envoy-fault-abort-grpc-request HTTP headers.

and for this reason I thought that we needed the addition of x-envoy-fault-abort-grpc-request-percentage HTTP header.

Now that I know that the implementation looks more like apply "either HTTP or GRPC abort" (only one of them) I don't see any benefits of us having x-envoy-fault-abort-grpc-request-percentage HTTP header. Don't get me wrong, as far as I can tell current implementation works correctly and it unlocks GRPC abort tests for mobile people at Lyft but the addition of x-envoy-fault-abort-grpc-request-percentage HTTP header itself doesn't add any value in here - we could continue to use x-envoy-fault-abort-request-percentage HTTP request header and functionality-wise we wouldn't lose anything. We could keep it but I think that removal makes a little more sense since it would keep the implementation simpler.

I really feel bad asking you to do this @vijayrajput1 but could we get rid of x-envoy-fault-abort-grpc-request-percentage HTTP header and use x-envoy-fault-abort-request-percentage for both HTTP and GRPC aborts? My deep apologizes here - I know that the addition of this header costed you a significant amount of work in the first place so it really feels bad to ask you to revert some of your changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Augustyniak , I actually had this (using one header) approach in mind earlier but then I kinda convinced myself having a matching header would make things more clear. Meaning use x-envoy-fault-abort-grpc-request-percentage for x-envoy-fault-abort-grpc-request vs use x-envoy-fault-abort-request-percentage for x-envoy-fault-abort-grpc-request. I should have raised this point earlier. I'm okay, with either of those approaches - my take is if having a separate header x-envoy-fault-abort-grpc-request-percentage isn't making things anymore clear than we should use only one header i.e x-envoy-fault-abort-request-percentage - @mattklein123 any preference ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah agreed on using 1 header. Sorry, I didn't look very deeply at this before punting review over to @Augustyniak. Apologies also!

@@ -70,31 +75,44 @@ class FaultAbortConfig {

// Return the HTTP status code to use. Optionally passed HTTP headers that may contain the
// HTTP status code depending on the provider implementation.
virtual absl::optional<Http::Code>
statusCode(const Http::RequestHeaderMap* request_headers) const PURE;
virtual absl::optional<Http::Code> httpStatusCode(const Http::HeaderEntry* header) const PURE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not continue to pass the full header map here and below? It seems like it would be more clear if the header was referenced closer to where it is used?

@@ -42,6 +42,12 @@ x-envoy-fault-abort-request
In order for the header to work, :ref:`header_abort
<envoy_api_field_config.filter.http.fault.v2.FaultAbort.header_abort>` needs to be set.

x-envoy-fault-abort-grpc-request
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think that we should add a support for x-envoy-fault-abort-grpc-request-percentage HTTP header (whose implementation would be based of how x-envoy-fault-abort-request-percentage HTTP header is defined).

Together with @buildbreaker @murki we discussed potential scenarios where having x-envoy-fault-abort-grpc-request-percentage HTTP header would be helpful to test client-side implementation of some of the GRPC requests that Lyft mobile applications perform. This lets me believe that the addition of such header will be useful for engineers from other companies too.


// If gRPC status code is set, then HTTP will be set to 200
return grpc_status.has_value() ? std::make_pair(Http::Code::OK, grpc_status)
: std::make_pair(abortHttpStatus(request_headers), grpc_status);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we pass absl::nullptr instead of grpc_status here to make it clearer that we don't return GRPC status code when we return an HTTP status code from abortHttpStatus(request_headers) ?

auto grpc_status = fault_settings_->requestAbort()->grpcStatusCode(
request_headers.get(Filters::Common::Fault::HeaderNames::get().AbortGrpcRequest));

auto default_grpc_status_code = grpc_status.has_value()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we return an absl::nullptr from here if grpc_status doesn't have a value? We do a similar thing in the implementation of abortStatus method.

Copy link
Contributor Author

@vijayrajput1 vijayrajput1 Apr 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

intention here is to support setting gRPC status even through just runtime config. For example, even if abort fault filter configuration and header doesn't have grpc_status but if the abort_grpc_status_runtime is set we still want to set the grpc_status. Does this make sense ?

Copy link
Contributor

@Augustyniak Augustyniak Apr 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use case you want to support definitely makes sense. 👍

I'm just a little bit worried that GRPC aborts will end up using proto config in a way that's different from the way other faults use their protos. "Fixed" fault providers - including HTTP status code, response rate limit and delay request - providers, all require proto config values to be present. For the user of the Envoy, it may be a little bit confusing if for GRPC aborts we follow a different convention.

That being said, if runtime-only config is the thing you want to support I'm going to leave it up to more Envoy experienced folks to decide whether it's ok or no to have this fault provider not require full proto config (no GRPC status code required in proto config).

//EDIT
It may fill a little bit off to have a GRPC abort percentage defined but no GRPC abort status code defined in proto config even if fixed GRPC aborts are used.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @Augustyniak that we should be consistent here. I'm going to defer to @Augustyniak for initial review of this PR and I will do a pass after. :)

and address review feedback.

Signed-off-by: Vijay Rajput <vrajput@lyft.com>
Signed-off-by: Vijay Rajput <vrajput@lyft.com>
Signed-off-by: Vijay Rajput <vrajput@lyft.com>
Signed-off-by: Vijay Rajput <vrajput@lyft.com>
Copy link
Contributor

@Augustyniak Augustyniak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for addressing previous comments! It looks good but I left a few questions regarding cases for when we have http and grpc abort faults defined for a given request.

envoy::type::v3::FractionalPercent FaultAbortConfig::HeaderAbortProvider::percentage(
const Http::RequestHeaderMap* request_headers) const {
// if the abort fault is for grpc status, then use grpc fault request percentage.
if (request_headers != nullptr &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't need to check for whether request_headers are nullptr here - it should never happen. Per my other discussion with @mattklein123 here I believe that we don't want to do this kind of defensive programming if we know that a given condition (existing request_headers in this case) is always fulfilled.

@@ -13,6 +13,10 @@ namespace Fault {

envoy::type::v3::FractionalPercent
HeaderPercentageProvider::percentage(const Http::RequestHeaderMap* request_headers) const {
if (request_headers == nullptr) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly to my other comment - request_headers should be always available here (!= nullptr) and for this reason we don't need to have this check here. I had a discussion about this with Matt here #10724 (comment).

const Http::RequestHeaderMap* request_headers) const {
absl::optional<Http::Code> ret;
const auto header = request_headers->get(HeaderNames::get().AbortRequest);
if (request_headers == nullptr) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above - no need to check for nullptr here.

@@ -64,6 +79,37 @@ absl::optional<Http::Code> FaultAbortConfig::HeaderAbortProvider::statusCode(
return ret;
}

absl::optional<Grpc::Status::GrpcStatus> FaultAbortConfig::HeaderAbortProvider::grpcStatusCode(
const Http::RequestHeaderMap* request_headers) const {
if (request_headers == nullptr) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to have this check here - the same reasoning as in one of the above comments.

return static_cast<Grpc::Status::GrpcStatus>(code);
}

envoy::type::v3::FractionalPercent FaultAbortConfig::HeaderAbortProvider::percentage(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure whether with this implementation abort faults always work the way we want them to work. Let me give you an example that I was able to come up with and you may let me know whether my concerns are valid (in this case we should probably fix the issue) or whether I'm missing something here.

Example:

Let's say that I send a request with the following HTTP headers:

  • x-envoy-fault-abort-grpc-request-percentage: 20
  • x-envoy-fault-abort-request: 500
    and a proto config telling Envoy to look for faults in HTTP headers.

Behavior:
There is 20% change that we apply 500 HTTP status code abort fault.

Expected behavior:
Per our documentation, in this case we should ingest HTTP status code abort fault in % of requests that's controlled by percentage field from our config file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure whether you will need to do this but you may need to create separate fault providers for HTTP and GRPC aborts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current implementation, x-envoy-fault-abort-grpc-request-percentage value will be used only when x-envoy-fault-abort-grpc-request header is passed in. For the above example, we will end up using the percentage value from config file as mentioned in the documentation.

Here is the control logic for it, let me know if the comment and code around it is not clear, I can reword it. I'll add a unit test to capture this scenario.

// if the abort fault is for grpc status, then use grpc fault request percentage.

// if the abort fault is for grpc status, then use grpc fault request percentage.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see 👍 Thanks for pointing this out - your are right, it works just fine as it is now. The only thing that I could be a little bit picky about here is the fact that we read the value x-envoy-fault-abort-grpc-request HTTP header twice and apparently (see this convo for more details) reading HTTP headers is not cheap performance wise.

Maybe we could read x-envoy-fault-abort-request and x-envoy-fault-grpc-request headers first and pass it to percentage method for it to determine which percentage should be used? Better yet, if we read x-envoy-fault-abort-request first and it's present we don't even need to check the value of x-envoy-fault-abort-grpc-request

Copy link
Contributor

@Augustyniak Augustyniak Apr 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this particular case it's not a blocker tho - just something to think about (maybe we can address it if it's easily fixable, if not I think that we can keep it the way it is)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Augustyniak, I couldn’t find a clean and easy way to avoid this check, so leaving it as is, but I’ve changed the precedence of http_status and grpc_status per recommendation.

@@ -18,21 +18,49 @@ TEST(FaultConfigTest, FaultAbortHeaderConfig) {
proto_config.mutable_header_abort();
FaultAbortConfig config(proto_config);

// No header.
EXPECT_EQ(absl::nullopt, config.httpStatusCode(nullptr));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cannot happen in prod. See my previous comments related to this case.

auto grpc_status = abortGrpcStatus(request_headers);

// If gRPC status code is set, then HTTP will be set to Http::Code::OK (200).
return grpc_status.has_value()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way it's implemented now we prioritize grpc faults over http faults. Not sure whether it's a requirement for your implementation but to me http faults should take a precedence over grpc faults since grpc faults are kind of build on top of http requests -> so say it differently, my intuition tell me that grpc faults can happen only if http request succeeds.

Let's say that I send
x-envoy-fault-abort-request: 500 and x-envoy-fault-abort-grpc-request: 14 HTTP headers as part of my request. In this case, we end up with a request that finishes with a 200 HTTP status code and 14 grpc status code. I wouldn't not expect to receive 200 HTTP status code if I send x-envoy-fault-abort-request: 500 HTTP header.

What would you say to the idea of changing the order here? Check for the HTTP status code abort header first and look into grpc status code abort header only if the http status code abort header is not present.

Going back to my example, if we change the order and send x-envoy-fault-abort-request: 500 and x-envoy-fault-abort-grpc-request: 14 HTTP headers. We end up with a request that fails with 500 HTTP status code, it feels OK that we didn't receive 14 grpc status code since the underlying HTTP request failed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed the precedence order favoring http_status over grpc_status headers. And more importantly, updated the documentation on what to expect during such conflicts.

Http::TestRequestHeaderMapImpl good_headers{{"x-envoy-fault-abort-grpc-request", "5"}};
EXPECT_EQ(Grpc::Status::NotFound, config.grpcStatusCode(&good_headers));

// Valid header - unknown gRPC code (>0).
Copy link
Contributor

@Augustyniak Augustyniak Apr 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment doesn't seem right

@vijayrajput1
Copy link
Contributor Author

Thank you for addressing previous comments! It looks good but I left a few questions regarding cases for when we have http and grpc abort faults defined for a given request.

Thank you @Augustyniak for the good feedback - appreciated! 🙏

Regarding, http_status vs grpc_status precedence when both headers are present - we can surely reverse the order to use grpc_status value only when http_status is not present and/or we can explicitly call out in the documentation to use only one of those header to make it similar to how its enforced in proto's via oneof clause.

1. If both x-envoy-fault-abort-grpc-request and x-envoy-fault-abort-request headers are set, prefer x-envoy-fault-abort-request.

2. Remove nullptr checks per feedback.

Signed-off-by: Vijay Rajput <vrajput@lyft.com>
Copy link
Contributor

@Augustyniak Augustyniak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left one comment, other than this I don't have any other comments in this PR. It looks great! Thank you for being so easy to work with. I'm going to approve once you respond to/address the comment I left.

Signed-off-by: Vijay Rajput <vrajput@lyft.com>
Copy link
Contributor

@Augustyniak Augustyniak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me now! One tiny style related comment.

Thank you for your active collaboration here - I think that it helped to make this PR better.

const Http::RequestHeaderMap* request_headers) const {
absl::optional<Http::Code> ret;
const auto header = request_headers->get(HeaderNames::get().AbortRequest);
absl::optional<Http::Code> ret = absl::nullopt;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that we need = absl::nullopt part here - I think that we should get rid of it if it's not needed.

Copy link
Contributor Author

@vijayrajput1 vijayrajput1 Apr 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep its not required, I put that in there to make it apparent. I had to read absl::optional default details to find out this, so I thought making it apparent might help readability. I can remove it if its something basic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's fine either way. I wouldn't worry about it unless we need another push.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, thanks!

@mattklein123 mattklein123 merged commit fd36665 into envoyproxy:master Apr 27, 2020
spenceral added a commit to spenceral/envoy that referenced this pull request Apr 27, 2020
Signed-off-by: Spencer Lewis <slewis@squareup.com>

* master:
  fault injection: add support for setting gRPC status (envoyproxy#10841)
  tests: tag tests that fail on Windows with fails_on_windows (envoyproxy#10940)
  Fix typo on Postgres Proxy documentation. (envoyproxy#10930)
  fuzz: improve header/data stop/continue modeling in HCM fuzzer. (envoyproxy#10931)
  gzip filter: allow setting zlib compressor's chunk size (envoyproxy#10508)
  http: replace vector/reserve with InlinedVector in codec helper (envoyproxy#10941)
  stats: add utilities to create stats from a vector of tokens, mixing dynamic and symbolic elements. (envoyproxy#10735)
  hcm: avoid invoking 100-continue handling on decode filter. (envoyproxy#10929)
  prometheus stats: Correctly group lines of the same metric name. (envoyproxy#10833)
  status: Fix ASAN error in Status payload handling (envoyproxy#10906)
  path: Fix merge slash for paths ending with slash and present query args (envoyproxy#10922)
  compressor filter: add benchmark (envoyproxy#10464)
  xray: expected_span_name is not captured by the lambda with MSVC (envoyproxy#10934)
  ci: update before purge in cleanup (envoyproxy#10938)
  tracer: Improve test coverage for x-ray (envoyproxy#10890)
  Revert "init: order dynamic resource initialization to make RTDS always be first (envoyproxy#10362)" (envoyproxy#10919)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fault Injection: Support gRPC status code in abort fault
4 participants