Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ingress-gateway): support outlier detection of upstream service for ingress gateway #15614

Merged
merged 16 commits into from
Dec 13, 2022

Conversation

huikang
Copy link
Collaborator

@huikang huikang commented Nov 30, 2022

Description

Support outlier detection of upstream service for ingress gateway.
Consul has supported adding outlier detection to envoy proxy for the connect-proxy services. This PR adds the similar function to ingress gateway proxy.

Example:

Kind = "ingress-gateway"
Name = "ingress-gateway"
Defaults {
  PassiveHealthCheck {
    max_failures = 10   // defaults for all upstream services
  }
Listeners = [
 {
   Port = 18080
   Services = [
     {
        Name = "backend"
        PassiveHealthCheck {
          max_failures = 20   // override the defaults
        }
     },
   ]
 },
}

generated envoy config

      "outlier_detection": {
       "consecutive_5xx": 20,
      },

Testing & Reproduction steps

  • An example is added to the integration test

Links

Fix #15401

Please be mindful not to leak any customer or confidential information. HashiCorp employees may want to use our internal URL shortener to obfuscate links.

PR Checklist

  • updated test coverage
  • external facing docs updated
  • not a security concern

@huikang huikang requested a review from a team as a code owner November 30, 2022 16:07
@github-actions github-actions bot added theme/api Relating to the HTTP API interface theme/envoy/xds Related to Envoy support type/docs Documentation needs to be created/updated/clarified labels Nov 30, 2022
Copy link
Contributor

@trujillo-adam trujillo-adam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some suggestions for the docs

Copy link
Contributor

@eculver eculver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! We should probably let someone from the core team chime in too just for good measure, but all good from my perspective.

Since this is technically a new feature, do we want to backport this?

@huikang
Copy link
Collaborator Author

huikang commented Dec 2, 2022

Since this is technically a new feature, do we want to backport this?

@eculver , thanks for your comments. I think we have several patch releases (e.g., 1.14.2) that contains new features for envoy proxy. So I think backport should be fine.

outlierDetection.Consecutive_5Xx = &wrappers.UInt32Value{Value: svc.PassiveHealthCheck.MaxFailures}
}

// if svc.PassiveHealthCheck.EnforcingConsecutive5xx != nil && *svc.PassiveHealthCheck.EnforcingConsecutive5xx != 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we delete this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

outlierDetection := ToOutlierDetection(cfgSnap.IngressGateway.Defaults.PassiveHealthCheck)

// override the default outlier detection value
if svc.PassiveHealthCheck != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would simplify things to either update ToOutlierDetection or make a new function that takes a variable number of *structs.PassiveHealthChecks and peeredService. If we do that, I believe we can share this logic everywhere in this file.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Add an override parameter to ToOutlierDetection.

outlierDetection.MaxEjectionPercent = &wrappers.UInt32Value{Value: c.OutlierDetection.MaxEjectionPercent.Value}
}

c.OutlierDetection = outlierDetection
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think always setting outlier detection will end up being a breaking change and is causing golden tests like this one to change. Is that intended?

If so, we should make this a breaking change in the changelog and we can't backport it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After second thought, I think adding empty outlier_detection make it consistent with other places, like connect-proxy, mesh-gateway

So I added a note of this breaking change to the changelog.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is going to be a breaking change, could you remove the backport labels from this PR?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, backport labels are removed.

entry.Listeners[0].Services[0].MaxConnections = 4096
entry.Listeners[0].Services[0].MaxPendingRequests = 2048
entry.Listeners[0].Services[0].PassiveHealthCheck = &structs.PassiveHealthCheck{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be safer to add the tests for outlier detection to new cases so it's easier to ensure the existing cases don't change.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. Thanks.

@huikang huikang force-pushed the gh-15401-passive-health-check-ingress branch from b34a3f2 to 4d13b4e Compare December 9, 2022 15:30
Copy link
Collaborator Author

@huikang huikang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@erichaberkorn , thanks for your review. PR is updated. Please take another look.

entry.Listeners[0].Services[0].MaxConnections = 4096
entry.Listeners[0].Services[0].MaxPendingRequests = 2048
entry.Listeners[0].Services[0].PassiveHealthCheck = &structs.PassiveHealthCheck{
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. Thanks.

outlierDetection := ToOutlierDetection(cfgSnap.IngressGateway.Defaults.PassiveHealthCheck)

// override the default outlier detection value
if svc.PassiveHealthCheck != nil {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Add an override parameter to ToOutlierDetection.

outlierDetection.MaxEjectionPercent = &wrappers.UInt32Value{Value: c.OutlierDetection.MaxEjectionPercent.Value}
}

c.OutlierDetection = outlierDetection
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After second thought, I think adding empty outlier_detection make it consistent with other places, like connect-proxy, mesh-gateway

So I added a note of this breaking change to the changelog.

@@ -544,9 +546,9 @@ message UpstreamLimits {
message PassiveHealthCheck {
// mog: func-to=structs.DurationFromProto func-from=structs.DurationToProto
google.protobuf.Duration Interval = 1;
uint32 MaxFailures = 2;
optional uint32 MaxFailures = 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, this is not a breaking change.

Optional makes mog know if MaxFailures is defined or not. Otherwise, unspecified MaxFailures will become "0", which is a wrong value. In the PR, I added a test for unspecified EnforcingConsecutive5xx, which is the same type as MaxFailures, to verify the default value is "null" for envoy.

https://github.com/hashicorp/consul/pull/15614/files#diff-a44e2ceb5ced90fcc9be4d9518bb62626d62438751d9cc67286de9df15d15b54R72

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@erichaberkorn , there is a test that asserts the maxFailure in upstream config

[ "$(echo $CLUSTER_CONFIG | jq --raw-output '.outlier_detection.consecutive_5xx')" = "4" ]
[ "$(echo $CLUSTER_CONFIG | jq --raw-output '.outlier_detection.interval')" = "22s" ]
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another test is added to verify unspecified MaxFailure in upstreams.config.passive_health_check

30af125

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm talking about a protobuf compatibility breaking change. I'm pretty sure we would need to introduce a completely different field on the protobuf and deprecate the current field to get this working. Then we could remove the current field in a follow-up release.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@erichaberkorn , reverted the change to message PassiveHealthCheck by adding the constraint of EnforcingConsecutive5xx > 0. WDYT?

@huikang huikang force-pushed the gh-15401-passive-health-check-ingress branch from ab1a015 to ec19956 Compare December 9, 2022 23:46
outlierDetection := ToOutlierDetection(cfgSnap.IngressGateway.Defaults.PassiveHealthCheck, override)

// Specail handling for failover peering service, which has set MaxEjectionPercent
if c.OutlierDetection != nil && c.OutlierDetection.MaxEjectionPercent != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for yet another comment. I just noticed that we aren't setting panic threshold here similar to other clusters. For the sake of consistency, I'm guessing we want to set it like we do on other cluster types (because panic threshold interacts with outlier detection).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I am little confused since this has been set for the upstream clusters of the ingress gateway in

func (s *ResourceGenerator) makeUpstreamClustersForDiscoveryChain(
, which is called in making upstream clusters for the ingress-gateway service
upstreamClusters, err := s.makeUpstreamClustersForDiscoveryChain(
uid,
&u,
chain,
cfgSnap,
false,
)
if err != nil {

}

// NOTE: EnforcingConsecutive5xx must be great than 0
if p.EnforcingConsecutive5xx != nil && *p.EnforcingConsecutive5xx > 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this ends up getting rid of the ability for users to set EnforcingConsecutive5xx to zero which is a breaking change for other proxy types.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some logic to handle the legacy scenarios. Please take another look. Thanks.

@huikang huikang force-pushed the gh-15401-passive-health-check-ingress branch from b695296 to 5d6d9a4 Compare December 12, 2022 20:07
agent/xds/config.go Outdated Show resolved Hide resolved
agent/xds/config.go Outdated Show resolved Hide resolved
Co-authored-by: Eric Haberkorn <erichaberkorn@gmail.com>
@huikang huikang merged commit 04bf24c into main Dec 13, 2022
@huikang huikang deleted the gh-15401-passive-health-check-ingress branch December 13, 2022 16:51
jmurret pushed a commit that referenced this pull request Dec 13, 2022
…for ingress gateway (#15614)

* feat(ingress-gateway): support outlier detection of upstream service for ingress gateway

* changelog

Co-authored-by: Eric Haberkorn <erichaberkorn@gmail.com>
t-eckert pushed a commit that referenced this pull request Dec 15, 2022
…for ingress gateway (#15614)

* feat(ingress-gateway): support outlier detection of upstream service for ingress gateway

* changelog

Co-authored-by: Eric Haberkorn <erichaberkorn@gmail.com>
'Passive health checks remove hosts from the upstream cluster that are unreachable or that return errors.',
children: [
{
name: 'interval',
Copy link

@shweshi shweshi Jan 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@huikang This should be Interval and not interval. Same for the Line 1243.

"The time in nanosecond between checks. Each check will cause hosts which have exceeded `max_failures` to be removed from the load balancer, and any hosts which have passed their ejection time to be returned to the load balancer. If not specified, it uses the default value. For example, 10s for Envoy proxy.",
},
{
name: 'max_failures',
Copy link

@shweshi shweshi Jan 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@huikang The name should be MaxFailures. Same for the Line 1247.

'The number of consecutive failures that cause a host to be removed from the upstream cluster. If not specified, Consul uses the proxy\'s default value. For example, `5` for Envoy proxy.',
},
{
name: 'enforcing_consecutive_5xx',
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@huikang The name field should be EnforcingConsecutive5xx. Same for the Line 1251.

@shweshi
Copy link

shweshi commented Jan 5, 2023

@huikang Any idea in from which consul version this feature will be available. Since the backport to 1.14 and 1.13 has been removed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
theme/api Relating to the HTTP API interface theme/envoy/xds Related to Envoy support type/docs Documentation needs to be created/updated/clarified
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support to set circuit breaker config and outlier detection config for Ingress Gateway
6 participants