-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(ingress-gateway): support outlier detection of upstream service for ingress gateway #15614
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some suggestions for the docs
website/content/docs/connect/config-entries/ingress-gateway.mdx
Outdated
Show resolved
Hide resolved
website/content/docs/connect/config-entries/ingress-gateway.mdx
Outdated
Show resolved
Hide resolved
website/content/docs/connect/config-entries/ingress-gateway.mdx
Outdated
Show resolved
Hide resolved
124790c
to
a381e17
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! We should probably let someone from the core team chime in too just for good measure, but all good from my perspective.
Since this is technically a new feature, do we want to backport this?
agent/xds/clusters.go
Outdated
outlierDetection.Consecutive_5Xx = &wrappers.UInt32Value{Value: svc.PassiveHealthCheck.MaxFailures} | ||
} | ||
|
||
// if svc.PassiveHealthCheck.EnforcingConsecutive5xx != nil && *svc.PassiveHealthCheck.EnforcingConsecutive5xx != 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we delete this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
agent/xds/clusters.go
Outdated
outlierDetection := ToOutlierDetection(cfgSnap.IngressGateway.Defaults.PassiveHealthCheck) | ||
|
||
// override the default outlier detection value | ||
if svc.PassiveHealthCheck != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it would simplify things to either update ToOutlierDetection
or make a new function that takes a variable number of *structs.PassiveHealthCheck
s and peeredService
. If we do that, I believe we can share this logic everywhere in this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Add an override
parameter to ToOutlierDetection
.
outlierDetection.MaxEjectionPercent = &wrappers.UInt32Value{Value: c.OutlierDetection.MaxEjectionPercent.Value} | ||
} | ||
|
||
c.OutlierDetection = outlierDetection |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think always setting outlier detection will end up being a breaking change and is causing golden tests like this one to change. Is that intended?
If so, we should make this a breaking change in the changelog and we can't backport it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After second thought, I think adding empty outlier_detection make it consistent with other places, like connect-proxy, mesh-gateway
consul/agent/xds/testdata/clusters/connect-proxy-with-tls-outgoing-cipher-suites.latest.golden
Lines 21 to 24 in 4a32070
"outlierDetection": { | |
}, | |
"commonLbConfig": { |
Lines 17 to 19 in 4a32070
"outlierDetection": { | |
} |
So I added a note of this breaking change to the changelog.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is going to be a breaking change, could you remove the backport labels from this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, backport labels are removed.
entry.Listeners[0].Services[0].MaxConnections = 4096 | ||
entry.Listeners[0].Services[0].MaxPendingRequests = 2048 | ||
entry.Listeners[0].Services[0].PassiveHealthCheck = &structs.PassiveHealthCheck{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be safer to add the tests for outlier detection to new cases so it's easier to ensure the existing cases don't change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. Thanks.
b34a3f2
to
4d13b4e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@erichaberkorn , thanks for your review. PR is updated. Please take another look.
entry.Listeners[0].Services[0].MaxConnections = 4096 | ||
entry.Listeners[0].Services[0].MaxPendingRequests = 2048 | ||
entry.Listeners[0].Services[0].PassiveHealthCheck = &structs.PassiveHealthCheck{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. Thanks.
agent/xds/clusters.go
Outdated
outlierDetection := ToOutlierDetection(cfgSnap.IngressGateway.Defaults.PassiveHealthCheck) | ||
|
||
// override the default outlier detection value | ||
if svc.PassiveHealthCheck != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Add an override
parameter to ToOutlierDetection
.
outlierDetection.MaxEjectionPercent = &wrappers.UInt32Value{Value: c.OutlierDetection.MaxEjectionPercent.Value} | ||
} | ||
|
||
c.OutlierDetection = outlierDetection |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After second thought, I think adding empty outlier_detection make it consistent with other places, like connect-proxy, mesh-gateway
consul/agent/xds/testdata/clusters/connect-proxy-with-tls-outgoing-cipher-suites.latest.golden
Lines 21 to 24 in 4a32070
"outlierDetection": { | |
}, | |
"commonLbConfig": { |
Lines 17 to 19 in 4a32070
"outlierDetection": { | |
} |
So I added a note of this breaking change to the changelog.
@@ -544,9 +546,9 @@ message UpstreamLimits { | |||
message PassiveHealthCheck { | |||
// mog: func-to=structs.DurationFromProto func-from=structs.DurationToProto | |||
google.protobuf.Duration Interval = 1; | |||
uint32 MaxFailures = 2; | |||
optional uint32 MaxFailures = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did these become optional
? https://github.com/hashicorp/consul/pull/15614/files#diff-9d2e9c8b34664ef85c83b6c6f4ec7f939cce7a64f41c267e5db248c1f95b0571R549-R551
Do you know if this is a breaking change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, this is not a breaking change.
Optional makes mog
know if MaxFailures
is defined or not. Otherwise, unspecified MaxFailures
will become "0", which is a wrong value. In the PR, I added a test for unspecified EnforcingConsecutive5xx
, which is the same type as MaxFailures
, to verify the default value is "null" for envoy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@erichaberkorn , there is a test that asserts the maxFailure
in upstream config
[ "$(echo $CLUSTER_CONFIG | jq --raw-output '.outlier_detection.consecutive_5xx')" = "4" ] | |
[ "$(echo $CLUSTER_CONFIG | jq --raw-output '.outlier_detection.interval')" = "22s" ] | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another test is added to verify unspecified MaxFailure
in upstreams.config.passive_health_check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm talking about a protobuf compatibility breaking change. I'm pretty sure we would need to introduce a completely different field on the protobuf and deprecate the current field to get this working. Then we could remove the current field in a follow-up release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@erichaberkorn , reverted the change to message PassiveHealthCheck
by adding the constraint of EnforcingConsecutive5xx > 0
. WDYT?
ab1a015
to
ec19956
Compare
outlierDetection := ToOutlierDetection(cfgSnap.IngressGateway.Defaults.PassiveHealthCheck, override) | ||
|
||
// Specail handling for failover peering service, which has set MaxEjectionPercent | ||
if c.OutlierDetection != nil && c.OutlierDetection.MaxEjectionPercent != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for yet another comment. I just noticed that we aren't setting panic threshold here similar to other clusters. For the sake of consistency, I'm guessing we want to set it like we do on other cluster types (because panic threshold interacts with outlier detection).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I am little confused since this has been set for the upstream clusters of the ingress gateway in
Line 1147 in cd53120
func (s *ResourceGenerator) makeUpstreamClustersForDiscoveryChain( |
Lines 789 to 796 in cd53120
upstreamClusters, err := s.makeUpstreamClustersForDiscoveryChain( | |
uid, | |
&u, | |
chain, | |
cfgSnap, | |
false, | |
) | |
if err != nil { |
agent/xds/config.go
Outdated
} | ||
|
||
// NOTE: EnforcingConsecutive5xx must be great than 0 | ||
if p.EnforcingConsecutive5xx != nil && *p.EnforcingConsecutive5xx > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this ends up getting rid of the ability for users to set EnforcingConsecutive5xx
to zero which is a breaking change for other proxy types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some logic to handle the legacy scenarios. Please take another look. Thanks.
b695296
to
5d6d9a4
Compare
Co-authored-by: Eric Haberkorn <erichaberkorn@gmail.com>
…for ingress gateway (#15614) * feat(ingress-gateway): support outlier detection of upstream service for ingress gateway * changelog Co-authored-by: Eric Haberkorn <erichaberkorn@gmail.com>
…for ingress gateway (#15614) * feat(ingress-gateway): support outlier detection of upstream service for ingress gateway * changelog Co-authored-by: Eric Haberkorn <erichaberkorn@gmail.com>
'Passive health checks remove hosts from the upstream cluster that are unreachable or that return errors.', | ||
children: [ | ||
{ | ||
name: 'interval', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@huikang This should be Interval
and not interval. Same for the Line 1243.
"The time in nanosecond between checks. Each check will cause hosts which have exceeded `max_failures` to be removed from the load balancer, and any hosts which have passed their ejection time to be returned to the load balancer. If not specified, it uses the default value. For example, 10s for Envoy proxy.", | ||
}, | ||
{ | ||
name: 'max_failures', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@huikang The name should be MaxFailures. Same for the Line 1247.
'The number of consecutive failures that cause a host to be removed from the upstream cluster. If not specified, Consul uses the proxy\'s default value. For example, `5` for Envoy proxy.', | ||
}, | ||
{ | ||
name: 'enforcing_consecutive_5xx', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@huikang The name field should be EnforcingConsecutive5xx. Same for the Line 1251.
@huikang Any idea in from which consul version this feature will be available. Since the backport to 1.14 and 1.13 has been removed. |
Description
Support outlier detection of upstream service for ingress gateway.
Consul has supported adding outlier detection to envoy proxy for the connect-proxy services. This PR adds the similar function to ingress gateway proxy.
Example:
generated envoy config
Testing & Reproduction steps
Links
Fix #15401
Please be mindful not to leak any customer or confidential information. HashiCorp employees may want to use our internal URL shortener to obfuscate links.
PR Checklist