Envoy gets stuck "initializing secondary clusters" after a restart when service has zero endpoints #420
@alexbrand thanks for reporting this issue, could you please attach
Thanks @davecheney. Here are logs and config:
1. Logs
2. Config
Can you please attach the contents of 'badservice'? Thanks.
I think this is unrelated to
(once logging has been turned up to debug). This makes me think that contour cannot resolve the DNS name for 127.0.0.1 to connect to the contour cluster for EDS/CDS/etc. 127.0.0.1 isn't a DNS name, so I'm going to try twiddling with the various DNS lookup types.
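For reference, an xDS cluster pointing at 127.0.0.1 is normally declared static rather than DNS-resolved, so no DNS lookup type should come into play. This fragment is an illustration using current Envoy v3 field names, not the bootstrap config from this issue; the port is an assumption:

```yaml
# Illustrative Envoy bootstrap fragment (v3 API, not the config used here).
# A STATIC cluster takes the IP literal directly; no DNS resolution occurs.
static_resources:
  clusters:
  - name: contour
    type: STATIC            # 127.0.0.1 is an IP literal, not a DNS name
    load_assignment:
      cluster_name: contour
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: 127.0.0.1, port_value: 8001 }  # assumed xDS port
```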
Nope, it's not related to the use of
OK, what I think is going on is that on restart, envoy is either not picking up the listeners or not making a connection to LDS.
Here's an easier reproducer: swap the order of the envoy and contour containers so contour is running (most likely) before envoy.
This increases the chances that contour will be running so the connection during init will succeed, rather than envoy's init completing successfully and then launching the full EDS song and dance.
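The container-order swap described above can be sketched as a fragment of the deployment spec; the container names and images here are illustrative placeholders, not the exact manifest from the repo:

```yaml
# Hypothetical fragment of the deployment spec. Kubernetes starts containers
# in list order, so putting contour first makes it likely the xDS server is
# already up when envoy boots and begins its synchronous init.
spec:
  containers:
  - name: contour          # xDS server; starts first in this ordering
    image: projectcontour/contour
  - name: envoy            # connects to contour for CDS/EDS/LDS on startup
    image: envoyproxy/envoy
```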
Updates projectcontour#420 Prior to this PR the envoy container appeared first in the deployment.spec so would likely start before the contour container. This would cause envoy to fail to contact contour immediately upon startup, after which it would fall back to a retry behaviour. With this change, contour appears first in the spec so will likely start before envoy, which aggravates the failure mode reported in issue projectcontour#420. Signed-off-by: Dave Cheney <dave@cheney.net>
Updates projectcontour#420 If stats are not enabled, there is no point in creating a static stats cluster as it will not be used and generates log spam. Signed-off-by: Dave Cheney <dave@cheney.net>
OK, feeling more confident that envoy isn't fetching listeners iff it gets a bunch of CDS config during init. Looking in the backend stats I see
But in /clusters we do have clusters created, so something is failing to trigger envoy to talk LDS.
Possibly a recurrence of envoyproxy/envoy#2931. The symptoms are very similar: envoy doesn't even try to talk LDS.
Raised envoyproxy/envoy#3530
Fixes projectcontour#420 In projectcontour#350 I added the ability to filter the results of a discovery request to address the scalability issues in sending the full copy of the xDS caches to envoy on every request -- this was especially important for EDS, which would send an entire copy of the EDS cache to each StreamEndpoints listener for every change in k8s endpoints. As part of this change I recognised that there would be situations where a notification that the cache had changed would result in a wakeup and calls to Values for events which didn't match the filter. At the time I mistakenly thought that a wakeup which didn't match the filter would cause zero-length responses to be sent to Envoy, causing EDS entries (and others) to pop in and out of existence. For example, if you were watching for endpoints matching "A", and an update happened to endpoints matching "B", I believed that when the "A" watcher ran it would receive an empty set of endpoints and send those back to envoy, effectively erasing the active set on the Envoy side. In fact what happens, and what still happens, is that a notification of a change to the EDS cache wakes up all watchers; they run Values(), passing in their filter. This returns whatever results match the filter, which critically will be whatever is in the EDS cache at the time. If there are several results for "A" and the EDS cache was updated for "B", then both "A" and "B"'s watchers will be woken up, they'll both call Values() and then send the set of endpoints back to Envoy. In that case the change to "B" will be reflected and the change to "A" will be a no-op. What I tried to avoid was the update to "B" causing the Values call for "A" to run and find no endpoints, because I misunderstood that an update to "B"'s endpoints would make it appear that "A"'s endpoint set was empty. This was not the case, and will only be the case if "A"'s endpoint set is actually empty.
In the case that "A"'s endpoint set was empty, the logic prior to this PR would skip the notification and wait for the number of endpoints to grow above zero. This _sort of_ made sense when CDS and EDS entries would arrive after startup, but was fundamentally flawed. EDS entries with zero values are a valid response and don't represent a transient state of a filter not matching. Where this bug arose is during startup: Envoy _may_ query EDS values synchronously or asynchronously (depending on timing), and in the synchronous case Envoy will first look in CDS for the names of all clusters, then query EDS for the endpoints of all the clusters previously mentioned. If an EDS entry had zero endpoints, the notification would be suppressed and thus Envoy's startup would stall. The fix to this bug is simple: always send the response we get back from Values. If it's empty, then that is because there are no endpoints registered, regardless of the filter applied to trim out endpoints that were not requested. Again, as with projectcontour#350 there is no new test, other than the existing tests which prove that existing functionality (so far as it is covered) is unaffected. However manual testing, in the form of moving contour first in the deployment descriptor, which triggers Envoy's synchronous startup, is part of projectcontour#422, which should give some coverage. Signed-off-by: Dave Cheney <dave@cheney.net>
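The watcher logic described in the commit message can be sketched in Go. This is a hypothetical simplification, not Contour's actual code: `endpointCache`, `values`, and `notify` are invented names standing in for the real EDS cache, the filtered Values() call, and the watcher wakeup:

```go
package main

import "fmt"

// endpointCache is a hypothetical sketch of the EDS cache described above,
// not Contour's actual implementation: cluster name -> endpoint addresses.
type endpointCache map[string][]string

// values mimics the filtered Values() call: it returns only the endpoints
// for the requested cluster names. An empty set is a valid answer.
func (c endpointCache) values(filter []string) map[string][]string {
	out := make(map[string][]string, len(filter))
	for _, name := range filter {
		out[name] = c[name]
	}
	return out
}

// notify sketches a watcher wakeup. The pre-fix logic (skipEmpty == true)
// suppressed empty responses, which stalled Envoy's synchronous startup when
// a cluster legitimately had zero endpoints. The fix (skipEmpty == false) is
// to always send whatever values returns.
func notify(c endpointCache, cluster string, skipEmpty bool) (sent bool) {
	eps := c.values([]string{cluster})[cluster]
	if skipEmpty && len(eps) == 0 {
		return false // notification swallowed; Envoy never gets its EDS answer
	}
	fmt.Printf("send %s: %v\n", cluster, eps)
	return true
}

func main() {
	cache := endpointCache{
		"good": {"10.0.0.1:80"},
		"bad":  {}, // the "bad service": zero endpoints
	}
	notify(cache, "bad", true)  // buggy pre-fix path: response suppressed
	notify(cache, "bad", false) // fixed path: empty-but-valid response is sent
}
```

The key point the sketch illustrates: an empty result from the filtered lookup means the cluster genuinely has no endpoints, so suppressing it leaves Envoy waiting forever during synchronous init.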
@mattmoyer can you tell me why you reopened this, please?
Oops, I think that must have been a misclick. |
This change should put us in a better place once we start moving parts of Contour and the Operator around. We can lean more heavily on this test suite rather than the e2e test suite since it should be faster to run and write. This code probably duplicates some things from the e2e suite, and things from that suite could be brought in here to run faster and with fewer flakes. Those should be future changes after this starting point.
- Moved from Ginkgo to regular Go tests (no need for extra Ginkgo complexity yet)
- Ports existing tests
- Adds additional test coverage
- Some small changes to how the controller logs its handling of namespace deletion (it was confusing to see a log saying the namespace was deleted when it wasn't, because remove-on-delete was false)
Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
While testing Gimbal, we discovered an issue where Contour/Envoy would get stuck initializing, and it wouldn't set up listeners nor routes after creating ingress or ingressroute resources. Interestingly, this only seems to happen if Envoy starts up after Contour.
Initially we thought it was related to #2931, but we then realized that we could reproduce the issue with a single "bad service".
In Gimbal, we deploy Contour and Envoy as separate pods. This makes it easier to reproduce, as we can kill the Envoy pod directly. This is not possible in the default Contour deployment. However, I was able to reproduce the issue by killing the Envoy process directly through the admin console.
Steps to reproduce:
Once Envoy restarts, it gets stuck "initializing secondary clusters" (see Envoy logs).
Furthermore, listeners do not get created for Ingress resources, if there are any. This can be seen in Envoy's admin portal at /listeners.
This issue gets resolved by deleting the service we created:
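For context, a "bad service" of this kind can be sketched as a Kubernetes Service whose selector matches no pods, so its Endpoints object has zero addresses. The selector and ports below are illustrative assumptions, not the exact manifest from the report; only the name 'badservice' comes from this thread:

```yaml
# Hypothetical reproduction manifest: a Service whose selector matches no
# running pods, so it has zero endpoints. Routing an Ingress to it gives
# Contour a cluster whose EDS entry is empty.
apiVersion: v1
kind: Service
metadata:
  name: badservice
spec:
  selector:
    app: does-not-exist   # assumption: no pod carries this label
  ports:
  - port: 80
    targetPort: 8080
```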