Envoy gets stuck "initializing secondary clusters" after a restart when service has zero endpoints #420

Closed
alexbrand opened this issue May 31, 2018 · 12 comments · Fixed by #425

@alexbrand
Contributor

While testing Gimbal, we discovered an issue where Contour/Envoy would get stuck initializing and wouldn't set up listeners or routes after we created Ingress or IngressRoute resources. Interestingly, this only seems to happen if Envoy starts up after Contour.

Initially we thought it was related to envoyproxy/envoy#2931, but we then realized that we could reproduce the issue with a single "bad service".

In Gimbal, we deploy Contour and Envoy as separate pods. This makes it easier to reproduce, as we can kill the Envoy pod directly. This is not possible in the default Contour deployment. However, I was able to reproduce the issue by killing the Envoy process directly through the admin console.

Steps to reproduce:

  1. Deploy contour:
kubectl apply -f https://j.hept.io/contour-deployment-rbac
  2. Create a service with zero endpoints (no pods match its selector, so its Endpoints object stays empty):
kubectl create svc clusterip bad-service --tcp=80
  3. Access the Envoy admin portal via a port-forward to 9001.
  4. Kill Envoy using the "/quitquitquit" button.

Once Envoy restarts, it gets stuck "initializing secondary clusters" (see Envoy logs).

Furthermore, listeners are not created for any existing Ingress resources. This can be seen in Envoy's admin portal at /listeners:

$ kubectl port-forward contour-6c6558b87b-2tntt 9001 &
[1] 1872
Forwarding from 127.0.0.1:9001 -> 9001
Forwarding from [::1]:9001 -> 9001

$ curl localhost:9001/listeners
Handling connection for 9001
[]

This issue gets resolved by deleting the service we created:

$ kubectl delete svc bad-service
service "bad-service" deleted
$ curl localhost:9001/listeners
Handling connection for 9001
["0.0.0.0:8080"]
@davecheney
Contributor

@alexbrand thanks for reporting this issue. Could you please attach:

  1. the envoy logs
  2. your bootstrap configuration, something like kc exec contour-cb97d6969-7cm24 -c envoy -ti cat /config/contour.yaml

@davecheney davecheney added the blocked/needs-info label Jun 1, 2018
@alexbrand
Contributor Author

Thanks @davecheney. Here are logs and config:

1. Logs

k logs contour-6c6558b87b-fkvfm envoy
[2018-06-01 12:34:31.971][1][info][main] source/server/server.cc:178] initializing epoch 0 (hot restart version=9.200.16384.127.options=capacity=16384, num_slots=8209 hash=228984379728933363)
[2018-06-01 12:34:31.976][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:127] cm init: initializing cds
[2018-06-01 12:34:31.976][1][info][config] source/server/configuration_impl.cc:52] loading 0 listener(s)
[2018-06-01 12:34:31.976][1][info][config] source/server/configuration_impl.cc:92] loading tracing configuration
[2018-06-01 12:34:31.976][1][info][config] source/server/configuration_impl.cc:119] loading stats sink configuration
[2018-06-01 12:34:31.977][1][info][main] source/server/server.cc:353] starting main dispatch loop
[2018-06-01 12:34:31.980][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/backend-nginx/80 during init
[2018-06-01 12:34:31.980][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/backend-nginx2/80 during init
[2018-06-01 12:34:31.980][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/kuard/80 during init
[2018-06-01 12:34:31.981][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/kubernetes/443 during init
[2018-06-01 12:34:31.981][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/kubernetes/https during init
[2018-06-01 12:34:31.981][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx/80 during init
[2018-06-01 12:34:31.982][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx10/80 during init
[2018-06-01 12:34:31.982][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx11/80 during init
[2018-06-01 12:34:31.982][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx12/80 during init
[2018-06-01 12:34:31.983][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx13/80 during init
[2018-06-01 12:34:31.983][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx14/80 during init
[2018-06-01 12:34:31.983][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx15/80 during init
[2018-06-01 12:34:31.984][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx16/80 during init
[2018-06-01 12:34:31.984][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx17/80 during init
[2018-06-01 12:34:31.984][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx18/80 during init
[2018-06-01 12:34:31.985][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx19/80 during init
[2018-06-01 12:34:31.985][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx2/80 during init
[2018-06-01 12:34:31.985][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx3/80 during init
[2018-06-01 12:34:31.986][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx4/80 during init
[2018-06-01 12:34:31.986][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx5/80 during init
[2018-06-01 12:34:31.987][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx6/80 during init
[2018-06-01 12:34:31.991][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx7/80 during init
[2018-06-01 12:34:31.991][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx8/80 during init
[2018-06-01 12:34:31.994][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx9/80 during init
[2018-06-01 12:34:31.995][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster heptio-contour/bad-service/80 during init
[2018-06-01 12:34:31.996][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster heptio-contour/contour/443 during init
[2018-06-01 12:34:31.996][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster heptio-contour/contour/80 during init
[2018-06-01 12:34:31.997][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster heptio-contour/contour/http during init
[2018-06-01 12:34:31.997][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster heptio-contour/contour/https during init
[2018-06-01 12:34:31.998][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster kube-system/kube-dns/53 during init
[2018-06-01 12:34:31.998][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster kube-system/kube-dns/dns-tcp during init
[2018-06-01 12:34:31.999][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster kube-system/kube-state-metrics/8080 during init
[2018-06-01 12:34:32.000][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster kube-system/kube-state-metrics/8081 during init
[2018-06-01 12:34:32.001][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster kube-system/kube-state-metrics/http-metrics during init
[2018-06-01 12:34:32.002][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster kube-system/kube-state-metrics/telemetry during init
[2018-06-01 12:34:32.003][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster kube-system/kubernetes-dashboard/80 during init
[2018-06-01 12:34:32.003][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:108] cm init: initializing secondary clusters

2. Config

dynamic_resources:
  lds_config:
    api_config_source:
      api_type: GRPC
      cluster_names: [contour]
      grpc_services:
      - envoy_grpc:
          cluster_name: contour
  cds_config:
    api_config_source:
      api_type: GRPC
      cluster_names: [contour]
      grpc_services:
      - envoy_grpc:
          cluster_name: contour
static_resources:
  clusters:
  - name: contour
    connect_timeout: { seconds: 5 }
    type: STRICT_DNS
    hosts:
    - socket_address:
        address: 127.0.0.1
        port_value: 8001
    lb_policy: ROUND_ROBIN
    http2_protocol_options: {}
    circuit_breakers:
      thresholds:
        - priority: high
          max_connections: 100000
          max_pending_requests: 100000
          max_requests: 60000000
          max_retries: 50
        - priority: default
          max_connections: 100000
          max_pending_requests: 100000
          max_requests: 60000000
          max_retries: 50
  - name: service_stats
    connect_timeout: 0.250s
    type: LOGICAL_DNS
    lb_policy: ROUND_ROBIN
    hosts:
      - socket_address:
          protocol: TCP
          address: 127.0.0.1
          port_value: 9001
admin:
  access_log_path: /dev/null
  address:
    socket_address:
      address: 127.0.0.1
      port_value: 9001

@davecheney
Contributor

Can you please attach the contents of 'bad-service'? Thanks.

@davecheney
Contributor

I think this is unrelated to bad-service. What I see when I /quitquitquit my envoy container (thanks for the repro) is this:

[2018-06-04 02:09:40.498][1][debug][upstream] source/common/upstream/logical_dns_cluster.cc:70] starting async DNS resolution for 127.0.0.1
[2018-06-04 02:09:40.498][1][debug][upstream] source/common/upstream/logical_dns_cluster.cc:78] async DNS resolution complete for 127.0.0.1
[2018-06-04 02:09:40.498][1][debug][upstream] source/common/upstream/upstream_impl.cc:670] starting async DNS resolution for 127.0.0.1
[2018-06-04 02:09:40.498][1][debug][upstream] source/common/upstream/upstream_impl.cc:677] async DNS resolution complete for 127.0.0.1

(once logging has been turned up to debug)

This makes me think that Envoy cannot resolve 127.0.0.1 as a DNS name when connecting to the contour cluster for EDS/CDS/etc. 127.0.0.1 isn't a DNS name, so I'm going to try twiddling with the various DNS lookup types.

@davecheney davecheney added this to the 0.6.0 milestone Jun 4, 2018
@davecheney davecheney added the kind/bug and priority/critical-urgent labels and removed the blocked/needs-info label Jun 4, 2018
@davecheney davecheney self-assigned this Jun 4, 2018
@davecheney
Contributor

Nope, it's not related to the use of the STRICT_DNS or LOGICAL_DNS service discovery types.

@davecheney
Contributor

OK, what I think is going on is that on restart Envoy is either not picking up the listeners or not making a connection to LDS.

@davecheney
Contributor

davecheney commented Jun 4, 2018

Here's an easier reproducer: swap the order of the envoy and contour containers so contour is (most likely) running before envoy.

diff --git a/deployment/deployment-grpc-v2/02-contour.yaml b/deployment/deployment-grpc-v2/02-contour.yaml
index 8791f33..2bf32cb 100644
--- a/deployment/deployment-grpc-v2/02-contour.yaml
+++ b/deployment/deployment-grpc-v2/02-contour.yaml
@@ -21,6 +21,11 @@ spec:
         prometheus.io/format: "prometheus"
     spec:
       containers:
+      - image: docker.io/davecheney/contour:latest
+        name: contour
+        imagePullPolicy: Always
+        command: ["contour"]
+        args: ["serve", "--incluster"]
       - image: docker.io/envoyproxy/envoy-alpine:v1.6.0
         name: envoy
         ports:
@@ -33,13 +38,8 @@ spec:
         volumeMounts:
         - name: contour-config
           mountPath: /config
-      - image: gcr.io/heptio-images/contour:master
-        imagePullPolicy: Always
-        name: contour
-        command: ["contour"]
-        args: ["serve", "--incluster"]
       initContainers:
-      - image: gcr.io/heptio-images/contour:master
+      - image: docker.io/davecheney/contour:latest
         imagePullPolicy: Always
         name: envoy-initconfig
         command: ["contour"]

This increases the chances that contour will already be running, so the connection during init will succeed, rather than envoy's init completing successfully and then launching the full EDS song and dance afterwards.

davecheney added a commit to davecheney/contour that referenced this issue Jun 4, 2018
Updates projectcontour#420

Prior to this PR the envoy container appeared first in the
deployment.spec so would likely start before the contour container. This
would cause envoy to fail to contact contour immediately upon startup
and then fall back to its retry behaviour.

With this change, contour appears first in the spec so will likely start
before envoy, which aggravates the failure mode reported in issue projectcontour#420.

Signed-off-by: Dave Cheney <dave@cheney.net>
davecheney added a commit to davecheney/contour that referenced this issue Jun 4, 2018
Updates projectcontour#420

If stats are not enabled, there is no point in creating a static stats
cluster as it will not be used and generates log spam.

Signed-off-by: Dave Cheney <dave@cheney.net>
@davecheney
Contributor

OK, feeling more confident that envoy fails to fetch listeners precisely when it gets a bunch of CDS config during init.

Looking at the backend stats I see:

listener_manager.lds.update_attempt: 0
listener_manager.lds.update_failure: 0
listener_manager.lds.update_rejected: 0
listener_manager.lds.update_success: 0
listener_manager.lds.version: 0
listener_manager.listener_added: 0
listener_manager.listener_create_failure: 0
listener_manager.listener_create_success: 0
listener_manager.listener_modified: 0
listener_manager.listener_removed: 0
listener_manager.total_listeners_active: 0
listener_manager.total_listeners_draining: 0
listener_manager.total_listeners_warming: 0

But in /clusters we do have clusters created, so something is failing to trigger envoy to talk LDS.

@davecheney
Contributor

Possibly a recurrence of envoyproxy/envoy#2931.

The symptoms are very similar: envoy doesn't even try to talk LDS.

@davecheney
Contributor

Raised envoyproxy/envoy#3530

davecheney added a commit to davecheney/contour that referenced this issue Jun 5, 2018
Fixes projectcontour#420

In projectcontour#350 I added the ability to filter the results of a discovery
request to address the scalability issues in sending the full copy of
the xDS caches to envoy on every request -- this was especially important
for EDS which would send an entire copy of the EDS cache to each
StreamEndpoints listener for every change in k8s endpoints.

As part of this change I recognised that there would be situations where
a notification that the cache had changed would result in a wakeup and
calls to Values for events which didn't match the filter. At the time I
mistakenly thought that if a wakeup was triggered that didn't match the
filter, this would cause zero-length responses to be sent to Envoy,
causing EDS entries (and others) to pop in and out of existence.

For example, if you were watching for endpoints matching "A", then an
update happened to endpoints matching "B", I believed that when the "A"
watcher ran it would receive an empty set of endpoints and send those
back to envoy, effectively erasing the active set on the Envoy side.

In fact what happens, and what still happens, is that a notification of
a change to the EDS cache wakes up all watchers; they run Values(),
passing in their filter. This returns whatever results match the filter,
which critically will be whatever is in the EDS cache at the time. If
there are several results for "A" and the EDS cache was updated for "B",
then both "A"'s and "B"'s watchers will be woken up, they'll both call
Values() and then send the set of endpoints back to Envoy. In that case
the change to "B" will be reflected and the change to "A" will be a no-op.

What I tried to avoid was the update to "B" causing the Values call for
"A" to run and find no endpoints, because I misunderstood that an update
to "B"'s endpoints would make it appear that "A"'s endpoint set was
empty. This was not the case, and will only be the case if "A"'s
endpoint set is actually empty.

In the case that "A"'s endpoint set was empty, the logic prior to
this PR would skip the notification and wait for the number of endpoints
to grow above zero. This _sort of_ made sense when CDS and EDS entries
would arrive after startup, but was fundamentally flawed. EDS entries
with zero values are a valid response and don't represent a transient
state of a filter not matching.

Where this bug arose is during startup: Envoy _may_ query EDS values
synchronously or asynchronously (depending on timing), and in the former
case Envoy will first look in CDS for the names of all clusters, then
query EDS for the endpoints of all the clusters previously mentioned. If
an EDS entry had zero endpoints, the notification would be suppressed
and thus Envoy's startup would stall.

The fix to this bug is simple: always send the response we get back from
Values. If it's empty, then that is because there are no endpoints
registered, regardless of the filter applied to trim out endpoints that
were not requested.

Again, as with projectcontour#350 there is no new test, other than the
existing tests which prove that existing functionality (so far as it is
covered) is unaffected. However manual testing, in the form of moving
contour first in the deployment descriptor (which triggers Envoy's
synchronous startup), is part of projectcontour#422, which should give
some coverage.

Signed-off-by: Dave Cheney <dave@cheney.net>
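
To make the mechanism in the commit message concrete, here is a minimal Go sketch of the watcher/filter pattern it describes. This is not Contour's actual implementation: the endpointCache type and its Register, Update and Values methods are hypothetical stand-ins; only the idea (a cache update wakes every watcher, each watcher filters the cache, and an empty filtered result is a valid answer that must still be sent) comes from the commit text above.

package main

import "fmt"

// endpointCache is a hypothetical stand-in for an EDS cache: cluster name ->
// endpoint addresses, plus a set of watcher channels poked on every update.
type endpointCache struct {
	entries  map[string][]string
	watchers []chan struct{}
}

// Register adds a watcher that is woken on every cache change, regardless of
// whether the change matches the watcher's filter.
func (c *endpointCache) Register(ch chan struct{}) {
	c.watchers = append(c.watchers, ch)
}

// Update replaces the endpoints for one cluster and wakes all watchers.
func (c *endpointCache) Update(cluster string, addrs []string) {
	c.entries[cluster] = addrs
	for _, ch := range c.watchers {
		select {
		case ch <- struct{}{}:
		default: // watcher already has a pending wakeup
		}
	}
}

// Values returns whatever is in the cache right now for clusters matching the
// filter. An empty result is legitimate: those clusters have no endpoints yet.
func (c *endpointCache) Values(filter func(name string) bool) map[string][]string {
	out := map[string][]string{}
	for name, addrs := range c.entries {
		if filter(name) {
			out[name] = addrs
		}
	}
	return out
}

func main() {
	// bad-service exists in the cache but has zero endpoints, as in the repro.
	cache := &endpointCache{entries: map[string][]string{
		"heptio-contour/bad-service/80": {},
	}}

	wake := make(chan struct{}, 1)
	cache.Register(wake)

	// An unrelated update ("B" in the commit's terminology) wakes every watcher.
	cache.Update("default/nginx/80", []string{"10.0.0.1:80"})

	<-wake
	// This watcher ("A") only cares about bad-service and finds zero endpoints.
	resp := cache.Values(func(name string) bool {
		return name == "heptio-contour/bad-service/80"
	})

	if len(resp["heptio-contour/bad-service/80"]) == 0 {
		// Pre-fix behaviour described in the issue: the empty response was
		// suppressed, so Envoy never heard back and its synchronous init stalled.
		fmt.Println("before the fix: empty response suppressed, Envoy keeps waiting")
	}

	// Post-fix behaviour: always send whatever Values returned, empty or not,
	// so Envoy can mark the cluster initialized and move on to LDS.
	fmt.Printf("after the fix: send %v\n", resp)
}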
@mattmoyer mattmoyer reopened this Jun 6, 2018
@davecheney
Contributor

@mattmoyer can you tell me why you reopened this, please?

@mattmoyer

Oops, I think that must have been a misclick.

sunjayBhatia added a commit that referenced this issue Jan 30, 2023
This change should put us in a better place once we start moving parts
of Contour and the Operator around.

We can lean more heavily on this test suite rather than the e2e test
suite since it should be faster to run/write etc.

This code probably duplicates some things from the e2e suite and things
from that suite could be brought into here to run faster and with fewer
flakes. Those should be future changes after this starting point.

- Moved from Ginkgo to regular go tests (no need for extra ginkgo
complexity yet)
- Ports existing tests
- Add additional test coverage
- Some small changes to how the controller logs the handling of namespace
deletion (it was confusing to see the namespace reported as deleted when it
wasn't actually, because remove-on-delete was false)

Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>