Envoy gets stuck "initializing secondary clusters" after a restart when service has zero endpoints #420

Closed
alexbrand opened this issue May 31, 2018 · 12 comments · Fixed by #425

@alexbrand
Contributor

While testing Gimbal, we discovered an issue where Contour/Envoy would get stuck initializing and wouldn't set up listeners or routes after we created Ingress or IngressRoute resources. Interestingly, this only seems to happen if Envoy starts up after Contour.

Initially we thought it was related to envoyproxy/envoy#2931, but we then realized that we could reproduce the issue with a single "bad service".

In Gimbal, we deploy Contour and Envoy as separate pods. This makes it easier to reproduce, as we can kill the Envoy pod directly. This is not possible in the default Contour deployment. However, I was able to reproduce the issue by killing the Envoy process directly through the admin console.

Steps to reproduce:

  1. Deploy contour:
kubectl apply -f https://j.hept.io/contour-deployment-rbac
  2. Create a service with zero endpoints (no pods match its selector, so its Endpoints object stays empty):
kubectl create svc clusterip bad-service --tcp=80
  3. Access the Envoy admin portal via a port-forward to 9001.
  4. Kill Envoy using the "/quitquitquit" button.

Once Envoy restarts, it gets stuck "initializing secondary clusters" (see Envoy logs).

Furthermore, listeners are not created for any existing Ingress resources. This can be seen in Envoy's admin portal at /listeners:

$ kubectl port-forward contour-6c6558b87b-2tntt 9001 &
[1] 1872
Forwarding from 127.0.0.1:9001 -> 9001
Forwarding from [::1]:9001 -> 9001

$ curl localhost:9001/listeners
Handling connection for 9001
[]

This issue gets resolved by deleting the service we created:

$ kubectl delete svc bad-service
service "bad-service" deleted
$ curl localhost:9001/listeners
Handling connection for 9001
["0.0.0.0:8080"]
@davecheney
Contributor

@alexbrand thanks for reporting this issue. Could you please attach:

  1. the envoy logs
  2. your bootstrap configuration, something like kc exec contour-cb97d6969-7cm24 -c envoy -ti cat /config/contour.yaml

@davecheney davecheney added the blocked/needs-info label Jun 1, 2018
@alexbrand
Contributor Author

Thanks @davecheney. Here are logs and config:

1. Logs

k logs contour-6c6558b87b-fkvfm envoy
[2018-06-01 12:34:31.971][1][info][main] source/server/server.cc:178] initializing epoch 0 (hot restart version=9.200.16384.127.options=capacity=16384, num_slots=8209 hash=228984379728933363)
[2018-06-01 12:34:31.976][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:127] cm init: initializing cds
[2018-06-01 12:34:31.976][1][info][config] source/server/configuration_impl.cc:52] loading 0 listener(s)
[2018-06-01 12:34:31.976][1][info][config] source/server/configuration_impl.cc:92] loading tracing configuration
[2018-06-01 12:34:31.976][1][info][config] source/server/configuration_impl.cc:119] loading stats sink configuration
[2018-06-01 12:34:31.977][1][info][main] source/server/server.cc:353] starting main dispatch loop
[2018-06-01 12:34:31.980][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/backend-nginx/80 during init
[2018-06-01 12:34:31.980][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/backend-nginx2/80 during init
[2018-06-01 12:34:31.980][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/kuard/80 during init
[2018-06-01 12:34:31.981][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/kubernetes/443 during init
[2018-06-01 12:34:31.981][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/kubernetes/https during init
[2018-06-01 12:34:31.981][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx/80 during init
[2018-06-01 12:34:31.982][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx10/80 during init
[2018-06-01 12:34:31.982][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx11/80 during init
[2018-06-01 12:34:31.982][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx12/80 during init
[2018-06-01 12:34:31.983][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx13/80 during init
[2018-06-01 12:34:31.983][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx14/80 during init
[2018-06-01 12:34:31.983][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx15/80 during init
[2018-06-01 12:34:31.984][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx16/80 during init
[2018-06-01 12:34:31.984][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx17/80 during init
[2018-06-01 12:34:31.984][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx18/80 during init
[2018-06-01 12:34:31.985][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx19/80 during init
[2018-06-01 12:34:31.985][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx2/80 during init
[2018-06-01 12:34:31.985][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx3/80 during init
[2018-06-01 12:34:31.986][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx4/80 during init
[2018-06-01 12:34:31.986][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx5/80 during init
[2018-06-01 12:34:31.987][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx6/80 during init
[2018-06-01 12:34:31.991][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx7/80 during init
[2018-06-01 12:34:31.991][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx8/80 during init
[2018-06-01 12:34:31.994][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster default/nginx9/80 during init
[2018-06-01 12:34:31.995][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster heptio-contour/bad-service/80 during init
[2018-06-01 12:34:31.996][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster heptio-contour/contour/443 during init
[2018-06-01 12:34:31.996][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster heptio-contour/contour/80 during init
[2018-06-01 12:34:31.997][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster heptio-contour/contour/http during init
[2018-06-01 12:34:31.997][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster heptio-contour/contour/https during init
[2018-06-01 12:34:31.998][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster kube-system/kube-dns/53 during init
[2018-06-01 12:34:31.998][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster kube-system/kube-dns/dns-tcp during init
[2018-06-01 12:34:31.999][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster kube-system/kube-state-metrics/8080 during init
[2018-06-01 12:34:32.000][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster kube-system/kube-state-metrics/8081 during init
[2018-06-01 12:34:32.001][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster kube-system/kube-state-metrics/http-metrics during init
[2018-06-01 12:34:32.002][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster kube-system/kube-state-metrics/telemetry during init
[2018-06-01 12:34:32.003][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:382] add/update cluster kube-system/kubernetes-dashboard/80 during init
[2018-06-01 12:34:32.003][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:108] cm init: initializing secondary clusters

2. Config

dynamic_resources:
  lds_config:
    api_config_source:
      api_type: GRPC
      cluster_names: [contour]
      grpc_services:
      - envoy_grpc:
          cluster_name: contour
  cds_config:
    api_config_source:
      api_type: GRPC
      cluster_names: [contour]
      grpc_services:
      - envoy_grpc:
          cluster_name: contour
static_resources:
  clusters:
  - name: contour
    connect_timeout: { seconds: 5 }
    type: STRICT_DNS
    hosts:
    - socket_address:
        address: 127.0.0.1
        port_value: 8001
    lb_policy: ROUND_ROBIN
    http2_protocol_options: {}
    circuit_breakers:
      thresholds:
        - priority: high
          max_connections: 100000
          max_pending_requests: 100000
          max_requests: 60000000
          max_retries: 50
        - priority: default
          max_connections: 100000
          max_pending_requests: 100000
          max_requests: 60000000
          max_retries: 50
  - name: service_stats
    connect_timeout: 0.250s
    type: LOGICAL_DNS
    lb_policy: ROUND_ROBIN
    hosts:
      - socket_address:
          protocol: TCP
          address: 127.0.0.1
          port_value: 9001
admin:
  access_log_path: /dev/null
  address:
    socket_address:
      address: 127.0.0.1
      port_value: 9001

@davecheney
Contributor

Can you please attach the contents of 'bad-service'? Thanks.

@davecheney
Contributor

I think this is unrelated to bad-service. What I see when I /quitquitquit my envoy container (thanks for the repro) is this:

[2018-06-04 02:09:40.498][1][debug][upstream] source/common/upstream/logical_dns_cluster.cc:70] starting async DNS resolution for 127.0.0.1
[2018-06-04 02:09:40.498][1][debug][upstream] source/common/upstream/logical_dns_cluster.cc:78] async DNS resolution complete for 127.0.0.1
[2018-06-04 02:09:40.498][1][debug][upstream] source/common/upstream/upstream_impl.cc:670] starting async DNS resolution for 127.0.0.1
[2018-06-04 02:09:40.498][1][debug][upstream] source/common/upstream/upstream_impl.cc:677] async DNS resolution complete for 127.0.0.1

(once logging has been turned up to debug)

This makes me think that Envoy cannot resolve 127.0.0.1 as a DNS name when connecting to the contour cluster for EDS/CDS/etc. 127.0.0.1 isn't a DNS name, so I'm going to try twiddling with the various DNS lookup types.

@davecheney davecheney added this to the 0.6.0 milestone Jun 4, 2018
@davecheney davecheney added the kind/bug and priority/critical-urgent labels and removed the blocked/needs-info label Jun 4, 2018
@davecheney davecheney self-assigned this Jun 4, 2018
@davecheney
Contributor

Nope, it's not related to the use of the STRICT_DNS or LOGICAL_DNS service discovery types.

@davecheney
Contributor

OK, what I think is going on is that on restart Envoy is either not picking up the listeners or not making a connection to LDS.

@davecheney
Contributor

davecheney commented Jun 4, 2018

Here's an easier reproducer: swap the order of the envoy and contour containers so contour is (most likely) running before envoy.

diff --git a/deployment/deployment-grpc-v2/02-contour.yaml b/deployment/deployment-grpc-v2/02-contour.yaml
index 8791f33..2bf32cb 100644
--- a/deployment/deployment-grpc-v2/02-contour.yaml
+++ b/deployment/deployment-grpc-v2/02-contour.yaml
@@ -21,6 +21,11 @@ spec:
         prometheus.io/format: "prometheus"
     spec:
       containers:
+      - image: docker.io/davecheney/contour:latest
+        name: contour
+        imagePullPolicy: Always
+        command: ["contour"]
+        args: ["serve", "--incluster"]
       - image: docker.io/envoyproxy/envoy-alpine:v1.6.0
         name: envoy
         ports:
@@ -33,13 +38,8 @@ spec:
         volumeMounts:
         - name: contour-config
           mountPath: /config
-      - image: gcr.io/heptio-images/contour:master
-        imagePullPolicy: Always
-        name: contour
-        command: ["contour"]
-        args: ["serve", "--incluster"]
       initContainers:
-      - image: gcr.io/heptio-images/contour:master
+      - image: docker.io/davecheney/contour:latest
         imagePullPolicy: Always
         name: envoy-initconfig
         command: ["contour"]

This increases the chances that contour will already be running, so the connection during init will succeed, rather than envoy's init completing successfully and then launching the full EDS song and dance afterwards.

davecheney added a commit to davecheney/contour that referenced this issue Jun 4, 2018
Updates projectcontour#420

Prior to this PR the envoy container appeared first in the
deployment.spec so would likely start before the contour container. This
would cause envoy to fail to contact contour immediately upon startup
and then fall back to its retry behaviour.

With this change, contour appears first in the spec so will likely start
before envoy, which aggravates the failure mode reported in issue projectcontour#420.

Signed-off-by: Dave Cheney <dave@cheney.net>
davecheney added a commit to davecheney/contour that referenced this issue Jun 4, 2018
Updates projectcontour#420

If stats are not enabled, there is no point in creating a static stats
cluster as it will not be used and generates log spam.

Signed-off-by: Dave Cheney <dave@cheney.net>
@davecheney
Contributor

OK, feeling more confident that envoy fails to fetch listeners precisely when it gets a bunch of CDS config during init.

Looking at the backend stats I see:

listener_manager.lds.update_attempt: 0
listener_manager.lds.update_failure: 0
listener_manager.lds.update_rejected: 0
listener_manager.lds.update_success: 0
listener_manager.lds.version: 0
listener_manager.listener_added: 0
listener_manager.listener_create_failure: 0
listener_manager.listener_create_success: 0
listener_manager.listener_modified: 0
listener_manager.listener_removed: 0
listener_manager.total_listeners_active: 0
listener_manager.total_listeners_draining: 0
listener_manager.total_listeners_warming: 0

But in /clusters we do have clusters created, so something is failing to trigger envoy to talk LDS.

@davecheney
Contributor

Possibly a recurrence of envoyproxy/envoy#2931.

The symptoms are very similar: envoy doesn't even try to talk LDS.

@davecheney
Contributor

Raised envoyproxy/envoy#3530

davecheney added a commit to davecheney/contour that referenced this issue Jun 5, 2018
Fixes projectcontour#420

In projectcontour#350 I added the ability to filter the results of a discovery
request to address the scalability issues in sending the full copy of
the xDS caches to envoy on every request -- this was especially important
for EDS which would send an entire copy of the EDS cache to each
StreamEndpoints listener for every change in k8s endpoints.

As part of this change I recognised that there would be situations where
a notification that the cache had changed would result in a wakeup and
calls to Values for events which didn't match the filter. At the time I
mistakenly thought that if a wakeup was triggered that didn't match the
filter, this would cause zero-length responses to be sent to Envoy,
causing EDS entries (and others) to pop in and out of existence.

For example, if you were watching for endpoints matching "A", then an
update happened to endpoints matching "B", I believed that when the "A"
watcher ran it would receive an empty set of endpoints and send those
back to envoy, effectively erasing the active set on the Envoy side.

In fact what happens, and what still happens, is that a notification of
a change to the EDS cache wakes up all watchers; they run Values(),
passing in their filter. This returns whatever results match the filter,
which critically will be whatever is in the EDS cache at the time. If
there are several results for "A" and the EDS cache was updated for "B",
then both "A"'s and "B"'s watchers will be woken up, they'll both call
Values() and then send the set of endpoints back to Envoy. In that case
the change to "B" will be reflected and the change to "A" will be a no-op.

What I tried to avoid was the update to "B" causing the Values call for
"A" to run and find no endpoints, because I misunderstood that an update
to "B"'s endpoints would make it appear that "A"'s endpoint set was
empty. This was not the case, and will only be the case if "A"'s
endpoint set is actually empty.

In the case that "A"'s endpoint set was empty, the logic prior to
this PR would skip the notification and wait for the number of endpoints
to grow above zero. This _sort of_ made sense when CDS and EDS entries
would arrive after startup, but was fundamentally flawed. EDS entries
with zero values are a valid response and don't represent a transient
state of a filter not matching.

Where this bug arose is during startup: Envoy _may_ query EDS values
synchronously or asynchronously (depending on timing), and in the former
case Envoy will first look in CDS for the names of all clusters, then
query EDS for the endpoints of all the clusters previously mentioned. If
an EDS entry had zero endpoints, the notification would be suppressed
and thus Envoy's startup would stall.

The fix to this bug is simple: always send the response we get back from
Values. If it's empty, then that is because there are no endpoints
registered, regardless of the filter applied to trim out endpoints that
were not requested.

Again, as with projectcontour#350 there is no new test, other than the
existing tests which prove that existing functionality (so far as it is
covered) is unaffected. However manual testing, in the form of moving
contour first in the deployment descriptor (which triggers Envoy's
synchronous startup), is part of projectcontour#422, which should give
some coverage.

Signed-off-by: Dave Cheney <dave@cheney.net>
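
To make the mechanism in the commit message concrete, here is a minimal Go sketch of the watcher/filter pattern it describes. This is not Contour's actual implementation: the endpointCache type and its Register, Update and Values methods are hypothetical stand-ins; only the idea (a cache update wakes every watcher, each watcher filters the cache, and an empty filtered result is a valid answer that must still be sent) comes from the commit text above.

package main

import "fmt"

// endpointCache is a hypothetical stand-in for an EDS cache: cluster name ->
// endpoint addresses, plus a set of watcher channels poked on every update.
type endpointCache struct {
	entries  map[string][]string
	watchers []chan struct{}
}

// Register adds a watcher that is woken on every cache change, regardless of
// whether the change matches the watcher's filter.
func (c *endpointCache) Register(ch chan struct{}) {
	c.watchers = append(c.watchers, ch)
}

// Update replaces the endpoints for one cluster and wakes all watchers.
func (c *endpointCache) Update(cluster string, addrs []string) {
	c.entries[cluster] = addrs
	for _, ch := range c.watchers {
		select {
		case ch <- struct{}{}:
		default: // watcher already has a pending wakeup
		}
	}
}

// Values returns whatever is in the cache right now for clusters matching the
// filter. An empty result is legitimate: those clusters have no endpoints yet.
func (c *endpointCache) Values(filter func(name string) bool) map[string][]string {
	out := map[string][]string{}
	for name, addrs := range c.entries {
		if filter(name) {
			out[name] = addrs
		}
	}
	return out
}

func main() {
	// bad-service exists in the cache but has zero endpoints, as in the repro.
	cache := &endpointCache{entries: map[string][]string{
		"heptio-contour/bad-service/80": {},
	}}

	wake := make(chan struct{}, 1)
	cache.Register(wake)

	// An unrelated update ("B" in the commit's terminology) wakes every watcher.
	cache.Update("default/nginx/80", []string{"10.0.0.1:80"})

	<-wake
	// This watcher ("A") only cares about bad-service and finds zero endpoints.
	resp := cache.Values(func(name string) bool {
		return name == "heptio-contour/bad-service/80"
	})

	if len(resp["heptio-contour/bad-service/80"]) == 0 {
		// Pre-fix behaviour described in the issue: the empty response was
		// suppressed, so Envoy never heard back and its synchronous init stalled.
		fmt.Println("before the fix: empty response suppressed, Envoy keeps waiting")
	}

	// Post-fix behaviour: always send whatever Values returned, empty or not,
	// so Envoy can mark the cluster initialized and move on to LDS.
	fmt.Printf("after the fix: send %v\n", resp)
}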
@mattmoyer mattmoyer reopened this Jun 6, 2018
@davecheney
Contributor

@mattmoyer can you tell me why you reopened this, please?

@mattmoyer

Oops, I think that must have been a misclick.

sunjayBhatia added a commit that referenced this issue Jan 30, 2023
This change should put us in a better place once we start moving parts
of Contour and the Operator around.

We can lean more heavily on this test suite rather than the e2e test
suite since it should be faster to run/write etc.

This code probably duplicates some things from the e2e suite and things
from that suite could be brought into here to run faster and with fewer
flakes. Those should be future changes after this starting point.

- Moved from Ginkgo to regular go tests (no need for extra ginkgo
complexity yet)
- Ports existing tests
- Add additional test coverage
- Some small changes to how the controller logs the handling of namespace
deletion (it was confusing to see the namespace reported as deleted when it
wasn't actually, because remove-on-delete was false)

Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>