
Service endpoint changes not updated in envoy #293

Closed
drobinson123 opened this issue Mar 19, 2018 · 6 comments
Labels: kind/bug (Categorizes issue or PR as related to a bug), priority/important-soon (Must be staffed and worked on either currently, or very soon, ideally in time for the next release)
Milestone: 0.4.1
@drobinson123
I'm seeing some weird issues with Contour 0.4. It seems like Contour configures Envoy correctly at startup but then fails to keep Envoy updated as resources change (service endpoints specifically). If I restart a Contour pod it comes up with the configuration I expect and routes requests correctly, until endpoints change again. Here's what I see in the logs at startup -- should I be concerned by the two "gRPC update" messages?

$ kubectl -n heptio-contour logs contour-gv9gk -c envoy -f
[2018-03-19 19:52:23.066][1][info][main] source/server/server.cc:178] initializing epoch 0 (hot restart version=9.200.16384.127.options=capacity=16384, num_slots=8209 hash=228984379728933363)
[2018-03-19 19:52:23.072][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:128] cm init: initializing cds
[2018-03-19 19:52:23.073][1][info][config] source/server/configuration_impl.cc:52] loading 0 listener(s)
[2018-03-19 19:52:23.073][1][info][config] source/server/configuration_impl.cc:92] loading tracing configuration
[2018-03-19 19:52:23.073][1][info][config] source/server/configuration_impl.cc:119] loading stats sink configuration
[2018-03-19 19:52:23.073][1][info][main] source/server/server.cc:353] starting main dispatch loop
[2018-03-19 19:52:23.075][1][warning][upstream] source/common/config/grpc_mux_impl.cc:205] gRPC config stream closed: 1,
[2018-03-19 19:52:23.075][1][warning][config] bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_mux_subscription_lib/common/config/grpc_mux_subscription_impl.h:66] gRPC update for type.googleapis.com/envoy.api.v2.Cluster failed
[2018-03-19 19:52:23.075][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:132] cm init: all clusters initialized
[2018-03-19 19:52:23.075][1][info][main] source/server/server.cc:337] all clusters initialized. initializing init manager
[2018-03-19 19:52:23.075][1][warning][upstream] source/common/config/grpc_mux_impl.cc:205] gRPC config stream closed: 1,
[2018-03-19 19:52:23.075][1][warning][config] bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_mux_subscription_lib/common/config/grpc_mux_subscription_impl.h:66] gRPC update for type.googleapis.com/envoy.api.v2.Listener failed
[2018-03-19 19:52:23.075][1][info][config] source/server/listener_manager_impl.cc:583] all dependencies initialized. starting workers

Contour is running in AWS, behind an NLB, with TLS. It was deployed with the ds-hostnet config plus the changes below (for TLS):

--- deployment/ds-hostnet/02-contour.yaml
+++ deployment/ds-hostnet/02-contour.yaml
@@ -28,8 +28,10 @@ spec:
         ports:
         - containerPort: 8080
           name: http
+        - containerPort: 8443
+          name: https
         command: ["envoy"]
-        args: ["-c", "/config/contour.yaml", "--service-cluster", "cluster0", "--service-node", "node0"]
+        args: ["-c", "/config/contour.yaml", "--service-cluster", "cluster0", "--service-node", "node0", "-l", "info", "--v2-config-only"]
         volumeMounts:
         - name: contour-config
           mountPath: /config

Snippet from envoy's /clusters endpoint when routing is broken:

admin/fleet/8080::default_priority::max_connections::1024
admin/fleet/8080::default_priority::max_pending_requests::1024
admin/fleet/8080::default_priority::max_requests::1024
admin/fleet/8080::default_priority::max_retries::3
admin/fleet/8080::high_priority::max_connections::1024
admin/fleet/8080::high_priority::max_pending_requests::1024
admin/fleet/8080::high_priority::max_requests::1024
admin/fleet/8080::high_priority::max_retries::3
admin/fleet/8080::added_via_api::true

Snippet from envoy's /clusters endpoint when routing is working:

admin/fleet/8080::default_priority::max_connections::1024
admin/fleet/8080::default_priority::max_pending_requests::1024
admin/fleet/8080::default_priority::max_requests::1024
admin/fleet/8080::default_priority::max_retries::3
admin/fleet/8080::high_priority::max_connections::1024
admin/fleet/8080::high_priority::max_pending_requests::1024
admin/fleet/8080::high_priority::max_requests::1024
admin/fleet/8080::high_priority::max_retries::3
admin/fleet/8080::added_via_api::true
admin/fleet/8080::100.96.1.166:8080::cx_active::0
admin/fleet/8080::100.96.1.166:8080::cx_connect_fail::0
admin/fleet/8080::100.96.1.166:8080::cx_total::0
admin/fleet/8080::100.96.1.166:8080::rq_active::0
admin/fleet/8080::100.96.1.166:8080::rq_error::0
admin/fleet/8080::100.96.1.166:8080::rq_success::0
admin/fleet/8080::100.96.1.166:8080::rq_timeout::0
admin/fleet/8080::100.96.1.166:8080::rq_total::0
admin/fleet/8080::100.96.1.166:8080::health_flags::healthy
admin/fleet/8080::100.96.1.166:8080::weight::1
admin/fleet/8080::100.96.1.166:8080::region::
admin/fleet/8080::100.96.1.166:8080::zone::
admin/fleet/8080::100.96.1.166:8080::sub_zone::
admin/fleet/8080::100.96.1.166:8080::canary::false
admin/fleet/8080::100.96.1.166:8080::success_rate::-1
@davecheney (Contributor)

Don't worry about those warnings during startup. Envoy starts before Contour and its first attempt to connect to Contour usually fails.

davecheney added a commit to davecheney/contour that referenced this issue Mar 19, 2018
Updates projectcontour#293

This PR moves `cmd/contourcli` into the main `cmd/contour` binary so
that it can be used via kubectl exec.

Signed-off-by: Dave Cheney <dave@cheney.net>
@drobinson123 (Author)

The contour cli eds command shows some interesting behavior.

Original state: kuard deployment has 3 pods (and 3 endpoints):

$ kubectl -n admin get deployment kuard
NAME      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
kuard     3         3         3            3           4h

contour cli eds shows:

resources: <
  [type.googleapis.com/envoy.api.v2.ClusterLoadAssignment]: <
    cluster_name: "admin/kuard"
    endpoints: <
      lb_endpoints: <
        endpoint: <
          address: <
            socket_address: <
              address: "100.96.1.203"
              port_value: 8080
            >
          >
        >
      >
      lb_endpoints: <
        endpoint: <
          address: <
            socket_address: <
              address: "100.96.2.130"
              port_value: 8080
            >
          >
        >
      >
      lb_endpoints: <
        endpoint: <
          address: <
            socket_address: <
              address: "100.96.3.132"
              port_value: 8080
            >
          >
        >
      >
    >
  >
>

And /clusters shows:

$ curl -s localhost:9001/clusters | grep kuard
admin/kuard/80::default_priority::max_connections::1024
admin/kuard/80::default_priority::max_pending_requests::1024
admin/kuard/80::default_priority::max_requests::1024
admin/kuard/80::default_priority::max_retries::3
admin/kuard/80::high_priority::max_connections::1024
admin/kuard/80::high_priority::max_pending_requests::1024
admin/kuard/80::high_priority::max_requests::1024
admin/kuard/80::high_priority::max_retries::3
admin/kuard/80::added_via_api::true
admin/kuard/80::100.96.1.203:8080::cx_active::0
admin/kuard/80::100.96.1.203:8080::cx_connect_fail::0
admin/kuard/80::100.96.1.203:8080::cx_total::0
admin/kuard/80::100.96.1.203:8080::rq_active::0
admin/kuard/80::100.96.1.203:8080::rq_error::0
admin/kuard/80::100.96.1.203:8080::rq_success::0
admin/kuard/80::100.96.1.203:8080::rq_timeout::0
admin/kuard/80::100.96.1.203:8080::rq_total::0
admin/kuard/80::100.96.1.203:8080::health_flags::healthy
admin/kuard/80::100.96.1.203:8080::weight::1
admin/kuard/80::100.96.1.203:8080::region::
admin/kuard/80::100.96.1.203:8080::zone::
admin/kuard/80::100.96.1.203:8080::sub_zone::
admin/kuard/80::100.96.1.203:8080::canary::false
admin/kuard/80::100.96.1.203:8080::success_rate::-1
admin/kuard/80::100.96.2.130:8080::cx_active::0
admin/kuard/80::100.96.2.130:8080::cx_connect_fail::0
admin/kuard/80::100.96.2.130:8080::cx_total::0
admin/kuard/80::100.96.2.130:8080::rq_active::0
admin/kuard/80::100.96.2.130:8080::rq_error::0
admin/kuard/80::100.96.2.130:8080::rq_success::0
admin/kuard/80::100.96.2.130:8080::rq_timeout::0
admin/kuard/80::100.96.2.130:8080::rq_total::0
admin/kuard/80::100.96.2.130:8080::health_flags::healthy
admin/kuard/80::100.96.2.130:8080::weight::1
admin/kuard/80::100.96.2.130:8080::region::
admin/kuard/80::100.96.2.130:8080::zone::
admin/kuard/80::100.96.2.130:8080::sub_zone::
admin/kuard/80::100.96.2.130:8080::canary::false
admin/kuard/80::100.96.2.130:8080::success_rate::-1
admin/kuard/80::100.96.3.132:8080::cx_active::0
admin/kuard/80::100.96.3.132:8080::cx_connect_fail::0
admin/kuard/80::100.96.3.132:8080::cx_total::0
admin/kuard/80::100.96.3.132:8080::rq_active::0
admin/kuard/80::100.96.3.132:8080::rq_error::0
admin/kuard/80::100.96.3.132:8080::rq_success::0
admin/kuard/80::100.96.3.132:8080::rq_timeout::0
admin/kuard/80::100.96.3.132:8080::rq_total::0
admin/kuard/80::100.96.3.132:8080::health_flags::healthy
admin/kuard/80::100.96.3.132:8080::weight::1
admin/kuard/80::100.96.3.132:8080::region::
admin/kuard/80::100.96.3.132:8080::zone::
admin/kuard/80::100.96.3.132:8080::sub_zone::
admin/kuard/80::100.96.3.132:8080::canary::false

Then, upon deleting the deployment, I have 0 pods but contour cli eds still shows 2 endpoints:

$ kubectl -n admin delete deployment kuard
deployment "kuard" deleted
$ kubectl -n admin get deployment kuard
Error from server (NotFound): deployments.extensions "kuard" not found

contour cli eds still shows:
resources: <
  [type.googleapis.com/envoy.api.v2.ClusterLoadAssignment]: <
    cluster_name: "admin/kuard"
    endpoints: <
      lb_endpoints: <
        endpoint: <
          address: <
            socket_address: <
              address: "100.96.1.203"
              port_value: 8080
            >
          >
        >
      >
      lb_endpoints: <
        endpoint: <
          address: <
            socket_address: <
              address: "100.96.2.130"
              port_value: 8080
            >
          >
        >
      >
    >
  >
>
$ curl -s localhost:9001/clusters | grep kuard
admin/kuard/80::default_priority::max_connections::1024
admin/kuard/80::default_priority::max_pending_requests::1024
admin/kuard/80::default_priority::max_requests::1024
admin/kuard/80::default_priority::max_retries::3
admin/kuard/80::high_priority::max_connections::1024
admin/kuard/80::high_priority::max_pending_requests::1024
admin/kuard/80::high_priority::max_requests::1024
admin/kuard/80::high_priority::max_retries::3
admin/kuard/80::added_via_api::true
admin/kuard/80::100.96.1.203:8080::cx_active::0
admin/kuard/80::100.96.1.203:8080::cx_connect_fail::0
admin/kuard/80::100.96.1.203:8080::cx_total::0
admin/kuard/80::100.96.1.203:8080::rq_active::0
admin/kuard/80::100.96.1.203:8080::rq_error::0
admin/kuard/80::100.96.1.203:8080::rq_success::0
admin/kuard/80::100.96.1.203:8080::rq_timeout::0
admin/kuard/80::100.96.1.203:8080::rq_total::0
admin/kuard/80::100.96.1.203:8080::health_flags::healthy
admin/kuard/80::100.96.1.203:8080::weight::1
admin/kuard/80::100.96.1.203:8080::region::
admin/kuard/80::100.96.1.203:8080::zone::
admin/kuard/80::100.96.1.203:8080::sub_zone::
admin/kuard/80::100.96.1.203:8080::canary::false
admin/kuard/80::100.96.1.203:8080::success_rate::-1
admin/kuard/80::100.96.2.130:8080::cx_active::0
admin/kuard/80::100.96.2.130:8080::cx_connect_fail::0
admin/kuard/80::100.96.2.130:8080::cx_total::0
admin/kuard/80::100.96.2.130:8080::rq_active::0
admin/kuard/80::100.96.2.130:8080::rq_error::0
admin/kuard/80::100.96.2.130:8080::rq_success::0
admin/kuard/80::100.96.2.130:8080::rq_timeout::0
admin/kuard/80::100.96.2.130:8080::rq_total::0
admin/kuard/80::100.96.2.130:8080::health_flags::healthy
admin/kuard/80::100.96.2.130:8080::weight::1
admin/kuard/80::100.96.2.130:8080::region::
admin/kuard/80::100.96.2.130:8080::zone::
admin/kuard/80::100.96.2.130:8080::sub_zone::
admin/kuard/80::100.96.2.130:8080::canary::false
admin/kuard/80::100.96.2.130:8080::success_rate::-1

@drobinson123 (Author)

contour cli eds still shows the 2 endpoints after the svc is deleted, but /clusters does not:

resources: <
  [type.googleapis.com/envoy.api.v2.ClusterLoadAssignment]: <
    cluster_name: "admin/kuard"
    endpoints: <
      lb_endpoints: <
        endpoint: <
          address: <
            socket_address: <
              address: "100.96.1.203"
              port_value: 8080
            >
          >
        >
      >
      lb_endpoints: <
        endpoint: <
          address: <
            socket_address: <
              address: "100.96.2.130"
              port_value: 8080
            >
          >
        >
      >
    >
  >
>
$ curl -s localhost:9001/clusters | grep kuard
$

I removed the ingress, then created the svc and deployment again. contour cli eds showed 1, then 2, then all 3 endpoints, but they are still missing from /clusters:

resources: <
  [type.googleapis.com/envoy.api.v2.ClusterLoadAssignment]: <
    cluster_name: "admin/kuard"
    endpoints: <
      lb_endpoints: <
        endpoint: <
          address: <
            socket_address: <
              address: "100.96.3.136"
              port_value: 8080
            >
          >
        >
      >
    >
  >
>
resources: <
  [type.googleapis.com/envoy.api.v2.ClusterLoadAssignment]: <
    cluster_name: "admin/kuard"
    endpoints: <
      lb_endpoints: <
        endpoint: <
          address: <
            socket_address: <
              address: "100.96.1.205"
              port_value: 8080
            >
          >
        >
      >
      lb_endpoints: <
        endpoint: <
          address: <
            socket_address: <
              address: "100.96.3.136"
              port_value: 8080
            >
          >
        >
      >
    >
  >
>
resources: <
  [type.googleapis.com/envoy.api.v2.ClusterLoadAssignment]: <
    cluster_name: "admin/kuard"
    endpoints: <
      lb_endpoints: <
        endpoint: <
          address: <
            socket_address: <
              address: "100.96.1.205"
              port_value: 8080
            >
          >
        >
      >
      lb_endpoints: <
        endpoint: <
          address: <
            socket_address: <
              address: "100.96.3.136"
              port_value: 8080
            >
          >
        >
      >
      lb_endpoints: <
        endpoint: <
          address: <
            socket_address: <
              address: "100.96.5.39"
              port_value: 8080
            >
          >
        >
      >
    >
  >
>
$ curl -s localhost:9001/clusters | grep kuard
admin/kuard/80::default_priority::max_connections::1024
admin/kuard/80::default_priority::max_pending_requests::1024
admin/kuard/80::default_priority::max_requests::1024
admin/kuard/80::default_priority::max_retries::3
admin/kuard/80::high_priority::max_connections::1024
admin/kuard/80::high_priority::max_pending_requests::1024
admin/kuard/80::high_priority::max_requests::1024
admin/kuard/80::high_priority::max_retries::3
admin/kuard/80::added_via_api::true
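For anyone reproducing this comparison, a small helper can pull out just the endpoint members Envoy actually holds for one cluster. This is only a sketch; the function name is made up here and it assumes the `::`-delimited /clusters text format shown in the dumps above.

```shell
# Hypothetical helper: read Envoy admin /clusters text on stdin and print
# the unique IP:port endpoint members recorded for the named cluster.
# Stats rows look like:  admin/kuard/80::100.96.1.203:8080::cx_active::0
endpoints_for() {
  awk -F'::' -v cluster="$1" \
    '$1 == cluster && $2 ~ /^[0-9.]+:[0-9]+$/ { print $2 }' | sort -u
}

# Example usage (against the admin port used earlier in this thread):
#   curl -s localhost:9001/clusters | endpoints_for admin/kuard/80
```

An empty result while contour cli eds still lists endpoints reproduces the mismatch described in this comment.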

Lookyan pushed a commit to Lookyan/contour that referenced this issue Mar 30, 2018
Updates projectcontour#293

This PR moves `cmd/contourcli` into the main `cmd/contour` binary so
that it can be used via kubectl exec.

Signed-off-by: Dave Cheney <dave@cheney.net>
@davecheney added this to the 0.4.1 milestone Apr 3, 2018
@davecheney self-assigned this Apr 3, 2018
@davecheney (Contributor)

Thanks to Alexander Lukyanchenko (@Lookyan) we have increased the general gRPC limits on both the Envoy client and Contour server well above anything that should be an issue for the immediate future.

The symptoms of hitting gRPC limits vary, but they boil down to "Envoy doesn't see changes in the API server until I restart it". The underlying cause is likely a large number of Service objects in your cluster (more than 100, possibly 200; the exact limit is not precisely known) -- these don't have to be associated with an Ingress. Currently Contour creates a CDS Cluster record for every Service object it learns about through the API, see #298. Each CDS record causes Envoy to open a new EDS stream, one per Cluster, which can blow through the default limits that Envoy, as the gRPC client, and Contour, as the gRPC server, have set.
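As a rough back-of-the-envelope check, you could compare your Service count against that stream budget. A sketch only: the function name is made up, and the 100-stream figure is an assumption based on the common HTTP/2 per-connection default, not a limit confirmed in this thread.

```shell
# Hypothetical sketch: compare a Service count against an assumed
# per-connection concurrent-stream limit (commonly 100 for HTTP/2).
check_stream_budget() {
  count=$1
  limit=${2:-100}
  if [ "$count" -ge "$limit" ]; then
    echo "at risk: $count Services >= assumed stream limit of $limit"
  else
    echo "ok: $count Services < assumed stream limit of $limit"
  fi
}

# Example usage:
#   check_stream_budget "$(kubectl get svc --all-namespaces --no-headers | wc -l)"
```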

One of the easiest ways to detect whether this issue is occurring in your cluster is to look for "cluster warming" lines:

[2018-04-03 03:34:16.920][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:388] add/update cluster test2/reverent-noether/80 starting warming
[2018-04-03 03:34:16.922][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:388] add/update cluster test2/serene-bohr/80 starting warming
[2018-04-03 03:34:16.924][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:388] add/update cluster test2/sleepy-hugle/80 starting warming

without a matching "warming complete" message.
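That check can be scripted against a saved copy of the Envoy log. This is a sketch under assumptions: the helper name is invented, and the completion message format ("warming cluster ... complete") is inferred rather than confirmed above.

```shell
# Hypothetical helper: print cluster names that logged "starting warming"
# but have no matching completion line. Message formats are assumptions
# based on the log snippet above.
warming_stuck() {
  comm -23 \
    <(grep -o 'add/update cluster [^ ]* starting warming' "$1" \
        | awk '{print $3}' | sort -u) \
    <(grep -o 'warming cluster [^ ]* complete' "$1" \
        | awk '{print $3}' | sort -u)
}

# Example usage:
#   kubectl -n heptio-contour logs contour-gv9gk -c envoy > envoy.log
#   warming_stuck envoy.log
```

Any cluster name it prints started warming and never finished, which matches the stuck state described here.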

We believe we have addressed this issue in #291, and the fixes are now available to test.

These changes are in master now, and available in the gcr.io/heptio-images/contour:master image for you to try.

This has been backported to the release-0.4 branch and is available in a short-lived image, gcr.io/heptio-images/contour:release-0.4 (the image won't be deleted, but don't expect it to be updated beyond the 0.4.1 release).

@davecheney added the kind/bug and priority/important-soon labels Apr 3, 2018
@drobinson123 (Author)

The problem appears to have been fixed, at least in the limited testing I've done. Thanks @Lookyan & @davecheney !

@davecheney (Contributor)

Thanks for confirming.

sunjayBhatia pushed a commit that referenced this issue Jan 30, 2023
…293)

operator does not create envoy service when envoy
is ClusterIPService type and gatewayClassRef.

This patch fixes it.

Signed-off-by: Kenjiro Nakayama <nakayamakenjiro@gmail.com>