
Service endpoint changes not updated in envoy #293

Closed
drobinson123 opened this issue Mar 19, 2018 · 6 comments
Labels: kind/bug (Categorizes issue or PR as related to a bug), priority/important-soon (Must be staffed and worked on either currently, or very soon, ideally in time for the next release)
Milestone: 0.4.1
@drobinson123
I'm seeing some weird issues with Contour 0.4. It seems like Contour configures Envoy correctly at startup but then fails to keep Envoy updated as resources change (service endpoints specifically). If I restart a Contour pod it comes up with the configuration I expect and routes requests correctly, until endpoints change again. Here's what I see in the logs at startup -- should I be concerned by the two "gRPC update" messages?

$ kubectl -n heptio-contour logs contour-gv9gk -c envoy -f
[2018-03-19 19:52:23.066][1][info][main] source/server/server.cc:178] initializing epoch 0 (hot restart version=9.200.16384.127.options=capacity=16384, num_slots=8209 hash=228984379728933363)
[2018-03-19 19:52:23.072][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:128] cm init: initializing cds
[2018-03-19 19:52:23.073][1][info][config] source/server/configuration_impl.cc:52] loading 0 listener(s)
[2018-03-19 19:52:23.073][1][info][config] source/server/configuration_impl.cc:92] loading tracing configuration
[2018-03-19 19:52:23.073][1][info][config] source/server/configuration_impl.cc:119] loading stats sink configuration
[2018-03-19 19:52:23.073][1][info][main] source/server/server.cc:353] starting main dispatch loop
[2018-03-19 19:52:23.075][1][warning][upstream] source/common/config/grpc_mux_impl.cc:205] gRPC config stream closed: 1,
[2018-03-19 19:52:23.075][1][warning][config] bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_mux_subscription_lib/common/config/grpc_mux_subscription_impl.h:66] gRPC update for type.googleapis.com/envoy.api.v2.Cluster failed
[2018-03-19 19:52:23.075][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:132] cm init: all clusters initialized
[2018-03-19 19:52:23.075][1][info][main] source/server/server.cc:337] all clusters initialized. initializing init manager
[2018-03-19 19:52:23.075][1][warning][upstream] source/common/config/grpc_mux_impl.cc:205] gRPC config stream closed: 1,
[2018-03-19 19:52:23.075][1][warning][config] bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_mux_subscription_lib/common/config/grpc_mux_subscription_impl.h:66] gRPC update for type.googleapis.com/envoy.api.v2.Listener failed
[2018-03-19 19:52:23.075][1][info][config] source/server/listener_manager_impl.cc:583] all dependencies initialized. starting workers

Contour is running in AWS, behind an NLB, with TLS. It was deployed with the ds-hostnet config plus the changes below (for TLS):

--- deployment/ds-hostnet/02-contour.yaml
+++ deployment/ds-hostnet/02-contour.yaml
@@ -28,8 +28,10 @@ spec:
         ports:
         - containerPort: 8080
           name: http
+        - containerPort: 8443
+          name: https
         command: ["envoy"]
-        args: ["-c", "/config/contour.yaml", "--service-cluster", "cluster0", "--service-node", "node0"]
+        args: ["-c", "/config/contour.yaml", "--service-cluster", "cluster0", "--service-node", "node0", "-l", "info", "--v2-config-only"]
         volumeMounts:
         - name: contour-config
           mountPath: /config

Snippet from envoy's /clusters endpoint when routing is broken:

admin/fleet/8080::default_priority::max_connections::1024
admin/fleet/8080::default_priority::max_pending_requests::1024
admin/fleet/8080::default_priority::max_requests::1024
admin/fleet/8080::default_priority::max_retries::3
admin/fleet/8080::high_priority::max_connections::1024
admin/fleet/8080::high_priority::max_pending_requests::1024
admin/fleet/8080::high_priority::max_requests::1024
admin/fleet/8080::high_priority::max_retries::3
admin/fleet/8080::added_via_api::true

Snippet from envoy's /clusters endpoint when routing is working:

admin/fleet/8080::default_priority::max_connections::1024
admin/fleet/8080::default_priority::max_pending_requests::1024
admin/fleet/8080::default_priority::max_requests::1024
admin/fleet/8080::default_priority::max_retries::3
admin/fleet/8080::high_priority::max_connections::1024
admin/fleet/8080::high_priority::max_pending_requests::1024
admin/fleet/8080::high_priority::max_requests::1024
admin/fleet/8080::high_priority::max_retries::3
admin/fleet/8080::added_via_api::true
admin/fleet/8080::100.96.1.166:8080::cx_active::0
admin/fleet/8080::100.96.1.166:8080::cx_connect_fail::0
admin/fleet/8080::100.96.1.166:8080::cx_total::0
admin/fleet/8080::100.96.1.166:8080::rq_active::0
admin/fleet/8080::100.96.1.166:8080::rq_error::0
admin/fleet/8080::100.96.1.166:8080::rq_success::0
admin/fleet/8080::100.96.1.166:8080::rq_timeout::0
admin/fleet/8080::100.96.1.166:8080::rq_total::0
admin/fleet/8080::100.96.1.166:8080::health_flags::healthy
admin/fleet/8080::100.96.1.166:8080::weight::1
admin/fleet/8080::100.96.1.166:8080::region::
admin/fleet/8080::100.96.1.166:8080::zone::
admin/fleet/8080::100.96.1.166:8080::sub_zone::
admin/fleet/8080::100.96.1.166:8080::canary::false
admin/fleet/8080::100.96.1.166:8080::success_rate::-1
@davecheney (Contributor)

Don't worry about those warnings during startup. Envoy starts before Contour and its first attempt to connect to Contour usually fails.

davecheney added a commit to davecheney/contour that referenced this issue Mar 19, 2018
Updates projectcontour#293

This PR moves `cmd/contourcli` into the main `cmd/contour` binary so
that it can be used via kubectl exec.

Signed-off-by: Dave Cheney <dave@cheney.net>
@drobinson123 (Author)

The contour cli eds command shows some interesting behavior.

Original state: kuard deployment has 3 pods (and 3 endpoints):

$ kubectl -n admin get deployment kuard
NAME      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
kuard     3         3         3            3           4h

contour cli eds shows:

resources: <
  [type.googleapis.com/envoy.api.v2.ClusterLoadAssignment]: <
    cluster_name: "admin/kuard"
    endpoints: <
      lb_endpoints: <
        endpoint: <
          address: <
            socket_address: <
              address: "100.96.1.203"
              port_value: 8080
            >
          >
        >
      >
      lb_endpoints: <
        endpoint: <
          address: <
            socket_address: <
              address: "100.96.2.130"
              port_value: 8080
            >
          >
        >
      >
      lb_endpoints: <
        endpoint: <
          address: <
            socket_address: <
              address: "100.96.3.132"
              port_value: 8080
            >
          >
        >
      >
    >
  >
>

And /clusters shows:

$ curl -s localhost:9001/clusters | grep kuard
admin/kuard/80::default_priority::max_connections::1024
admin/kuard/80::default_priority::max_pending_requests::1024
admin/kuard/80::default_priority::max_requests::1024
admin/kuard/80::default_priority::max_retries::3
admin/kuard/80::high_priority::max_connections::1024
admin/kuard/80::high_priority::max_pending_requests::1024
admin/kuard/80::high_priority::max_requests::1024
admin/kuard/80::high_priority::max_retries::3
admin/kuard/80::added_via_api::true
admin/kuard/80::100.96.1.203:8080::cx_active::0
admin/kuard/80::100.96.1.203:8080::cx_connect_fail::0
admin/kuard/80::100.96.1.203:8080::cx_total::0
admin/kuard/80::100.96.1.203:8080::rq_active::0
admin/kuard/80::100.96.1.203:8080::rq_error::0
admin/kuard/80::100.96.1.203:8080::rq_success::0
admin/kuard/80::100.96.1.203:8080::rq_timeout::0
admin/kuard/80::100.96.1.203:8080::rq_total::0
admin/kuard/80::100.96.1.203:8080::health_flags::healthy
admin/kuard/80::100.96.1.203:8080::weight::1
admin/kuard/80::100.96.1.203:8080::region::
admin/kuard/80::100.96.1.203:8080::zone::
admin/kuard/80::100.96.1.203:8080::sub_zone::
admin/kuard/80::100.96.1.203:8080::canary::false
admin/kuard/80::100.96.1.203:8080::success_rate::-1
admin/kuard/80::100.96.2.130:8080::cx_active::0
admin/kuard/80::100.96.2.130:8080::cx_connect_fail::0
admin/kuard/80::100.96.2.130:8080::cx_total::0
admin/kuard/80::100.96.2.130:8080::rq_active::0
admin/kuard/80::100.96.2.130:8080::rq_error::0
admin/kuard/80::100.96.2.130:8080::rq_success::0
admin/kuard/80::100.96.2.130:8080::rq_timeout::0
admin/kuard/80::100.96.2.130:8080::rq_total::0
admin/kuard/80::100.96.2.130:8080::health_flags::healthy
admin/kuard/80::100.96.2.130:8080::weight::1
admin/kuard/80::100.96.2.130:8080::region::
admin/kuard/80::100.96.2.130:8080::zone::
admin/kuard/80::100.96.2.130:8080::sub_zone::
admin/kuard/80::100.96.2.130:8080::canary::false
admin/kuard/80::100.96.2.130:8080::success_rate::-1
admin/kuard/80::100.96.3.132:8080::cx_active::0
admin/kuard/80::100.96.3.132:8080::cx_connect_fail::0
admin/kuard/80::100.96.3.132:8080::cx_total::0
admin/kuard/80::100.96.3.132:8080::rq_active::0
admin/kuard/80::100.96.3.132:8080::rq_error::0
admin/kuard/80::100.96.3.132:8080::rq_success::0
admin/kuard/80::100.96.3.132:8080::rq_timeout::0
admin/kuard/80::100.96.3.132:8080::rq_total::0
admin/kuard/80::100.96.3.132:8080::health_flags::healthy
admin/kuard/80::100.96.3.132:8080::weight::1
admin/kuard/80::100.96.3.132:8080::region::
admin/kuard/80::100.96.3.132:8080::zone::
admin/kuard/80::100.96.3.132:8080::sub_zone::
admin/kuard/80::100.96.3.132:8080::canary::false

Then, upon deleting the deployment, I have 0 pods but contour cli eds still shows 2 endpoints:

$ kubectl -n admin delete deployment kuard
deployment "kuard" deleted
$ kubectl -n admin get deployment kuard
Error from server (NotFound): deployments.extensions "kuard" not found

contour cli eds still shows:
resources: <
  [type.googleapis.com/envoy.api.v2.ClusterLoadAssignment]: <
    cluster_name: "admin/kuard"
    endpoints: <
      lb_endpoints: <
        endpoint: <
          address: <
            socket_address: <
              address: "100.96.1.203"
              port_value: 8080
            >
          >
        >
      >
      lb_endpoints: <
        endpoint: <
          address: <
            socket_address: <
              address: "100.96.2.130"
              port_value: 8080
            >
          >
        >
      >
    >
  >
>
$ curl -s localhost:9001/clusters | grep kuard
admin/kuard/80::default_priority::max_connections::1024
admin/kuard/80::default_priority::max_pending_requests::1024
admin/kuard/80::default_priority::max_requests::1024
admin/kuard/80::default_priority::max_retries::3
admin/kuard/80::high_priority::max_connections::1024
admin/kuard/80::high_priority::max_pending_requests::1024
admin/kuard/80::high_priority::max_requests::1024
admin/kuard/80::high_priority::max_retries::3
admin/kuard/80::added_via_api::true
admin/kuard/80::100.96.1.203:8080::cx_active::0
admin/kuard/80::100.96.1.203:8080::cx_connect_fail::0
admin/kuard/80::100.96.1.203:8080::cx_total::0
admin/kuard/80::100.96.1.203:8080::rq_active::0
admin/kuard/80::100.96.1.203:8080::rq_error::0
admin/kuard/80::100.96.1.203:8080::rq_success::0
admin/kuard/80::100.96.1.203:8080::rq_timeout::0
admin/kuard/80::100.96.1.203:8080::rq_total::0
admin/kuard/80::100.96.1.203:8080::health_flags::healthy
admin/kuard/80::100.96.1.203:8080::weight::1
admin/kuard/80::100.96.1.203:8080::region::
admin/kuard/80::100.96.1.203:8080::zone::
admin/kuard/80::100.96.1.203:8080::sub_zone::
admin/kuard/80::100.96.1.203:8080::canary::false
admin/kuard/80::100.96.1.203:8080::success_rate::-1
admin/kuard/80::100.96.2.130:8080::cx_active::0
admin/kuard/80::100.96.2.130:8080::cx_connect_fail::0
admin/kuard/80::100.96.2.130:8080::cx_total::0
admin/kuard/80::100.96.2.130:8080::rq_active::0
admin/kuard/80::100.96.2.130:8080::rq_error::0
admin/kuard/80::100.96.2.130:8080::rq_success::0
admin/kuard/80::100.96.2.130:8080::rq_timeout::0
admin/kuard/80::100.96.2.130:8080::rq_total::0
admin/kuard/80::100.96.2.130:8080::health_flags::healthy
admin/kuard/80::100.96.2.130:8080::weight::1
admin/kuard/80::100.96.2.130:8080::region::
admin/kuard/80::100.96.2.130:8080::zone::
admin/kuard/80::100.96.2.130:8080::sub_zone::
admin/kuard/80::100.96.2.130:8080::canary::false
admin/kuard/80::100.96.2.130:8080::success_rate::-1

@drobinson123 (Author)

contour cli eds still shows the 2 endpoints after the svc is deleted, but /clusters does not:

resources: <
  [type.googleapis.com/envoy.api.v2.ClusterLoadAssignment]: <
    cluster_name: "admin/kuard"
    endpoints: <
      lb_endpoints: <
        endpoint: <
          address: <
            socket_address: <
              address: "100.96.1.203"
              port_value: 8080
            >
          >
        >
      >
      lb_endpoints: <
        endpoint: <
          address: <
            socket_address: <
              address: "100.96.2.130"
              port_value: 8080
            >
          >
        >
      >
    >
  >
>
$ curl -s localhost:9001/clusters | grep kuard
$

I removed the ingress, then created the svc and deployment again. contour cli eds showed 1, then 2, then all 3 endpoints, but they are still missing from /clusters:

resources: <
  [type.googleapis.com/envoy.api.v2.ClusterLoadAssignment]: <
    cluster_name: "admin/kuard"
    endpoints: <
      lb_endpoints: <
        endpoint: <
          address: <
            socket_address: <
              address: "100.96.3.136"
              port_value: 8080
            >
          >
        >
      >
    >
  >
>
resources: <
  [type.googleapis.com/envoy.api.v2.ClusterLoadAssignment]: <
    cluster_name: "admin/kuard"
    endpoints: <
      lb_endpoints: <
        endpoint: <
          address: <
            socket_address: <
              address: "100.96.1.205"
              port_value: 8080
            >
          >
        >
      >
      lb_endpoints: <
        endpoint: <
          address: <
            socket_address: <
              address: "100.96.3.136"
              port_value: 8080
            >
          >
        >
      >
    >
  >
>
resources: <
  [type.googleapis.com/envoy.api.v2.ClusterLoadAssignment]: <
    cluster_name: "admin/kuard"
    endpoints: <
      lb_endpoints: <
        endpoint: <
          address: <
            socket_address: <
              address: "100.96.1.205"
              port_value: 8080
            >
          >
        >
      >
      lb_endpoints: <
        endpoint: <
          address: <
            socket_address: <
              address: "100.96.3.136"
              port_value: 8080
            >
          >
        >
      >
      lb_endpoints: <
        endpoint: <
          address: <
            socket_address: <
              address: "100.96.5.39"
              port_value: 8080
            >
          >
        >
      >
    >
  >
>
$ curl -s localhost:9001/clusters | grep kuard
admin/kuard/80::default_priority::max_connections::1024
admin/kuard/80::default_priority::max_pending_requests::1024
admin/kuard/80::default_priority::max_requests::1024
admin/kuard/80::default_priority::max_retries::3
admin/kuard/80::high_priority::max_connections::1024
admin/kuard/80::high_priority::max_pending_requests::1024
admin/kuard/80::high_priority::max_requests::1024
admin/kuard/80::high_priority::max_retries::3
admin/kuard/80::added_via_api::true
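For anyone reproducing this comparison, a small helper can pull out just the endpoint members Envoy actually holds for one cluster. This is only a sketch; the function name is made up here and it assumes the `::`-delimited /clusters text format shown in the dumps above.

```shell
# Hypothetical helper: read Envoy admin /clusters text on stdin and print
# the unique IP:port endpoint members recorded for the named cluster.
# Stats rows look like:  admin/kuard/80::100.96.1.203:8080::cx_active::0
endpoints_for() {
  awk -F'::' -v cluster="$1" \
    '$1 == cluster && $2 ~ /^[0-9.]+:[0-9]+$/ { print $2 }' | sort -u
}

# Example usage (against the admin port used earlier in this thread):
#   curl -s localhost:9001/clusters | endpoints_for admin/kuard/80
```

An empty result while contour cli eds still lists endpoints reproduces the mismatch described in this comment.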

Lookyan pushed a commit to Lookyan/contour that referenced this issue Mar 30, 2018
Updates projectcontour#293

This PR moves `cmd/contourcli` into the main `cmd/contour` binary so
that it can be used via kubectl exec.

Signed-off-by: Dave Cheney <dave@cheney.net>
@davecheney added this to the 0.4.1 milestone Apr 3, 2018
@davecheney self-assigned this Apr 3, 2018
@davecheney (Contributor)

Thanks to Alexander Lukyanchenko (@Lookyan) we have increased the general gRPC limits on both the Envoy client and Contour server well above anything that should be an issue for the immediate future.

The symptoms of hitting gRPC limits vary, but they boil down to "Envoy doesn't see changes in the API server until I restart it". The underlying cause is likely a large number of Service objects in your cluster (more than 100, possibly 200; the exact limit is not precisely known) -- these don't have to be associated with an Ingress. Currently Contour creates a CDS Cluster record for every Service object it learns about through the API, see #298. Each CDS record causes Envoy to open a new EDS stream, one per Cluster, which can blow through the default limits that Envoy, as the gRPC client, and Contour, as the gRPC server, have set.
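As a rough back-of-the-envelope check, you could compare your Service count against that stream budget. A sketch only: the function name is made up, and the 100-stream figure is an assumption based on the common HTTP/2 per-connection default, not a limit confirmed in this thread.

```shell
# Hypothetical sketch: compare a Service count against an assumed
# per-connection concurrent-stream limit (commonly 100 for HTTP/2).
check_stream_budget() {
  count=$1
  limit=${2:-100}
  if [ "$count" -ge "$limit" ]; then
    echo "at risk: $count Services >= assumed stream limit of $limit"
  else
    echo "ok: $count Services < assumed stream limit of $limit"
  fi
}

# Example usage:
#   check_stream_budget "$(kubectl get svc --all-namespaces --no-headers | wc -l)"
```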

One of the easiest ways to detect whether this issue is occurring in your cluster is to look for "cluster warming" lines:

[2018-04-03 03:34:16.920][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:388] add/update cluster test2/reverent-noether/80 starting warming
[2018-04-03 03:34:16.922][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:388] add/update cluster test2/serene-bohr/80 starting warming
[2018-04-03 03:34:16.924][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:388] add/update cluster test2/sleepy-hugle/80 starting warming

without a matching "warming complete" message.
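That check can be scripted against a saved copy of the Envoy log. This is a sketch under assumptions: the helper name is invented, and the completion message format ("warming cluster ... complete") is inferred rather than confirmed above.

```shell
# Hypothetical helper: print cluster names that logged "starting warming"
# but have no matching completion line. Message formats are assumptions
# based on the log snippet above.
warming_stuck() {
  comm -23 \
    <(grep -o 'add/update cluster [^ ]* starting warming' "$1" \
        | awk '{print $3}' | sort -u) \
    <(grep -o 'warming cluster [^ ]* complete' "$1" \
        | awk '{print $3}' | sort -u)
}

# Example usage:
#   kubectl -n heptio-contour logs contour-gv9gk -c envoy > envoy.log
#   warming_stuck envoy.log
```

Any cluster name it prints started warming and never finished, which matches the stuck state described here.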

We believe we have addressed this issue in #291, and the fixes are now available to test.

These changes are in master now, and available in the gcr.io/heptio-images/contour:master image for you to try.

This has been backported to the release-0.4 branch and is available in a short-lived image, gcr.io/heptio-images/contour:release-0.4 (the image won't be deleted, but don't expect it to be updated beyond the 0.4.1 release).

@davecheney added the kind/bug and priority/important-soon labels Apr 3, 2018
@drobinson123 (Author)

The problem appears to have been fixed, at least in the limited testing I've done. Thanks @Lookyan & @davecheney !

@davecheney (Contributor)

Thanks for confirming.

sunjayBhatia pushed a commit that referenced this issue Jan 30, 2023
…293)

operator does not create envoy service when envoy
is ClusterIPService type and gatewayClassRef.

This patch fixes it.

Signed-off-by: Kenjiro Nakayama <nakayamakenjiro@gmail.com>