Can't achieve graceful shutdown of pods using Ambassador+ConsulConnect #3264

Closed
etotten opened this issue Mar 9, 2021 · 3 comments

etotten commented Mar 9, 2021

Describe the bug
I tried various ways of getting to the point of full gracefulness when either:

a) doing a rolling deploy (and thus pods come down 1 at a time)
b) doing a scale down (and thus a bunch of pods come down simultaneously)

...but I have not found a way to make it so that 0 requests get dropped when pods terminate. On the flip side, for pod launch (e.g. scale up), we do better - some combination of the health checks and perhaps delays are making that operation work fairly gracefully in my testing. Termination is the problem.

I am looking for any suggestions; any help would be hugely appreciated.

To Reproduce
This is kind of involved, but here's the setup:

  • have a test service named brts running in a Consul Connect cluster
    • brts has an envoy sidecar-proxy container running in its pod that is created by a connect-injector
    • so the sidecar is registered as a consul service
    • the brts pods are not associated with a k8s service
  • have Ambassador 1.12.0 running in the k8s cluster via helm chart v6.5.2
    • using a ConsulResolver pointing at Host ${HOST_IP}:8500 (see the sketch after this list)
    • launching an AWS NLB via the Ambassador helm chart
    • running a single Ambassador pod for the moment to make visibility easier
    • we do run separate internal and external ambassador instances, so I'm just testing ambassador-internal here
  • have a SlowCooker (buoyant) container running outside the k8s cluster sending traffic through the NLB (and thus to Ambassador->mesh->pods)
    • for the moment, running a concurrency of 10 with 20 qps each (so 200 qps total)
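
For reference, the ConsulResolver in play looks roughly like the sketch below (the resource name and datacenter are assumptions for illustration; the address matches the ${HOST_IP}:8500 value described above):

apiVersion: getambassador.io/v2
kind: ConsulResolver
metadata:
  name: consul-dc1            # illustrative name
spec:
  address: "${HOST_IP}:8500"  # ${HOST_IP} is expected to be interpolated at runtime
  datacenter: dc1             # illustrative datacenter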

Results:

  • when running brts with 2 pods and doing a rolling deploy that replaces them one at a time, I am seeing 10-15% of requests dropped over a 10-second period for each pod; this cleans up some seconds afterwards.
  • when doing a scale down from 10 pods to 2 pods, I see 33%-50% of requests get dropped over a 10-second period (or longer)

Expected behavior
To not have errors (HTTP 503's) for requests during the termination of a pod running in Consul Connect

Versions (please complete the following information):

  • testing AES 1.12.0 with legacy mode on (we have been using 1.4.1 for a while) - tried 1.11.2 (but had similar problems)
  • testing Consul 1.9.3 (will be upgrading from 1.7.3 so that we can get the k8s health-check syncing feature)
  • k8s is AWS EKS, version 1.18

Additional context
What Kinds of Things Have I Tried:

  • adding a preStop hook on the main brts container of 3, 5, and 10 seconds
  • altering the existing preStop hook on the envoy proxy container (which requires hacking consul-k8s, building, and pushing up a container) to have a "&& sleep X" of 3, 5, or 10 seconds appended to the end of the service deregister call (see the sketch after this list)
    • Here I was thinking maybe the deregistration of the envoy sidecar was making requests 503 if Ambassador was still sending them to the pod
  • having one or both of the above preStop hooks in place
  • checking the Ambassador logs for any mention of recognizing that something has changed
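
For the second item above, the modified sidecar preStop hook looked conceptually like the sketch below (the deregister command and file paths are assumptions meant to convey the shape of the change, not copied from consul-k8s):

lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -ec
        # deregister the sidecar from Consul, then keep the pod around briefly
        - >-
          /consul/connect-inject/consul services deregister
          /consul/connect-inject/service.hcl && sleep 5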

I suspect that AES is not keeping up with the mesh changes. In looking through the AES debug logs, I find lines like this:

[2021-03-09 01:43:31.212][225][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1069] membership update for TLS cluster cluster_bedrock_brts_sidecar_proxy_otls_-0 added 1 removed 0

...and if I tail | grep for those while the deploy or scale down is happening, I see that these logs can come many seconds later. I can't say that the 503's stop happening right when these logs come (which would be evidence), but I'm of course grasping at straws a bit.

WHAT ABOUT AMBASSADOR_FAST_RECONFIGURE?

Logs of non-functional AES 1.12.0 w/ AMBASSADOR_FAST_RECONFIGURE: true:

time="2021-03-09 19:32:04" level=info msg="Started Ambassador" func=github.com/datawire/ambassador/cmd/entrypoint.Main file="github.com/datawire/ambassador/cmd/entrypoint/entrypoint.go:91" CMD=entrypoint PID=1
time="2021-03-09 19:32:04" level=info msg="AMBASSADOR_CLUSTER_ID=594b244d-c537-5190-9bcf-2a82b00c0b78" func=github.com/datawire/ambassador/cmd/entrypoint.Main file="github.com/datawire/ambassador/cmd/entrypoint/entrypoint.go:105" CMD=entrypoint PID=1
time="2021-03-09 19:32:04" level=info msg="[pid:13] started command []string{\"/ambassador/sidecars/amb-sidecar\"}" func="github.com/datawire/dlib/dexec.(*Cmd).Start" file="github.com/datawire/dlib@v1.2.0/dexec/cmd.go:143" CMD=entrypoint PID=1 THREAD=/amb-sidecar
time="2021-03-09 19:32:04" level=info msg="[pid:13] stdin  < not logging input read from file /dev/stdin" func="github.com/datawire/dlib/dexec.(*Cmd).Start" file="github.com/datawire/dlib@v1.2.0/dexec/cmd.go:145" CMD=entrypoint PID=1 THREAD=/amb-sidecar
time="2021-03-09 19:32:04" level=info msg="[pid:13] stdout > not logging output written to file /dev/stdout" func="github.com/datawire/dlib/dexec.(*Cmd).Start" file="github.com/datawire/dlib@v1.2.0/dexec/cmd.go:148" CMD=entrypoint PID=1 THREAD=/amb-sidecar
time="2021-03-09 19:32:04" level=info msg="[pid:13] stderr > not logging output written to file /dev/stderr" func="github.com/datawire/dlib/dexec.(*Cmd).Start" file="github.com/datawire/dlib@v1.2.0/dexec/cmd.go:151" CMD=entrypoint PID=1 THREAD=/amb-sidecar
time="2021-03-09 19:32:04" level=info msg="[pid:14] started command []string{\"diagd\", \"/ambassador/snapshots\", \"/ambassador/bootstrap-ads.json\", \"/ambassador/envoy/envoy.json\", \"--notices\", \"/ambassador/notices.json\", \"--port\", \"8004\", \"--kick\", \"kill -HUP 1\"}" func="github.com/datawire/dlib/dexec.(*Cmd).Start" file="github.com/datawire/dlib@v1.2.0/dexec/cmd.go:143" CMD=entrypoint PID=1 THREAD=/diagd
time="2021-03-09 19:32:04" level=info msg="[pid:14] stdin  < not logging input read from file /dev/stdin" func="github.com/datawire/dlib/dexec.(*Cmd).Start" file="github.com/datawire/dlib@v1.2.0/dexec/cmd.go:145" CMD=entrypoint PID=1 THREAD=/diagd
time="2021-03-09 19:32:04" level=info msg="[pid:14] stdout > not logging output written to file /dev/stdout" func="github.com/datawire/dlib/dexec.(*Cmd).Start" file="github.com/datawire/dlib@v1.2.0/dexec/cmd.go:148" CMD=entrypoint PID=1 THREAD=/diagd
time="2021-03-09 19:32:04" level=info msg="[pid:14] stderr > not logging output written to file /dev/stderr" func="github.com/datawire/dlib/dexec.(*Cmd).Start" file="github.com/datawire/dlib@v1.2.0/dexec/cmd.go:151" CMD=entrypoint PID=1 THREAD=/diagd
time="2021-03-09 19:32:05" level=info msg="Ambassador Edge Stack configuration loaded" func=github.com/datawire/apro/cmd/amb-sidecar.runE file="github.com/datawire/apro/cmd/amb-sidecar/main.go:123" CMD=amb-sidecar PID=13
time="2021-03-09T19:32:05Z" level=warning msg="statsd is not in use"
time="2021-03-09 19:32:05" level=error msg="Failed to create watch on /tmp/ambassador/.config/ambassador/: Changes might require a restart: no such file or directory" func=github.com/datawire/apro/cmd/amb-sidecar.triggerOnChange file="github.com/datawire/apro/cmd/amb-sidecar/files.go:56" CMD=amb-sidecar PID=13 THREAD=/license_refresh
time="2021-03-09 19:32:05" level=error msg="0 filters configured" func="github.com/datawire/apro/cmd/amb-sidecar/filters/controller.(*Controller).Watch.func1" file="github.com/datawire/apro/cmd/amb-sidecar/filters/controller/controller.go:150" CMD=amb-sidecar PID=13 THREAD=/auth_controller
2021/03/09 19:32:06 http: proxy error: dial tcp 127.0.0.1:8004: connect: connection refused
time="2021-03-09 19:32:06" level=error msg="Bad HTTP request: status_code=502" func="github.com/datawire/apro/cmd/amb-sidecar/devportal/server.(*DevPortalHTTPClient).Get.func1" file="github.com/datawire/apro/cmd/amb-sidecar/devportal/server/http.go:70" CMD=amb-sidecar PID=13 component=devportal mhost= url="http://127.0.0.1:8877/ambassador/v0/diag/?json=true"
time="2021-03-09 19:32:06" level=error msg="HTTP error 502 from http://127.0.0.1:8877/ambassador/v0/diag/?json=true" func="github.com/datawire/apro/cmd/amb-sidecar/devportal/server.(*DevPortalHTTPClient).Get" file="github.com/datawire/apro/cmd/amb-sidecar/devportal/server/http.go:77" CMD=amb-sidecar PID=13 component=devportal mhost= url="http://127.0.0.1:8877/ambassador/v0/diag/?json=true"
I0309 19:32:06.237957       1 request.go:621] Throttling request took 1.012909724s, request: GET:https://172.20.0.1:443/apis/getambassador.io/v2/devportals?limit=500&resourceVersion=0
2021-03-09 19:32:07 diagd 1.12.0 [P14TMainThread] INFO: AMBASSADOR_FAST_RECONFIGURE enabled, initializing cache
2021-03-09 19:32:07 diagd 1.12.0 [P14TMainThread] INFO: WILL NOT update Mapping status
2021-03-09 19:32:07 diagd 1.12.0 [P14TMainThread] INFO: thread count 5, listening on 127.0.0.1:8004
time="2021-03-09 19:32:07" level=info msg="[ERR] consul.watch: Watch (type: service) errored: Get \"http://$%7BHOST_IP%7D:8500/v1/health/service/bedrock_brts-sidecar-proxy?passing=1\": dial tcp: lookup ${HOST_IP}: no such host, retry in 5s" CMD=entrypoint PID=1 oops-i-did-not-pass-context-correctly=true
time="2021-03-09 19:32:07" level=info msg="[ERR] consul.watch: Watch (type: service) errored: Get \"http://$%7BHOST_IP%7D:8500/v1/health/service/bedrock_flask-starship-sidecar-proxy?passing=1\": dial tcp: lookup ${HOST_IP}: no such host, retry in 5s" CMD=entrypoint PID=1 oops-i-did-not-pass-context-correctly=true
time="2021-03-09 19:32:07" level=info msg="[ERR] consul.watch: Watch (type: service) errored: Get \"http://$%7BHOST_IP%7D:8500/v1/health/service/bedrock_flask-voyager-tottene-sidecar-proxy?passing=1\": dial tcp: lookup ${HOST_IP}: no such host, retry in 5s" CMD=entrypoint PID=1 oops-i-did-not-pass-context-correctly=true
time="2021-03-09 19:32:07" level=info msg="[ERR] consul.watch: Watch (type: service) errored: Get \"http://$%7BHOST_IP%7D:8500/v1/health/service/sis_flask-starship-sidecar-proxy?passing=1\": dial tcp: lookup ${HOST_IP}: no such host, retry in 5s" CMD=entrypoint PID=1 oops-i-did-not-pass-context-correctly=true
time="2021-03-09 19:32:07" level=info msg="[ERR] consul.watch: Watch (type: service) errored: Get \"http://$%7BHOST_IP%7D:8500/v1/health/service/brss-sidecar-proxy?passing=1\": dial tcp: lookup ${HOST_IP}: no such host, retry in 5s" CMD=entrypoint PID=1 oops-i-did-not-pass-context-correctly=true
2021-03-09 19:32:07 diagd 1.12.0 [P14TMainThread] INFO: Ambassador 1.12.0 booted
[2021-03-09 19:32:07 +0000] [14] [INFO] Starting gunicorn 20.0.4
[2021-03-09 19:32:07 +0000] [14] [INFO] Listening at: http://127.0.0.1:8004 (14)
[2021-03-09 19:32:07 +0000] [14] [INFO] Using worker: threads
[2021-03-09 19:32:07 +0000] [28] [INFO] Booting worker with pid: 28
2021-03-09 19:32:07 diagd 1.12.0 [P28TAEW] INFO: starting Scout checker and timer logger
2021-03-09 19:32:07 diagd 1.12.0 [P28TAEW] INFO: starting event watcher
time="2021-03-09 19:32:12" level=info msg="[ERR] consul.watch: Watch (type: service) errored: Get \"http://$%7BHOST_IP%7D:8500/v1/health/service/bedrock_flask-starship-sidecar-proxy?passing=1\": dial tcp: lookup ${HOST_IP}: no such host, retry in 20s" CMD=entrypoint PID=1 oops-i-did-not-pass-context-correctly=true
time="2021-03-09 19:32:12" level=info msg="[ERR] consul.watch: Watch (type: service) errored: Get \"http://$%7BHOST_IP%7D:8500/v1/health/service/bedrock_flask-voyager-tottene-sidecar-proxy?passing=1\": dial tcp: lookup ${HOST_IP}: no such host, retry in 20s" CMD=entrypoint PID=1 oops-i-did-not-pass-context-correctly=true
time="2021-03-09 19:32:12" level=info msg="[ERR] consul.watch: Watch (type: service) errored: Get \"http://$%7BHOST_IP%7D:8500/v1/health/service/sis_flask-starship-sidecar-proxy?passing=1\": dial tcp: lookup ${HOST_IP}: no such host, retry in 20s" CMD=entrypoint PID=1 oops-i-did-not-pass-context-correctly=true
time="2021-03-09 19:32:12" level=info msg="[ERR] consul.watch: Watch (type: service) errored: Get \"http://$%7BHOST_IP%7D:8500/v1/health/service/bedrock_brts-sidecar-proxy?passing=1\": dial tcp: lookup ${HOST_IP}: no such host, retry in 20s" CMD=entrypoint PID=1 oops-i-did-not-pass-context-correctly=true
time="2021-03-09 19:32:12" level=info msg="[ERR] consul.watch: Watch (type: service) errored: Get \"http://$%7BHOST_IP%7D:8500/v1/health/service/brss-sidecar-proxy?passing=1\": dial tcp: lookup ${HOST_IP}: no such host, retry in 20s" CMD=entrypoint PID=1 oops-i-did-not-pass-context-correctly=true

etotten commented Apr 19, 2021

Update on this:

We wanted to see whether the AMBASSADOR_FAST_RECONFIGURE variable would help us if we could get it to run. So, for testing purposes only, we managed to hack around the error mentioned above

Get \"http://$%7BHOST_IP%7D:8500/v1/health/service/bedrock_brts-sidecar-proxy?passing=1\": dial tcp: lookup ${HOST_IP}: no such host, retry in 5s" CMD=entrypoint PID=1 oops-i-did-not-pass-context-correctly=true

...by setting a specific IP address in the ConsulResolver k8s resource instead of ${HOST_IP} (sketched below). This worked, and we were able to disable AMBASSADOR_LEGACY_MODE and set AMBASSADOR_FAST_RECONFIGURE: true. This is of course not a viable work-around, but at least it allows us to try it out in a test environment with a single replica of Ambassador.
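
Concretely, the test-only hack amounted to something like this (the IP address is a placeholder, not a real value from our environment):

apiVersion: getambassador.io/v2
kind: ConsulResolver
metadata:
  name: consul-dc1
spec:
  address: "10.0.12.34:8500"   # hardcoded node IP instead of ${HOST_IP}
  datacenter: dc1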

In that test environment, with AMBASSADOR_FAST_RECONFIGURE enabled and a preStop handler on the Deployment for our service under test:

        lifecycle:
          preStop:
            exec:
              command: ["/bin/sleep","5"]

...we saw that AMBASSADOR_FAST_RECONFIGURE does indeed decrease the number of 503's that happen when pods terminate. In a lot of cases there are zero 503's, but sometimes 1 or 2 occur when pods go down.

So that's a significant improvement at least. I experimented with a retry_policy for "5xx" on Mappings that use the ConsulResolver (sketched below), but didn't see that push the failures to zero; more testing is needed in that area.
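
For reference, the retry experiment was along these lines (the Mapping name, prefix, and TLS context below are illustrative, not taken from our actual config):

apiVersion: getambassador.io/v2
kind: Mapping
metadata:
  name: brts
spec:
  prefix: /brts/
  service: bedrock_brts-sidecar-proxy
  resolver: consul-dc1       # the ConsulResolver described earlier
  tls: ambassador-consul     # TLS context holding the Connect client certificate
  retry_policy:
    retry_on: "5xx"
    num_retries: 2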

I see that #3182 is merged and slotted for 1.13, so I will close this once 1.13 is out and confirmed to work with the Consul Resolver.


etotten commented Aug 9, 2021

Update:

  • In the previous comment, I mentioned that AMBASSADOR_FAST_RECONFIGURE helped us; that turned out to be false. The better results in testing were likely just coincidental.
  • However, we have identified factors that significantly improve the gracefulness of Ambassador request handling during pod termination:
    • Ensure that Ambassador's memory stays low
    • Ensure there's enough CPU on the envoy-sidecar containers of your app's pods
      • this is specific to using Consul Connect. We previously had a CPU limit of only 20m for these sidecars, which caused the sidecar to take 5-6 seconds to trigger a deregister from Consul Connect after being terminated
      • there was a lot of request latency and instability during this period
      • with 100m for the envoy sidecars, we see ~1 sec for the pods to trigger a deregister in Consul (see the resource sketch after this list)
    • Don't use AMBASSADOR_LEGACY_MODE
      • legacy mode dramatically increases the time it takes Ambassador to recognize membership changes after the target app's envoy-sidecar deregisters from Consul Connect
      • Specifically, with legacy mode it was about 5 seconds before the "membership update for TLS cluster" trace log would come through, whereas with AMBASSADOR_LEGACY_MODE=false the membership updates would happen within 100ms of the Consul deregister, and often within 50ms
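
To make the CPU point concrete, the Envoy sidecars ended up with resources on the order of the sketch below (exact values are illustrative; with consul-k8s these are normally controlled through the injector's configuration rather than edited by hand):

resources:
  requests:
    cpu: 100m      # 20m was far too little; ~100m made deregistration near-immediate
    memory: 50Mi   # illustrative value
  limits:
    cpu: 100m
    memory: 50Mi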


etotten commented Aug 10, 2021

With the learnings posted above, the problems mentioned directly in this issue are mitigated. This can now be closed.

There are further issues that we see creating instability in our upstreams, but those are handled more specifically in other issues.
