
emissary fails to recover after SIGTERM - possibly due to frequent reconfig #5725

Closed
fs185143 opened this issue Jul 19, 2024 · 11 comments
Labels: t:bug Something isn't working

@fs185143

fs185143 commented Jul 19, 2024

Describe the bug

  • emissary-ingress is running in a fairly demanding environment with many resources/mappings/services, etc.
  • We start seeing warnings such as:
"Memory Usage: throttling reconfig v2854 due to constrained memory with 112 stale reconfigs (60 max)"
...
"Memory Usage: throttling reconfig v3477 due to constrained memory with 123 stale reconfigs (120 max)"
  • Modest memory usage
"Memory Usage 0.32Gi (36%)\n    PID 1, 0.14Gi: busyambassador entrypoint \n    PID 22, 0.04Gi: /usr/bin/python /usr/bin/diagd /ambassador/snapshots /ambassador/bootstrap-ads.json /ambassador/envoy/envoy.json --notices /ambassador/notices.json --port 8004 --kick kill -HUP 1 \n    PID 23, 0.16Gi: /usr/bin/python /usr/bin/diagd /ambassador/snapshots /ambassador/bootstrap-ads.json /ambassador/envoy/envoy.json --notices /ambassador/notices.json --port 8004 --kick kill -HUP 1 \n    PID 34, 0.09Gi: envoy -c /ambassador/bootstrap-ads.json --base-id 0 --drain-time-s 600 -l error \n    PID 38405, 0.05Gi (exited): envoy --service-node test-id --service-cluster ambassador-emissary --config-path /ambassador/snapshots/econf-tmp.json --mode validate \n    PID 38423, 0.05Gi (exited): envoy --service-node test-id --service-cluster ambassador-emissary --config-path /ambassador/snapshots/econf-tmp.json --mode validate \n    PID 38441, 0.05Gi (exited): envoy --service-node test-id --service-cluster ambassador-emissary --config-path /ambassador/snapshots/econf-tmp.json --mode validate \n    PID 38471, 0.08Gi (exited): envoy --service-node test-id --service-cluster ambassador-emissary --config-path /ambassador/snapshots/econf-tmp.json --mode validate "
  • ambassador container receives SIGTERM
"time="2024-07-18 11:54:53.2387" level=error msg="goroutine \":signal_handler:0\" exited with error: received signal terminated (triggering graceful shutdown)" func="github.com/datawire/dlib/dgroup.(*Group).goWorkerCtx.func1.1" file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:380" CMD=entrypoint PID=1 THREAD=":signal_handler:0""
"shutting down (gracefully)...
...
"    :signal_handler:0        : exited with error"
"time="2024-07-18 11:54:55.1160" level=error msg="shut down with error error: received signal terminated (triggering graceful shutdown)" func=github.com/emissary-ingress/emissary/v3/pkg/busy.Main file="/go/pkg/busy/busy.go:87" CMD=entrypoint PID=1"
"time="2024-07-18 22:08:49.9497" level=error msg="goroutine \":signal_handler:0\" exited with error: received signal terminated (triggering graceful shutdown)" func="github.com/datawire/dlib/dgroup.(*Group).goWorkerCtx.func1.1" file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:380" CMD=entrypoint PID=1 THREAD=":signal_handler:0""
  • A new emissary-ingress pod is created and jumps right into
time="2024-07-19 05:44:00.3312" level=info msg="Memory Usage 0.35Gi (39%)\n    PID 1, 0.16Gi: busyambassador entrypoint \n    PID 25, 0.04Gi: /usr/bin/python /usr/bin/diagd /ambassador/snapshots /ambassador/bootstrap-ads.json /ambassador/envoy/envoy.json --notices /ambassador/notices.json --port 8004 --kick kill -HUP 1 \n    PID 26, 0.16Gi: /usr/bin/python /usr/bin/diagd /ambassador/snapshots /ambassador/bootstrap-ads.json /ambassador/envoy/envoy.json --notices /ambassador/notices.json --port 8004 --kick kill -HUP 1 \n    PID 37, 0.10Gi: envoy -c /ambassador/bootstrap-ads.json --base-id 0 --drain-time-s 600 -l error \n    PID 31820, 0.09Gi (exited): envoy --service-node test-id --service-cluster ambassador-emissary --config-path /ambassador/snapshots/econf-tmp.json --mode validate \n    PID 31850, 0.05Gi (exited): envoy --service-node test-id --service-cluster ambassador-emissary --config-path /ambassador/snapshots/econf-tmp.json --mode validate \n    PID 31892, 0.04Gi (exited): envoy --service-node test-id --service-cluster ambassador-emissary --config-path /ambassador/snapshots/econf-tmp.json --mode validate \n    PID 31946, 0.08Gi (exited): envoy --service-node test-id --service-cluster ambassador-emissary --config-path /ambassador/snapshots/econf-tmp.json --mode validate " func="github.com/emissary-ingress/emissary/v3/pkg/memory.(*MemoryUsage).Watch.func1" file="/go/pkg/memory/memory.go:39" CMD=entrypoint PID=1 THREAD=/memory
2024-07-19 05:44:22 diagd 3.9.1 [P26TAEW] INFO: EnvoyConfig: Generating V3
2024-07-19 05:44:22 diagd 3.9.1 [P26TAEW] INFO: V3Listener <V3Listener HTTP emissary-listener on 0.0.0.0:8888 [XFP]>: generated ===========================
2024-07-19 05:44:22 diagd 3.9.1 [P26TAEW] INFO: V3Ready: ==== listen on 127.0.0.1:8006
2024-07-19 05:44:23 diagd 3.9.1 [P26TThreadPoolExecutor-0_1] INFO: 41A43E42-219C-4FBE-9C8B-9128C3123F01: 127.0.0.1 "GET /metrics" 658ms 200 success
2024-07-19 05:44:24 diagd 3.9.1 [P26TAEW] INFO: configuration updated (incremental) from snapshot snapshot (S33 L1 G311 C311)
time="2024-07-19 05:44:24.4078" level=info msg="Loaded file /ambassador/envoy/envoy.json" func=github.com/emissary-ingress/emissary/v3/pkg/ambex.Decode file="/go/pkg/ambex/main.go:281" CMD=entrypoint PID=1 THREAD=/ambex/main-loop
time="2024-07-19 05:44:24.5576" level=info msg="Saved snapshot v5318" func=github.com/emissary-ingress/emissary/v3/pkg/ambex.csDump file="/go/pkg/ambex/main.go:351" CMD=entrypoint PID=1 THREAD=/ambex/main-loop
time="2024-07-19 05:44:24.5660" level=info msg="Pushing snapshot v5318" func=github.com/emissary-ingress/emissary/v3/pkg/ambex.updaterWithTicker file="/go/pkg/ambex/ratelimit.go:159" CMD=entrypoint PID=1 THREAD=/ambex/updater
...
  • Requests no longer work; e.g., they give errors such as the following, where previously they would return 30x or 200:
ACCESS [2024-07-19T05:44:22.929Z] "GET /grafana/metrics HTTP/1.1" 503 UH 0 4172 16 - "<ip>,<ip>,<ip>" "Prometheus/2.35.0-gmp.4" "<uuid>" "<uuid>.<domain>" "-"

ACCESS [2024-07-19T09:59:29.493Z] "GET <path> HTTP/1.1" 403 UAEX 0 3984 0 - "<ip>, <ip>,<ip>, <ip>, <ip>,<ip>" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36 Edg/126.0.0.0" <uuid>" "<uuid>.<domain>" "-"

To Reproduce
We can work on providing a more concrete example if needed, but we observe this after around 12-24h on a cluster as described above.

Expected behaviour
Emissary should be able to recover after being replaced. We have observed the SIGTERM due to a promotion, which is expected, but also SIGKILL due to memory usage; the latter seems to have been fixed by increasing our resource limits.

Versions (please complete the following information):

  • Ambassador: 3.9.1
  • Kubernetes environment: GKE
  • Version: v1.27.14-gke.1042001

Additional context
Possibly related:

Some relevant configuration options:

spec:
  template:
    spec:
      terminationGracePeriodSeconds: 90
      containers:
      - name: ambassador
        lifecycle:
          preStop:
            exec:
              command:
              - sleep 60
        livenessProbe:
          failureThreshold: 6
          httpGet:
            path: /ambassador/v0/check_alive
            port: admin
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 6
          successThreshold: 1
          timeoutSeconds: 1
        readinessProbe:
          failureThreshold: 6
          httpGet:
            path: /ambassador/v0/check_ready
            port: admin
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 6
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: "1"
            memory: 900Mi
          requests:
            cpu: 200m
            memory: 600Mi
@dosubot dosubot bot added the t:bug Something isn't working label Jul 19, 2024
@kflynn
Member

kflynn commented Jul 22, 2024

@fs185143, just to confirm, is this with Emissary or Ambassador Edge Stack? (It's a bug either way, just want to make sure of what I'm looking at. 🙂)

@cindymullins-dw
Contributor

Failing to recover after the SIGTERM sounds like a bug. But you can take some measures to address large reconfigurations or high CPU/memory usage with some of these specs. You can also try reducing AMBASSADOR_AMBEX_SNAPSHOT_COUNT.

A common reason for high CPU and memory is a high number of automated snapshots of your configuration. Snapshots are only stored until the limit of AMBASSADOR_AMBEX_SNAPSHOT_COUNT is met.

@fs185143
Author

fs185143 commented Jul 23, 2024

Failing to recover after the SIGTERM sounds like a bug. But you can take some measures to address large reconfigurations or high CPU/memory usage with some of these specs. You can also try reducing AMBASSADOR_AMBEX_SNAPSHOT_COUNT.

A common reason for high CPU and memory is a high number of automated snapshots of your configuration. Snapshots are only stored until the limit of AMBASSADOR_AMBEX_SNAPSHOT_COUNT is met.

Thanks - we will try this on our emissary/emissary-ingress deployment:

spec:
  template:
    spec:
      containers:
      - name: ambassador
        env:
        - name: AMBASSADOR_AMBEX_SNAPSHOT_COUNT
          value: "5" # default 30
        - name: AMBASSADOR_FAST_RECONFIGURE
          value: "false"
        - name: AMBASSADOR_DRAIN_TIME
          value: "300" # default 600

@fs185143
Author

@fs185143, just to confirm, is this with Emissary or Ambassador Edge Stack? (It's a bug either way, just want to make sure of what I'm looking at. 🙂)

I believe it comes under Emissary, as we are not using the enterprise product (Ambassador Edge Stack).

@fs185143
Author

fs185143 commented Aug 6, 2024

We are now seeing these ambassador logs

Memory Usage 0.85Gi (96%)\n    PID 1, 0.14Gi: busyambassador entrypoint \n    PID 16, 0.04Gi: /usr/bin/python /usr/bin/diagd /ambassador/snapshots /ambassador/bootstrap-ads.json /ambassador/envoy/envoy.json --notices /ambassador/notices.json --port 8004 --kick kill -HUP 1 \n    PID 18, 0.34Gi: /usr/bin/python /usr/bin/diagd /ambassador/snapshots /ambassador/bootstrap-ads.json /ambassador/envoy/envoy.json --notices /ambassador/notices.json --port 8004 --kick kill -HUP 1 \n    PID 29, 0.11Gi: envoy -c /ambassador/bootstrap-ads.json --base-id 0 --drain-time-s 300 -l error \n    PID 246459, 0.10Gi (exited): envoy --service-node test-id --service-cluster ambassador-emissary --config-path /ambassador/snapshots/econf-tmp.json --mode validate \n    PID 246489, 0.08Gi (exited): envoy --service-node test-id --service-cluster ambassador-emissary --config-path /ambassador/snapshots/econf-tmp.json --mode validate \n    PID 246525, 0.09Gi: envoy --service-node test-id --service-cluster ambassador-emissary --config-path /ambassador/snapshots/econf-tmp.json --mode validate
...
time="2024-08-04 13:55:43.4550" level=warning msg="Memory Usage: throttling reconfig v41082 due to constrained memory with 15 stale reconfigs (1 max)" func=github.com/emissary-ingress/emissary/v3/pkg/ambex.updaterWithTicker file="/go/pkg/ambex/ratelimit.go:140" CMD=entrypoint PID=1 THREAD=/ambex/updater

This caused the emissary-ingress pod to restart after running for 3 days, and post-restart it only gives 403 errors.

This is with the following config

spec:
  template:
    spec:
      terminationGracePeriodSeconds: 90
      containers:
      - name: ambassador
        env:
        - name: AMBASSADOR_AMBEX_SNAPSHOT_COUNT
          value: "5"
        - name: AMBASSADOR_FAST_RECONFIGURE
          value: "false"
        - name: AMBASSADOR_DRAIN_TIME
          value: "300"
        - name: SCOUT_DISABLE
          value: "1"
        lifecycle:
          preStop:
            exec:
              command:
              - sleep 60
        livenessProbe:
          failureThreshold: 6
          httpGet:
            path: /ambassador/v0/check_alive
            port: admin
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 6
          successThreshold: 1
          timeoutSeconds: 1
        readinessProbe:
          failureThreshold: 6
          httpGet:
            path: /ambassador/v0/check_ready
            port: admin
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 6
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: "1"
            memory: 900Mi
          requests:
            cpu: 200m
            memory: 600Mi

Would appreciate any advice/support - thanks.

I am wondering if we might just need to increase the memory limit, but I am also not sure why it returns the ACCESS 403 errors after it restarts.

@fs185143
Author

fs185143 commented Aug 6, 2024

Going to try the following (see the sketch below):

  • Lower AMBASSADOR_AMBEX_SNAPSHOT_COUNT from 5 to 0
  • Re-enable AMBASSADOR_FAST_RECONFIGURE
  • Increase AMBASSADOR_RECONFIG_MAX_DELAY from 1 to 3
  • Increase memory limit from 900Mi to 2000Mi
  • Increase memory request from 600Mi to 1000Mi

Not sure if it's worth changing AMBASSADOR_SNAPSHOT_COUNT?
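
For reference, roughly what those changes look like as a deployment patch (a sketch only; the values are the ones listed above, and the cpu requests/limits are unchanged from our current config):

spec:
  template:
    spec:
      containers:
      - name: ambassador
        env:
        - name: AMBASSADOR_AMBEX_SNAPSHOT_COUNT
          value: "0" # lowered from 5
        - name: AMBASSADOR_FAST_RECONFIGURE
          value: "true" # re-enabled
        - name: AMBASSADOR_RECONFIG_MAX_DELAY
          value: "3" # increased from 1
        resources:
          limits:
            cpu: "1"
            memory: 2000Mi # increased from 900Mi
          requests:
            cpu: 200m
            memory: 1000Mi # increased from 600Mi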

@fs185143
Author

fs185143 commented Aug 6, 2024

Could the 403s be due to some cert/auth issue that occurs after restart perhaps?

Maybe related to:

@fs185143
Author

fs185143 commented Aug 7, 2024

OK so I think it's actually more related to #5564

We are seeing these logs after enabling debug

[2024-08-07 11:02:15.136][31][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:152] dns resolution for authserver.authserver failed with c-ares status 12
[2024-08-07 11:02:15.136][31][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:245] DNS request timed out 4 times
[2024-08-07 11:02:15.136][31][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:278] dns resolution for authserver.authserver completed with status 1
[2024-08-07 11:02:15.136][31][debug][upstream] [source/extensions/clusters/strict_dns/strict_dns_cluster.cc:184] DNS refresh rate reset for authserver.authserver, (failure) refresh rate 5000 ms
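
For context, the 403s in the access logs carry the UAEX flag (request rejected by the external authorization service), and the debug logs above show strict_dns resolution failing for authserver.authserver. Purely to illustrate where that hostname comes from (our real AuthService resource is not shown in this issue; the name, namespace, port, and proto below are hypothetical):

apiVersion: getambassador.io/v3alpha1
kind: AuthService
metadata:
  name: authserver
  namespace: authserver
spec:
  # <service>.<namespace>; Envoy resolves this via a strict_dns cluster,
  # so a DNS timeout here can surface as 403 UAEX on protected routes
  auth_service: authserver.authserver:8080 # port is illustrative
  proto: http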

@fs185143
Author

fs185143 commented Aug 8, 2024

OK so I think it's actually more related to #5564

We are seeing these logs after enabling debug

[2024-08-07 11:02:15.136][31][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:152] dns resolution for authserver.authserver failed with c-ares status 12
[2024-08-07 11:02:15.136][31][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:245] DNS request timed out 4 times
[2024-08-07 11:02:15.136][31][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:278] dns resolution for authserver.authserver completed with status 1
[2024-08-07 11:02:15.136][31][debug][upstream] [source/extensions/clusters/strict_dns/strict_dns_cluster.cc:184] DNS refresh rate reset for authserver.authserver, (failure) refresh rate 5000 ms

Do we think this may have been fixed by a version of envoy > 1.27.2? e.g.,

@kflynn
Member

kflynn commented Aug 8, 2024

@fs185143 That's definitely the first thing to try, which is part of why I'm trying to do a new 3.10 release. 🙂

Note that there's a development branch of Emissary 4 that's already running Envoy 1.30: if you're interested in testing that to see how it behaves here, let me know.

@fs185143
Author

Closing in favour of the more targeted issue #5785.
