
emissary fails to recover after SIGTERM - possibly due to frequent reconfig #5725

Closed
fs185143 opened this issue Jul 19, 2024 · 11 comments
Labels: t:bug Something isn't working

@fs185143

fs185143 commented Jul 19, 2024

Describe the bug

  • emissary-ingress is running in a fairly demanding environment with many resources/mappings/services, etc.
  • We start seeing warnings such as:
"Memory Usage: throttling reconfig v2854 due to constrained memory with 112 stale reconfigs (60 max)"
...
"Memory Usage: throttling reconfig v3477 due to constrained memory with 123 stale reconfigs (120 max)"
  • Modest memory usage
"Memory Usage 0.32Gi (36%)\n    PID 1, 0.14Gi: busyambassador entrypoint \n    PID 22, 0.04Gi: /usr/bin/python /usr/bin/diagd /ambassador/snapshots /ambassador/bootstrap-ads.json /ambassador/envoy/envoy.json --notices /ambassador/notices.json --port 8004 --kick kill -HUP 1 \n    PID 23, 0.16Gi: /usr/bin/python /usr/bin/diagd /ambassador/snapshots /ambassador/bootstrap-ads.json /ambassador/envoy/envoy.json --notices /ambassador/notices.json --port 8004 --kick kill -HUP 1 \n    PID 34, 0.09Gi: envoy -c /ambassador/bootstrap-ads.json --base-id 0 --drain-time-s 600 -l error \n    PID 38405, 0.05Gi (exited): envoy --service-node test-id --service-cluster ambassador-emissary --config-path /ambassador/snapshots/econf-tmp.json --mode validate \n    PID 38423, 0.05Gi (exited): envoy --service-node test-id --service-cluster ambassador-emissary --config-path /ambassador/snapshots/econf-tmp.json --mode validate \n    PID 38441, 0.05Gi (exited): envoy --service-node test-id --service-cluster ambassador-emissary --config-path /ambassador/snapshots/econf-tmp.json --mode validate \n    PID 38471, 0.08Gi (exited): envoy --service-node test-id --service-cluster ambassador-emissary --config-path /ambassador/snapshots/econf-tmp.json --mode validate "
  • ambassador container receives SIGTERM
"time="2024-07-18 11:54:53.2387" level=error msg="goroutine \":signal_handler:0\" exited with error: received signal terminated (triggering graceful shutdown)" func="github.com/datawire/dlib/dgroup.(*Group).goWorkerCtx.func1.1" file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:380" CMD=entrypoint PID=1 THREAD=":signal_handler:0""
"shutting down (gracefully)...
...
"    :signal_handler:0        : exited with error"
"time="2024-07-18 11:54:55.1160" level=error msg="shut down with error error: received signal terminated (triggering graceful shutdown)" func=github.com/emissary-ingress/emissary/v3/pkg/busy.Main file="/go/pkg/busy/busy.go:87" CMD=entrypoint PID=1"
"time="2024-07-18 22:08:49.9497" level=error msg="goroutine \":signal_handler:0\" exited with error: received signal terminated (triggering graceful shutdown)" func="github.com/datawire/dlib/dgroup.(*Group).goWorkerCtx.func1.1" file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:380" CMD=entrypoint PID=1 THREAD=":signal_handler:0""
  • A new emissary-ingress pod is created and jumps right into
time="2024-07-19 05:44:00.3312" level=info msg="Memory Usage 0.35Gi (39%)\n    PID 1, 0.16Gi: busyambassador entrypoint \n    PID 25, 0.04Gi: /usr/bin/python /usr/bin/diagd /ambassador/snapshots /ambassador/bootstrap-ads.json /ambassador/envoy/envoy.json --notices /ambassador/notices.json --port 8004 --kick kill -HUP 1 \n    PID 26, 0.16Gi: /usr/bin/python /usr/bin/diagd /ambassador/snapshots /ambassador/bootstrap-ads.json /ambassador/envoy/envoy.json --notices /ambassador/notices.json --port 8004 --kick kill -HUP 1 \n    PID 37, 0.10Gi: envoy -c /ambassador/bootstrap-ads.json --base-id 0 --drain-time-s 600 -l error \n    PID 31820, 0.09Gi (exited): envoy --service-node test-id --service-cluster ambassador-emissary --config-path /ambassador/snapshots/econf-tmp.json --mode validate \n    PID 31850, 0.05Gi (exited): envoy --service-node test-id --service-cluster ambassador-emissary --config-path /ambassador/snapshots/econf-tmp.json --mode validate \n    PID 31892, 0.04Gi (exited): envoy --service-node test-id --service-cluster ambassador-emissary --config-path /ambassador/snapshots/econf-tmp.json --mode validate \n    PID 31946, 0.08Gi (exited): envoy --service-node test-id --service-cluster ambassador-emissary --config-path /ambassador/snapshots/econf-tmp.json --mode validate " func="github.com/emissary-ingress/emissary/v3/pkg/memory.(*MemoryUsage).Watch.func1" file="/go/pkg/memory/memory.go:39" CMD=entrypoint PID=1 THREAD=/memory
2024-07-19 05:44:22 diagd 3.9.1 [P26TAEW] INFO: EnvoyConfig: Generating V3
2024-07-19 05:44:22 diagd 3.9.1 [P26TAEW] INFO: V3Listener <V3Listener HTTP emissary-listener on 0.0.0.0:8888 [XFP]>: generated ===========================
2024-07-19 05:44:22 diagd 3.9.1 [P26TAEW] INFO: V3Ready: ==== listen on 127.0.0.1:8006
2024-07-19 05:44:23 diagd 3.9.1 [P26TThreadPoolExecutor-0_1] INFO: 41A43E42-219C-4FBE-9C8B-9128C3123F01: 127.0.0.1 "GET /metrics" 658ms 200 success
2024-07-19 05:44:24 diagd 3.9.1 [P26TAEW] INFO: configuration updated (incremental) from snapshot snapshot (S33 L1 G311 C311)
time="2024-07-19 05:44:24.4078" level=info msg="Loaded file /ambassador/envoy/envoy.json" func=github.com/emissary-ingress/emissary/v3/pkg/ambex.Decode file="/go/pkg/ambex/main.go:281" CMD=entrypoint PID=1 THREAD=/ambex/main-loop
time="2024-07-19 05:44:24.5576" level=info msg="Saved snapshot v5318" func=github.com/emissary-ingress/emissary/v3/pkg/ambex.csDump file="/go/pkg/ambex/main.go:351" CMD=entrypoint PID=1 THREAD=/ambex/main-loop
time="2024-07-19 05:44:24.5660" level=info msg="Pushing snapshot v5318" func=github.com/emissary-ingress/emissary/v3/pkg/ambex.updaterWithTicker file="/go/pkg/ambex/ratelimit.go:159" CMD=entrypoint PID=1 THREAD=/ambex/updater
...
  • Requests no longer work; e.g., they give errors such as the following, where previously they would return 30x or 200:
ACCESS [2024-07-19T05:44:22.929Z] "GET /grafana/metrics HTTP/1.1" 503 UH 0 4172 16 - "<ip>,<ip>,<ip>" "Prometheus/2.35.0-gmp.4" "<uuid>" "<uuid>.<domain>" "-"

ACCESS [2024-07-19T09:59:29.493Z] "GET <path> HTTP/1.1" 403 UAEX 0 3984 0 - "<ip>, <ip>,<ip>, <ip>, <ip>,<ip>" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36 Edg/126.0.0.0" <uuid>" "<uuid>.<domain>" "-"

To Reproduce
We can work on providing a more concrete example if needed, but we observe this after around 12-24h on a cluster as described above.

Expected behaviour
Emissary should be able to recover after being replaced. We have observed the SIGTERM due to a promotion, which is expected, but also SIGKILL due to memory usage; the latter seems to have been fixed by increasing our resource limits.

Versions (please complete the following information):

  • Ambassador: 3.9.1
  • Kubernetes environment: GKE
  • Version: v1.27.14-gke.1042001

Additional context
Possibly related:

Some relevant configuration options:

spec:
  template:
    spec:
      terminationGracePeriodSeconds: 90
      containers:
      - name: ambassador
        lifecycle:
          preStop:
            exec:
              command:
              - sleep 60
        livenessProbe:
          failureThreshold: 6
          httpGet:
            path: /ambassador/v0/check_alive
            port: admin
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 6
          successThreshold: 1
          timeoutSeconds: 1
        readinessProbe:
          failureThreshold: 6
          httpGet:
            path: /ambassador/v0/check_ready
            port: admin
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 6
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: "1"
            memory: 900Mi
          requests:
            cpu: 200m
            memory: 600Mi
@dosubot dosubot bot added the t:bug Something isn't working label Jul 19, 2024
@kflynn
Member

kflynn commented Jul 22, 2024

@fs185143, just to confirm, is this with Emissary or Ambassador Edge Stack? (It's a bug either way, just want to make sure of what I'm looking at. 🙂)

@cindymullins-dw
Contributor

Failing to recover after the SIGTERM sounds like a bug. But you can take some measures to address large reconfigurations or high CPU/memory usage with some of these specs. You can also try reducing AMBASSADOR_AMBEX_SNAPSHOT_COUNT.

A common reason for high CPU and memory is a high number of automated snapshots of your configuration. Snapshots are only stored until the limit of AMBASSADOR_AMBEX_SNAPSHOT_COUNT is met.

@fs185143
Author

fs185143 commented Jul 23, 2024

Failing to recover after the SIGTERM sounds like a bug. But you can take some measures to address large reconfigurations or high CPU/memory usage with some of these specs. You can also try reducing AMBASSADOR_AMBEX_SNAPSHOT_COUNT.

A common reason for high CPU and memory is a high number of automated snapshots of your configuration. Snapshots are only stored until the limit of AMBASSADOR_AMBEX_SNAPSHOT_COUNT is met.

Thanks - we will try this on our emissary/emissary-ingress deployment:

spec:
  template:
    spec:
      containers:
      - name: ambassador
        env:
        - name: AMBASSADOR_AMBEX_SNAPSHOT_COUNT
          value: "5" # default 30
        - name: AMBASSADOR_FAST_RECONFIGURE
          value: "false"
        - name: AMBASSADOR_DRAIN_TIME
          value: "300" # default 600

@fs185143
Author

@fs185143, just to confirm, is this with Emissary or Ambassador Edge Stack? (It's a bug either way, just want to make sure of what I'm looking at. 🙂)

I believe it comes under Emissary, as we are not using the enterprise product (Ambassador Edge Stack).

@fs185143
Author

fs185143 commented Aug 6, 2024

We are now seeing these ambassador logs

Memory Usage 0.85Gi (96%)\n    PID 1, 0.14Gi: busyambassador entrypoint \n    PID 16, 0.04Gi: /usr/bin/python /usr/bin/diagd /ambassador/snapshots /ambassador/bootstrap-ads.json /ambassador/envoy/envoy.json --notices /ambassador/notices.json --port 8004 --kick kill -HUP 1 \n    PID 18, 0.34Gi: /usr/bin/python /usr/bin/diagd /ambassador/snapshots /ambassador/bootstrap-ads.json /ambassador/envoy/envoy.json --notices /ambassador/notices.json --port 8004 --kick kill -HUP 1 \n    PID 29, 0.11Gi: envoy -c /ambassador/bootstrap-ads.json --base-id 0 --drain-time-s 300 -l error \n    PID 246459, 0.10Gi (exited): envoy --service-node test-id --service-cluster ambassador-emissary --config-path /ambassador/snapshots/econf-tmp.json --mode validate \n    PID 246489, 0.08Gi (exited): envoy --service-node test-id --service-cluster ambassador-emissary --config-path /ambassador/snapshots/econf-tmp.json --mode validate \n    PID 246525, 0.09Gi: envoy --service-node test-id --service-cluster ambassador-emissary --config-path /ambassador/snapshots/econf-tmp.json --mode validate
...
time="2024-08-04 13:55:43.4550" level=warning msg="Memory Usage: throttling reconfig v41082 due to constrained memory with 15 stale reconfigs (1 max)" func=github.com/emissary-ingress/emissary/v3/pkg/ambex.updaterWithTicker file="/go/pkg/ambex/ratelimit.go:140" CMD=entrypoint PID=1 THREAD=/ambex/updater

This caused the emissary-ingress pod to restart after running for 3 days, and post-restart it only gives 403 errors.

This is with the following config

spec:
  template:
    spec:
      terminationGracePeriodSeconds: 90
      containers:
      - name: ambassador
        env:
        - name: AMBASSADOR_AMBEX_SNAPSHOT_COUNT
          value: "5"
        - name: AMBASSADOR_FAST_RECONFIGURE
          value: "false"
        - name: AMBASSADOR_DRAIN_TIME
          value: "300"
        - name: SCOUT_DISABLE
          value: "1"
        lifecycle:
          preStop:
            exec:
              command:
              - sleep 60
        livenessProbe:
          failureThreshold: 6
          httpGet:
            path: /ambassador/v0/check_alive
            port: admin
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 6
          successThreshold: 1
          timeoutSeconds: 1
        readinessProbe:
          failureThreshold: 6
          httpGet:
            path: /ambassador/v0/check_ready
            port: admin
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 6
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: "1"
            memory: 900Mi
          requests:
            cpu: 200m
            memory: 600Mi

Would appreciate any advice/support - thanks.

I am wondering if we might just need to increase the memory limit, but I am also not sure why it returns the ACCESS 403 errors after it restarts.

@fs185143
Author

fs185143 commented Aug 6, 2024

Going to try the following (see the sketch below):

  • Lower AMBASSADOR_AMBEX_SNAPSHOT_COUNT from 5 to 0
  • Re-enable AMBASSADOR_FAST_RECONFIGURE
  • Increase AMBASSADOR_RECONFIG_MAX_DELAY from 1 to 3
  • Increase memory limit from 900Mi to 2000Mi
  • Increase memory request from 600Mi to 1000Mi

Not sure if it's worth changing AMBASSADOR_SNAPSHOT_COUNT?
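
For reference, roughly what those changes look like as a deployment patch (a sketch only; the values are the ones listed above, and the cpu requests/limits are unchanged from our current config):

spec:
  template:
    spec:
      containers:
      - name: ambassador
        env:
        - name: AMBASSADOR_AMBEX_SNAPSHOT_COUNT
          value: "0" # lowered from 5
        - name: AMBASSADOR_FAST_RECONFIGURE
          value: "true" # re-enabled
        - name: AMBASSADOR_RECONFIG_MAX_DELAY
          value: "3" # increased from 1
        resources:
          limits:
            cpu: "1"
            memory: 2000Mi # increased from 900Mi
          requests:
            cpu: 200m
            memory: 1000Mi # increased from 600Mi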

@fs185143
Author

fs185143 commented Aug 6, 2024

Could the 403s be due to some cert/auth issue that occurs after restart perhaps?

Maybe related to:

@fs185143
Author

fs185143 commented Aug 7, 2024

OK so I think it's actually more related to #5564

We are seeing these logs after enabling debug

[2024-08-07 11:02:15.136][31][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:152] dns resolution for authserver.authserver failed with c-ares status 12
[2024-08-07 11:02:15.136][31][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:245] DNS request timed out 4 times
[2024-08-07 11:02:15.136][31][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:278] dns resolution for authserver.authserver completed with status 1
[2024-08-07 11:02:15.136][31][debug][upstream] [source/extensions/clusters/strict_dns/strict_dns_cluster.cc:184] DNS refresh rate reset for authserver.authserver, (failure) refresh rate 5000 ms
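
For context, the 403s in the access logs carry the UAEX flag (request rejected by the external authorization service), and the debug logs above show strict_dns resolution failing for authserver.authserver. Purely to illustrate where that hostname comes from (our real AuthService resource is not shown in this issue; the name, namespace, port, and proto below are hypothetical):

apiVersion: getambassador.io/v3alpha1
kind: AuthService
metadata:
  name: authserver
  namespace: authserver
spec:
  # <service>.<namespace>; Envoy resolves this via a strict_dns cluster,
  # so a DNS timeout here can surface as 403 UAEX on protected routes
  auth_service: authserver.authserver:8080 # port is illustrative
  proto: http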

@fs185143
Author

fs185143 commented Aug 8, 2024

OK so I think it's actually more related to #5564

We are seeing these logs after enabling debug

[2024-08-07 11:02:15.136][31][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:152] dns resolution for authserver.authserver failed with c-ares status 12
[2024-08-07 11:02:15.136][31][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:245] DNS request timed out 4 times
[2024-08-07 11:02:15.136][31][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:278] dns resolution for authserver.authserver completed with status 1
[2024-08-07 11:02:15.136][31][debug][upstream] [source/extensions/clusters/strict_dns/strict_dns_cluster.cc:184] DNS refresh rate reset for authserver.authserver, (failure) refresh rate 5000 ms

Do we think this may have been fixed by a version of envoy > 1.27.2? e.g.,

@kflynn
Member

kflynn commented Aug 8, 2024

@fs185143 That's definitely the first thing to try, which is part of why I'm trying to do a new 3.10 release. 🙂

Note that there's a development branch of Emissary 4 that's already running Envoy 1.30: if you're interested in testing that to see how it behaves here, let me know.

@fs185143
Author

Closing in favour of the more targeted issue #5785.
