emissary fails to recover after SIGTERM - possibly due to frequent reconfig #5725
@fs185143, just to confirm, is this with Emissary or Ambassador Edge Stack? (It's a bug either way, just want to make sure of what I'm looking at. 🙂)
Failing to recover after the SIGTERM sounds like a bug. But you can take some measures to address large reconfigurations or high CPU/memory usage with some of these specs. You can also try reducing AMBASSADOR_AMBEX_SNAPSHOT_COUNT. A common cause of high CPU and memory usage is a large number of automated snapshots of your configuration; snapshots are only stored until the AMBASSADOR_AMBEX_SNAPSHOT_COUNT limit is reached.
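The retention behaviour described above (only the most recent N snapshots are kept) can be sketched conceptually. This is an illustration of the bounded-buffer idea only, not Emissary's actual implementation; the names here are made up:

```python
from collections import deque

# Conceptual sketch of a bounded snapshot store: once the limit is
# reached, the oldest snapshot is discarded when a new one arrives.
SNAPSHOT_COUNT = 5  # analogous to AMBASSADOR_AMBEX_SNAPSHOT_COUNT

snapshots = deque(maxlen=SNAPSHOT_COUNT)
for config_version in range(1, 31):  # simulate 30 reconfigurations
    snapshots.append(f"snapshot-v{config_version}")

print(len(snapshots))  # never exceeds SNAPSHOT_COUNT, i.e. 5
print(snapshots[0])    # oldest retained snapshot: snapshot-v26
```

The point is that memory held for snapshots scales with the configured count, so a busy cluster that reconfigures frequently pays for every retained snapshot.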
thanks - we will try this on our spec:

```yaml
template:
  spec:
    containers:
      - name: ambassador
        env:
          - name: AMBASSADOR_AMBEX_SNAPSHOT_COUNT
            value: "5" # default 30
          - name: AMBASSADOR_FAST_RECONFIGURE
            value: "false"
          - name: AMBASSADOR_DRAIN_TIME
            value: "300" # default 600
```
I believe it comes under Emissary, as we are not using the enterprise application.
We are now seeing these [screenshots omitted], which caused the [details omitted]. This is with the following config spec:
```yaml
template:
  spec:
    terminationGracePeriodSeconds: 90
    containers:
      - name: ambassador
        env:
          - name: AMBASSADOR_AMBEX_SNAPSHOT_COUNT
            value: "5"
          - name: AMBASSADOR_FAST_RECONFIGURE
            value: "false"
          - name: AMBASSADOR_DRAIN_TIME
            value: "300"
          - name: SCOUT_DISABLE
            value: "1"
        lifecycle:
          preStop:
            exec:
              # note: exec takes an argv list, so the command and its
              # argument must be separate entries
              command:
                - sleep
                - "60"
        livenessProbe:
          failureThreshold: 6
          httpGet:
            path: /ambassador/v0/check_alive
            port: admin
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 6
          successThreshold: 1
          timeoutSeconds: 1
        readinessProbe:
          failureThreshold: 6
          httpGet:
            path: /ambassador/v0/check_ready
            port: admin
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 6
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: "1"
            memory: 900Mi
          requests:
            cpu: 200m
            memory: 600Mi
```

Would appreciate any advice/support - thanks. I am wondering if we might just need to increase the memory limit, but I'm also not sure why it gives the [details omitted]
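One thing worth sanity-checking in that spec is the shutdown timeline. This is a rough sketch under the assumption that Kubernetes runs the preStop hook before sending SIGTERM and that the terminationGracePeriodSeconds clock covers both the hook and the drain, so the numbers are illustrative rather than an exact model of Emissary's behaviour:

```python
# Rough termination-timeline check for the spec above.
termination_grace_period = 90  # terminationGracePeriodSeconds
pre_stop_sleep = 60            # preStop hook: sleep 60
drain_time = 300               # AMBASSADOR_DRAIN_TIME

# Time remaining between the preStop hook finishing and SIGKILL.
window_after_prestop = termination_grace_period - pre_stop_sleep
print(window_after_prestop)               # 30 seconds
print(drain_time > window_after_prestop)  # True: drain cannot finish
```

If that assumption holds, a 300-second drain can never complete inside the 30 seconds left after the preStop sleep, which may be worth checking when tuning these values together.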
Going to try:
Not sure if it's worth changing
Could the 403s be due to some cert/auth issue that occurs after restart perhaps? Maybe related to:
OK, so I think it's actually more related to #5564. We are seeing these logs after enabling debug:
Do we think this may have been fixed by a version of Envoy > 1.27.2? e.g.,
@fs185143 That's definitely the first thing to try, which is part of why I'm trying to do a new 3.10 release. 🙂 Note that there's a development branch of Emissary 4 that's already running Envoy 1.30: if you're interested in testing that to see how it behaves here, let me know.
Closing in favour of the more targeted issue #5785.
Describe the bug
- `emissary-ingress` running on a fairly demanding environment with many resources/mappings/services etc.
- `ambassador` container receives SIGTERM
- a new `emissary-ingress` pod is created and jumps right into [details omitted] `30x` or `200`.

To Reproduce
Can work on providing a more concrete example if needed, but we are observing this after around 12-24h on a cluster as described.
Expected behaviour
Should be able to recover after being replaced. We have observed the SIGTERM due to a promotion which is expected, but also SIGKILL due to memory usage - the latter seems to have been fixed after increasing our resource limits.
Versions (please complete the following information):
Additional context
Possibly related:
Some relevant configuration options: