Can't achieve graceful shutdown of pods using Ambassador+ConsulConnect #3264
Update on this: We wanted to see whether the AMBASSADOR_FAST_RECONFIGURE variable was likely to help us, if we could get it to run. So, for testing purposes only, we managed to hack around the error mentioned above by setting a specific IP address in the ConsulResolver k8s resource instead of ${HOST_IP}.
With that test environment, enabling AMBASSADOR_FAST_RECONFIGURE and adding a preStop handler on the deployment for our service-under-test does indeed decrease the number of 503's we see when pods terminate. In many cases there are zero 503's, but sometimes 1 or 2 when pods go down, so that is a significant improvement at least. I experimented with a retry_policy for "5xx" on Mappings that use the Consul resolver, but didn't see that push the failures to zero; I need to do more testing in that area. I see that #3182 is merged and slotted for 1.13, so I will close this once 1.13 is out and confirmed to work with the Consul resolver.
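For reference, the test configuration looked roughly like the sketch below. The resource names, the hard-coded address, the image, and the sleep/retry values are placeholders rather than our exact manifests:

```yaml
---
# ConsulResolver with the Consul agent address hard-coded (the test-only hack),
# instead of the usual "${HOST_IP}:8500" interpolation.
apiVersion: getambassador.io/v2
kind: ConsulResolver
metadata:
  name: consul-dc1
spec:
  address: "10.20.30.40:8500"   # placeholder node IP
  datacenter: dc1
---
# Mapping routed through the Consul resolver, with the experimental 5xx retry_policy.
apiVersion: getambassador.io/v2
kind: Mapping
metadata:
  name: service-under-test
spec:
  prefix: /service-under-test/
  service: service-under-test-sidecar-proxy   # the Connect sidecar service in Consul
  resolver: consul-dc1
  tls: ambassador-consul                      # TLSContext from the Connect integration
  retry_policy:
    retry_on: "5xx"
    num_retries: 3
---
# service-under-test Deployment: the preStop sleep delays SIGTERM so Ambassador/Envoy
# have a chance to observe the Consul deregistration before the pod stops serving.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: service-under-test
spec:
  selector:
    matchLabels:
      app: service-under-test
  template:
    metadata:
      labels:
        app: service-under-test
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: service-under-test
          image: example/service-under-test:latest   # placeholder image
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 20"]
```

The preStop sleep is the only piece doing "graceful" work on the workload side; everything else is just wiring the Mapping through the Consul resolver.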
Update:
With the learnings posted above, the problems mentioned directly in this issue are mitigated. This can now be closed. There are further issues that we see creating instability in our upstreams, but those are handled more specifically in other issues:
Describe the bug
I tried various ways of getting to the point of full gracefulness when either:
a) doing a rolling deploy (and thus pods come down 1 at a time)
b) doing a scale down (and thus a bunch of pods come down simultaneously)
...but I have not found a way to make it so that 0 requests get dropped when pods terminate. On the flip side, for pod launch (e.g. scale up), we do better - some combination of the health checks and perhaps delays are making that operation work fairly gracefully in my testing. Termination is the problem.
I am looking for any suggestions; any help would be hugely appreciated.
To Reproduce
This is kind of involved, but here's the setup: Ambassador routes to the service-under-test through a Mapping that uses a ConsulResolver whose address is ${HOST_IP}:8500 (the node-local Consul agent), and the service itself runs in the Consul Connect mesh.
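The resolver piece looks roughly like the sketch below; the resource name and datacenter are placeholders for our real values:

```yaml
# ConsulResolver pointing Ambassador at the node-local Consul agent.
apiVersion: getambassador.io/v2
kind: ConsulResolver
metadata:
  name: consul-dc1
spec:
  address: "${HOST_IP}:8500"   # 8500 is Consul's standard HTTP API port
  datacenter: dc1
```

In a setup like this, HOST_IP is typically made available to the Ambassador container via the downward API (a valueFrom.fieldRef env var pointing at status.hostIP); that ${HOST_IP} interpolation is the piece #3182 concerns.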
Results:
Expected behavior
To not have errors (HTTP 503's) for requests during the termination of a pod running in Consul Connect
Versions (please complete the following information):
Additional context
What Kinds of Things I Have Tried:
I suspect that AES is not keeping up with the mesh changes. In looking through the AES debug logs, I find lines like this:
...and if I tail | grep for those while the deploy or scale down is happening, I see that these logs can come many seconds later. I can't say that the 503's stop happening right when these logs come (which would be evidence), but I'm of course grasping at straws a bit.
WHAT ABOUT AMBASSADOR_FAST_RECONFIGURE?
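Enabling it is just an environment-variable toggle on the Ambassador Deployment. A minimal sketch, as a strategic-merge patch over that Deployment (the container name ambassador is an assumption about our install; adjust as needed):

```yaml
# Patch fragment for the Ambassador Deployment: turns on fast reconfigure.
spec:
  template:
    spec:
      containers:
        - name: ambassador
          env:
            - name: AMBASSADOR_FAST_RECONFIGURE
              value: "true"
```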
Doing that, however, runs straight into #3182 ("${HOST_IP} in ConsulResolver does not work without legacy mode").
Logs of non-functional AES 1.12.0 with AMBASSADOR_FAST_RECONFIGURE: true: