502 and "Connection refused while connecting to upstream" scaling pods down (or deleting pods) #3639
Comments
No, the default is not to do that. Please check https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/configmap/#proxy-next-upstream The default value is:
@aledbf I did that already, see my configmap above 🙂
Sorry about that.
All good!
That means the pod is not ready to receive connections. This should not happen if the app is using probes and the configuration is working.
Do you see the 502 in apachebench or just in the logs?
We are using probes. Both (in apachebench and in the logs).
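For reference, a readiness probe of the kind being discussed looks roughly like this in the Deployment's pod spec. This is a minimal sketch; the container name, image, path, port and timings are placeholders, not taken from this issue.

```yaml
# Hypothetical pod spec fragment; names, paths and timings are placeholders.
containers:
  - name: app
    image: example/app:latest        # placeholder image
    readinessProbe:                  # removes the pod from Service endpoints while failing
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 5
      failureThreshold: 3
    livenessProbe:                   # restarts the container if it stops responding
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
```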
Ahh, I see what you mean! It looks like this, where it tried and failed with:
That's great to see it working, but we still see messages like this from the controller as well:
So I think the controller is doing the best it can.
Yes, we see it in AB. OK, at this point I'll close this issue. Maybe I'll open a new issue about the 499 as I think that's different. But do you know how our pod could return
From your description, you went from 10 to 5 pods, but the ingress controller only retries 3 times (you can change that). If the retries are sent to the killed pods, there is a small window of time (less than a second) where the ingress controller is not up to date. Edit: you can test this by removing only one pod; you should not see 502s.
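The retry count mentioned here is controlled through the controller's ConfigMap. A minimal sketch of raising it follows; the ConfigMap name and namespace are assumptions and must match your ingress-nginx installation:

```yaml
# Sketch only: ConfigMap name/namespace depend on how the controller was installed.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-configuration
  namespace: ingress-nginx
data:
  proxy-next-upstream: "error timeout http_502"   # which errors trigger a retry
  proxy-next-upstream-tries: "5"                  # try up to 5 upstreams per request
```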
This can also be the case when your app does not handle SIGTERM properly, in other words if it exits immediately. I'd suggest trying the following configuration for your deployment:
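The configuration that was suggested is not preserved in this extract. The pattern usually recommended for this situation is a preStop hook that delays shutdown long enough for the controller to drop the endpoint; a hedged sketch, with placeholder names and timings:

```yaml
# Hypothetical sketch, not necessarily the snippet that was originally posted:
# keep the container running briefly after the pod starts terminating, so the
# ingress controller stops routing to it before SIGTERM is delivered.
spec:
  terminationGracePeriodSeconds: 30
  containers:
    - name: app                      # placeholder container name
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 10"]
```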
OK yes, that makes sense because in theory, the ingress controller could try 3 of the 5 stopping pods.
Tested, everything makes sense now!
Yes, but I think the problem here is that there will always be a small window between the pod containers getting SIGTERM and the ingress-controller updating its upstreams. If your app takes 5 seconds to stop, you won't see it. But for our app, a request might only last 50ms. And it shuts down correctly, i.e. finishes any remaining requests and then exits. So in theory, the app can gracefully shut down and exit before the ingress-controller is up to date.
True, and for nginx (not the controller, but the nginx that is part of our application pod), apparently
Anyway, in summary, I was able to avoid all 502s when scaling down by adding this to our application nginx container:
And this to our app container:
Thanks for taking the time @aledbf and @ElvinEfendi
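The two container snippets referred to above are not preserved in this extract. A hedged guess at their shape, based on the rest of the thread: the application's nginx container gets a preStop hook that sleeps and then asks nginx to shut down gracefully (nginx treats plain SIGTERM as a fast shutdown, while `nginx -s quit` drains in-flight requests), and the app container gets a sleep-based preStop hook like the sketch further up.

```yaml
# Hypothetical reconstruction, not the poster's exact snippet; timings are placeholders.
containers:
  - name: nginx                      # nginx running inside the application pod
    lifecycle:
      preStop:
        exec:
          # wait for the controller to remove this endpoint, then drain gracefully
          command: ["sh", "-c", "sleep 5 && nginx -s quit"]
```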
@max-rocket-internet thank you for posting this, it solved a major mystery that I've been chasing for 3 years! How did you end up with
@matti as long as you're using a recent ingress-nginx version, the reload does not matter. You want to sleep
I'm concerned about why this issue has been closed. The current behavior of ingress-nginx doesn't align with what I would expect from a properly functioning ingress controller. Using the
Many users, including myself, have only discovered this problem after experiencing sporadic 503 errors during rolling updates (e.g., issues #8731, #7330, #6273). Isn't one of the main reasons for choosing Kubernetes its HA features?

The explanation provided by @max-rocket-internet makes complete sense: if ingress-nginx polls the Kubernetes API every second, then for pods that terminate in less than a second, ingress-nginx may still have IPs in its list of endpoints that have already shut down, especially when there's minimal load and no in-flight requests.

Kubernetes ClusterIP services seem to handle this situation correctly. Why can't ingress-nginx do the same? Alternatively, the docs should come with a prominent warning advising users that their pods must not shut down too fast.
Is this a BUG REPORT or FEATURE REQUEST? (choose one):
BUG REPORT
NGINX Ingress controller version:
0.21.0 / git-b65b85cd9
Kubernetes version (use kubectl version): 1.11
Environment:
AWS EKS
Installed from chart version 1.0.1 as a daemonset.
Chart config:
What happened:
When scaling down or deleting pods, we see a few 502 errors returned to the client and this in the controller logs:
What you expected to happen:
No errors visible to the client.
How to reproduce it (as minimally and precisely as possible):
Run HTTP GET requests continuously, then delete some pods or downscale the deployment.
Anything else we need to know:
Note I have set proxy-next-upstream: error timeout http_502 as shown above.
The deployment behind the ingress is also nginx, and it sometimes shows error 499, which indicates that the ingress-controller is terminating the connections before completion?
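For context, this setting would typically be passed to the controller through the chart's values; the fragment below is a sketch only, since the exact values layout depends on the chart and version in use:

```yaml
# Hypothetical Helm values fragment; the controller.config key layout is an
# assumption and depends on the chart version.
controller:
  config:
    proxy-next-upstream: "error timeout http_502"
```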
The log from the controller shows this: