upstream IP inconsistent with pod IP after pod deletion #768
Comments
Any pointers on how to best debug this problem would be appreciated, as I'm not very familiar with the code. I added a debug message showing the diffed nginx.conf.
There was no message in the log showing the diffed nginx.conf. The new pod IP should be 10.2.0.26, but the upstream is still the original:
I also verified the new pod IP from k8s perspective:
Once the controller gets into this bad state, no more nginx.conf updates occur. Log messages showing the UpdateFunc call after another pod deletion:
There is no nginx.conf diff update message in the log and the IP is still the original:
I even tried deleting and adding new ingress definitions. The nginx.conf file gets no updates whatsoever. Any tips on digging deeper?
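For background, here is a minimal sketch of how a client-go informer's UpdateFunc is typically wired up to trigger a config sync. This is illustrative only, not the actual controller code; the Endpoints resource, resync period, and in-cluster config are assumptions.

```go
// Illustrative sketch only: how client-go delivers UpdateFunc callbacks that a
// controller would use to re-render nginx.conf. Not the ingress controller's
// actual code; resource type, resync period, and config source are assumptions.
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// Assumes the process runs in-cluster, as the ingress controller does.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	factory.Core().V1().Endpoints().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(old, cur interface{}) {
			ep := cur.(*corev1.Endpoints)
			// A real controller would enqueue a sync here that regenerates and
			// diffs nginx.conf; if this callback stops firing, or the sync
			// blocks, the config never changes again.
			fmt.Printf("UpdateFunc: endpoints %s/%s changed\n", ep.Namespace, ep.Name)
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	select {} // block forever
}
```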
@caseylucas I cannot reproduce this issue with k8s 1.6 or 1.5.7.
@aledbf So you think it's related to 1.4? Indeed, it is tough for me to replicate too; however, I'd like to avoid having to upgrade our k8s cluster (for now) if possible. Are there any recommended spots to dig deeper?
@aledbf I was able to get a trace after things seemed to lock up. Keep in mind I'm not a golang expert, but I think the problem may be an attempted recursive write lock of a RWMutex. I added a few debug log statements, so the line numbers may be slightly off, but you should be able to follow the stack trace. See the two lines "// CASEY: " below. At k8s.io/client-go/tools/cache/delta_fifo.go:451,
There are other goroutines blocked in
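To make the suspected failure mode concrete, here is a minimal standalone sketch, assuming the bug really is a non-reentrant sync.RWMutex being write-locked twice on the same goroutine (this is not the controller's code):

```go
// Standalone illustration of the suspected bug: Go's sync.RWMutex is not
// reentrant, so a goroutine that already holds the write lock and asks for it
// again blocks forever. Every other goroutine needing the lock then piles up
// behind it, which matches "no more nginx.conf updates occur".
package main

import (
	"fmt"
	"sync"
)

type cacheStore struct {
	mu sync.RWMutex
}

func (c *cacheStore) outer() {
	c.mu.Lock() // first (outer) write lock
	defer c.mu.Unlock()
	c.inner() // deeper in the call chain...
}

func (c *cacheStore) inner() {
	c.mu.Lock() // ...a recursive write lock: never acquired, hangs here
	defer c.mu.Unlock()
	fmt.Println("never reached")
}

func main() {
	c := &cacheStore{}
	c.outer() // the runtime aborts with "all goroutines are asleep - deadlock!"
}
```

sync.RWMutex offers no lock recursion, so the usual fix is to take the lock once at the outermost caller or to provide separate locked and lock-free variants of the inner method.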
@caseylucas please update the image to
I'll give it a try. It normally takes a while to see the problem. Can you send me a link to your changes (or a diff/patch) in case I want to merge them into my version, which has a few more debug messages? Thanks! BTW.
@caseylucas here: #792
@aledbf I merged your changes into mine and it's been running for 7+ hours and still looking good. 😄 I'm surprised this hasn't bitten more people before, but good to get it fixed!
I am still facing this using
@ese Can you get stack traces from an ingress controller that seems to be having the problem? I pulled one like this:
Dump goroutines:
We should be able to see from the stack traces if you're seeing the same problem.
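For reference, here is a minimal sketch of one way to expose goroutine stacks from a Go process using net/http/pprof; the listen address is an illustrative assumption, not necessarily what the controller uses:

```go
// Minimal sketch: expose Go's pprof handlers so goroutine stacks can be pulled
// over HTTP. The listen address below is an illustrative assumption.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// With this running, fetching
	//   http://localhost:6060/debug/pprof/goroutine?debug=2
	// returns the full stack of every goroutine, which is enough to spot
	// goroutines stuck waiting on the same mutex.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```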
@caseylucas Thanks for the debugging tips. Here is the file from an nginx-controller with the problem.
@ese I verified that you are not seeing the exact same problem I was seeing. Sorry 😞. The easy way to spot it is to find two
@ese Can you confirm that you are seeing no more updates whatsoever to the nginx.conf file once your ingress controller is messed up, even if you delete pods, make ingress definition changes, etc.?
Problem
We noticed that after pod deletion and waiting for pods to restart, we were getting errors (500, 502, etc.). Once the errors start for a virtual host, they remain until we restart the ingress controller.
Log
In the nginx log, we noticed connection refused errors.
Note the upstream IP: 10.2.1.25
I assume that 10.2.1.25 is the previous pod's IP.
The pod's current IP (after pod deletion and auto restart) is actually: 10.2.3.14
Within the cluster, I can use curl to hit the pod's IP (10.2.3.14) and get back the expected results.
Versions
ingress:
k8s: v1.4.5
Pod Info
nginx.conf oddness
Note the same IP as in the connection refused log message.
Dump new nginx.conf:
I was killing random pods during testing. Other entries look wrong too, but the last one, on port 5000, is the one I originally noticed was wrong.
After restarting the controller, all is good because the configs are correct again.