Inconsistent ingress secret synchronization across nodes #2068

frnckdlprt · 2018-02-12T00:49:45Z

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/.): potential bug

What keywords did you search in NGINX Ingress controller issues before filing this one? (If you have found any duplicates, you should instead reply there.): default backend 404, sync, secrets, ingress, tls

Is this a BUG REPORT or FEATURE REQUEST? (choose one):

NGINX Ingress controller version:
v0.10.2

Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.4", GitCommit:"9befc2b8928a9426501d3bf62f72849d5cbcd5a3", GitTreeState:"clean", BuildDate:"2017-11-20T05:28:34Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.4", GitCommit:"9befc2b8928a9426501d3bf62f72849d5cbcd5a3", GitTreeState:"clean", BuildDate:"2017-11-20T05:17:43Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

Environment:

Cloud provider or hardware configuration: 9 worker node cluster
OS (e.g. from /etc/os-release): Ubuntu 16.0.3
Kernel (e.g. uname -a):
Install tools: Ansible, Helm
Others:

What happened:

We are randomly getting "default backend 404" responses, depending on which node (out of 9) handles the request and at what point in time. These occurrences are much reduced if we give a deployment time to "settle in". The nodes responding with 404 typically show in the ingress log that "adding secret ... to the local store" happened much later than on the other nodes for the same deployment.

While troubleshooting this we came across this line, which seems suspicious:
https://github.com/kubernetes/ingress-nginx/blob/nginx-0.10.2/internal/ingress/controller/store/backend_ssl.go#L199
This used to be a "continue" instead of a "return", and it seems odd that this should give up on all remaining ingresses.

What you expected to happen:

No 404s from any worker node once the application starts responding through at least one node (or within a few seconds of that).

How to reproduce it (as minimally and precisely as possible):

Deploy ingresses with TLS secrets and observe the time at which the secret is added to the local store on each ingress controller pod.
Below is one example of the timing, where one node is behind by 40 min and another by 90 min:

ingress-nginx-ingress-controller-4rrlj.log
219612:I0209 03:14:53.185798       7 backend_ssl.go:68] adding secret mynamespace/mytlssecret to the local store

ingress-nginx-ingress-controller-7pl55.log
218480:I0209 03:10:54.994316       7 backend_ssl.go:68] adding secret mynamespace/mytlssecret to the local store

ingress-nginx-ingress-controller-6j5pp.log
227555:I0209 03:51:59.353796       7 backend_ssl.go:68] adding secret mynamespace/mytlssecret to the local store

ingress-nginx-ingress-controller-bbgr2.log
234364:I0209 04:39:41.330655       7 backend_ssl.go:68] adding secret mynamespace/mytlssecret to the local store

ingress-nginx-ingress-controller-q27kj.log
219909:I0209 03:08:15.045453       7 backend_ssl.go:68] adding secret mynamespace/mytlssecret to the local store

ingress-nginx-ingress-controller-clt27.log
231931:I0209 03:05:43.346984       7 backend_ssl.go:68] adding secret mynamespace/mytlssecret to the local store

ingress-nginx-ingress-controller-pcwf9.log
262087:I0209 03:06:12.650154       7 backend_ssl.go:68] adding secret mynamespace/mytlssecret to the local store

ingress-nginx-ingress-controller-vlvkd.log
204711:I0209 03:13:46.820651       7 backend_ssl.go:68] adding secret mynamespace/mytlssecret to the local store

ingress-nginx-ingress-controller-vzg8r.log
217407:I0209 03:01:32.340037       7 backend_ssl.go:68] adding secret mynamespace/mytlssecret to the local store

Anything else we need to know:

aledbf · 2018-02-12T01:05:58Z

@frnckdlprt thank you for the report. Please use quay.io/aledbf/nginx-ingress-controller:0.324. It contains #2069

frnckdlprt · 2018-02-12T06:56:18Z

@aledbf thank you very much for your quick response. I saw 2 test runs complete without a 404, and all nodes logged "adding secret ... to the local store" within 5s of each other. More testing to come, but this looks good. Thanks again.

aledbf mentioned this issue Feb 12, 2018

Do not cancel the synchronization of secrets #2069

Merged

aledbf closed this as completed in #2069 Feb 12, 2018
