nginx controller fails startup and 100% cpu usage with 100000 hosts #1386

Hi, I am testing a configuration with 100000 hosts split into 10 ingresses; each host gets its own rule, like the sketch below. nginx controller 0.8.3 appears to work fine with this configuration, but the latest 0.9.0-beta.13 fails to start: it gets stuck on startup with one controller process using 100% CPU, the monitoring port 10254 is inaccessible (connection times out), and nginx.conf is never generated.
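The snippet itself is missing from this copy; below is a hypothetical reconstruction of one such host rule, using the extensions/v1beta1 Ingress API of that era. The hostname and backend service are invented.

```yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: hosts-part-0
spec:
  rules:
  - host: sub00001.example.com   # one of the ~100000 subdomains
    http:
      paths:
      - path: /
        backend:
          serviceName: web       # invented backend service
          servicePort: 80
  # ...each additional host adds one more entry under rules...
```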
@Gregy please post the ingress logs
Sure. The nginx config was still not updated and nginx was not listening on port 80 at the time the logs end.
@Gregy do you have a script that creates these ingress rules?
Script is here:
I ran it 10 times to get 100000 hosts split across 10 ingresses. I am trying to see whether the ingress controller can scale to this number of hosts. I have a site with tens of thousands of subdomains that I need to manage (distributing each subdomain's traffic to multiple pods) and want to know if nginx-lb can do it. It works with 0.8.3, but the beta gets stuck somewhere.
@Gregy please update the image. That being said, I'm not sure using an ingress controller beyond 10000 rules makes sense. One of the reasons is that we use several informers in the controller (a mirror of the information contained in the api server: ingresses, secrets, services, endpoints, nodes) that consume resources (RAM), and after any change in those resources we need to compare the running configuration with the new state of the cluster (this is a CPU-intensive task) and reload nginx. PR #1387 contains the changes that improve the controller.
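For readers unfamiliar with informers, here is a minimal sketch of the pattern he describes, written against today's client-go API (networking/v1; the 2017-era controller used extensions/v1beta1). The handler bodies are placeholders, not the controller's real code.

```go
package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Each informer keeps a full in-memory mirror of one resource type,
	// so memory grows with the number of ingresses/secrets/endpoints watched.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	ingresses := factory.Networking().V1().Ingresses().Informer()

	// Any add/update/delete lands here; the controller reacts by rebuilding
	// and diffing the nginx configuration, which is the CPU-heavy part.
	ingresses.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { /* enqueue a sync */ },
		UpdateFunc: func(oldObj, newObj interface{}) { /* enqueue a sync */ },
		DeleteFunc: func(obj interface{}) { /* enqueue a sync */ },
	})

	stop := make(chan struct{})
	factory.Start(stop)
	cache.WaitForCacheSync(stop, ingresses.HasSynced)
	<-stop // run until stopped
}
```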
@Gregy another detail: please don't use just one ingress rule; use something like the layout sketched below.
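The example itself is missing from this copy; presumably it showed the hosts partitioned across several smaller Ingress objects, something like the following invented layout.

```yaml
# Hypothetical layout: e.g. 10 Ingress objects with 10000 host rules each,
# instead of a single object carrying all 100000 rules.
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: hosts-batch-0
spec:
  rules:
  - host: sub00001.example.com
    http:
      paths:
      - backend:
          serviceName: web
          servicePort: 80
  # ...rules for the rest of this batch...
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: hosts-batch-1
spec:
  rules:
  - host: sub10001.example.com
    http:
      paths:
      - backend:
          serviceName: web
          servicePort: 80
  # ...and so on for the remaining batches...
```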
Thank you very much. With your image the nginx config is generated and the load balancer is working as expected, even with 100000 hosts. The CPU usage problem remains, though. I am not talking about usage while regenerating the config; that is expected. One of the controller threads is spinning all the time, even when no changes are being made. I took a CPU profile:
My guess would be that it is this goroutine?
Also, one error shows up in the logs:
Regarding nginx-lb's suitability for the task:
I thought the problem could be caused by a too-low sync period, so I tried increasing it by setting --sync-period=10h. The behavior remained the same.
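For readers following along: such flags are passed in the container args of the controller Deployment, roughly as in this fragment (the image tag and backend name are just examples):

```yaml
containers:
- name: nginx-ingress-controller
  image: quay.io/aledbf/nginx-ingress-controller:0.226
  args:
  - /nginx-ingress-controller
  - --default-backend-service=$(POD_NAMESPACE)/default-http-backend
  - --sync-period=10h       # the resync interval raised above
  - --update-status=false   # recommended later in this thread
```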
@Gregy please use
Closing. The refactoring is already present in master.
@Gregy please reopen if you still have issues after using the image quay.io/aledbf/nginx-ingress-controller:0.226
Thanks. I investigated the CPU usage again and this is the timeline:
T+0 - init
I guess I am missing something. What is the code doing for that half an hour if the nginx config is finished and working in the first 2 minutes? I tried with both 0.225 and 0.226; the only difference seems to be that the error is gone from the logs in 0.226. Unfortunately, adding a new ingress after start doesn't work in 0.226: I get the event message in the logs but no "backend reload required" message, and the nginx config doesn't reload. Unfortunately I cannot reopen the issue :(
@Gregy some context about how we process the ingress rules: when the ingress controller starts, it receives the list of ingresses from the api server and an event handler is called (once per ingress rule). Each call adds the ingress to a worker queue, and in a different goroutine we check whether the running configuration is equal to the new one. If the configurations are not equal, we trigger an nginx reload. Please keep in mind that you need to disable the update of the status field using --update-status=false. Is this explanation clear enough? (please tell me)
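A schematic of that flow, as a sketch only: the names, the rendered string, and the printed message stand in for the real controller code (see PR #1387 for the actual implementation). It uses client-go's workqueue, which also bears on the coalescing question below: identical keys added while a sync is pending are deduplicated, so a burst of events yields a single reload.

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

// runningConfig holds the configuration currently loaded into nginx.
var runningConfig string

// buildConfig stands in for rendering the desired nginx.conf from the
// informer caches.
func buildConfig() string {
	return "rendered nginx.conf"
}

// worker drains the queue and reloads nginx only when the rendered
// configuration differs from the running one.
func worker(queue workqueue.Interface) {
	for {
		key, shutdown := queue.Get()
		if shutdown {
			return
		}
		if desired := buildConfig(); desired != runningConfig {
			fmt.Println("backend reload required")
			runningConfig = desired
			// the real controller writes nginx.conf and signals nginx here
		}
		queue.Done(key)
	}
}

func main() {
	queue := workqueue.New()
	// Each informer event handler calls queue.Add with a sync key.
	queue.Add("sync")
	queue.Add("sync") // duplicate while pending: deduplicated by the queue
	queue.ShutDown()  // the real controller keeps the queue open forever
	worker(queue)
}
```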
Thanks! The log message "backend reload required" is printed after it compares the config and decides it needs to reload, right? If that is the case, it is very quick and is not the cause of my problem. I cannot see how the high CPU usage for 30 minutes from start could be the garbage collector's fault; it would collect something, right? The RAM usage during those 30 minutes is constant. Also, it only seems to happen after start: when I edit the config, nginx reloads but the high CPU usage doesn't come back. Can I somehow disable GC to be sure? Can you replicate my problem where the load balancer doesn't trigger a reload after adding a new ingress on 0.226? It works for me on 0.225 but not on 0.226.
As you can see from the log, nginx reloads fine on an update event but not on a create event.
@Gregy please use quay.io/aledbf/nginx-ingress-controller:0.227
@Gregy are you connected to the kubernetes slack?
Tested. Our setup:
The changes you introduced here are indeed making the CPU usage lighter, although reloads still take a lot of time. I also noticed that it took around 4 minutes for… The memory usage above can be ignored; we have a faulty socket.io app ;)
@cu12 please update to 0.227
I tested with it; reload still takes ~12 seconds:
@aledbf also about v0.229: was the
There's nothing we can do about the reload time. This is the time it takes nginx to reload with your running workload.
With 0.229 I get no significant CPU usage from pod start until the first ingress change. After I change the ingress configuration and nginx reloads, I get high CPU usage again. Profile top:
My guess would be k8s.io/ingress/core/pkg/ingress/controller.(*GenericController).getBackendServers? Probably this goroutine?
@aledbf you mentioned earlier that each change event is added to a queue and then processed. What happens when multiple ingresses are changed at the same time? Is the configuration reloaded for all changed ingresses at once and the queue cleared? Or is nginx supposed to reload for each ingress change one by one?
@cu12 please use quay.io/aledbf/nginx-ingress-controller:0.232 |