nginx controller fails startup and 100% cpu usage with 100000 hosts #1386

Closed · Gregy opened this issue Sep 18, 2017 · 26 comments · Fixed by #1420
@Gregy
Contributor

Gregy commented Sep 18, 2017

Hi, I am testing a configuration with 100000 hosts split into 10 ingresses. Each host looks like this:

- host: "test99999.testcluster.test.com"
    http:
      paths:
      -
        path: /
        backend:
          serviceName: web
          servicePort: 80
      -
        path: /testecho
        backend:
          serviceName: echoheaders-x
          servicePort: 80

The nginx controller 0.8.3 appears to work fine with this configuration, but the latest 0.9.0-beta.13 fails to start. It gets stuck on startup with one controller process using 100% CPU, the monitoring port 10254 is inaccessible (the connection times out), and nginx.conf is never generated.

@aledbf
Member

aledbf commented Sep 18, 2017

@Gregy please post the ingress logs

@Gregy
Contributor Author

Gregy commented Sep 18, 2017

Sure.

The nginx config was still not updated and nginx was not listening on port 80 at the time the logs end.

default-verbosity.txt
verbose-logs.zip

@aledbf
Member

aledbf commented Sep 18, 2017

@Gregy do you have a script that creates these ingress rules?
What are you trying to achieve with this test?

@Gregy
Contributor Author

Gregy commented Sep 18, 2017

Script is here:

#!/bin/bash
# Usage: <this script> <start> <end>  (prints a single Ingress manifest with hosts test<start>..test<end>)
cat <<EOF

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: testingress
  annotations:
    kubernetes.io/ingress.class: nginx
spec:
  backend:
    serviceName: web
    servicePort: 80
  rules:
EOF

for i in `seq $1 $2`; do
cat <<EOF
  - host: "test$i.testcluster.test.com"
    http:
      paths:
      -
        path: /
        backend:
          serviceName: web
          servicePort: 80
      -
        path: /testecho
        backend:
          serviceName: echoheaders-x
          servicePort: 80

EOF
done

I ran it 10 times to get 100000 hosts split across 10 ingresses.

I am trying to see if the ingress controller can scale to this number of hosts. I have a site with tens of thousands of subdomains that I need to manage (distribute each subdomain's traffic to multiple pods) and want to know if nginx-lb can do it. It works with 0.8.3, but the beta gets stuck somewhere.

@aledbf
Member

aledbf commented Sep 19, 2017

@Gregy please update the image to quay.io/aledbf/nginx-ingress-controller:0.225 and add the flag --update-status=false
This version contains several improvements in CPU and RAM utilization.
I tested these changes with 12000 ingress rules.

That being said, I'm not sure using an ingress controller beyond 10000 rules makes sense. One of the reasons is that we use several informers in the controller (a mirror of the information contained in the API server: ingresses, secrets, services, endpoints, nodes) that consume RAM, and after any change in those resources we need to compare the running configuration with the new state of the cluster (this is a CPU-intensive task) and reload nginx.

PR #1387 contains the changes that improve the controller.
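
For intuition, here is a much-simplified sketch of that comparison step. The Server and Configuration types below are hypothetical stand-ins, not the controller's real types, but they show why comparing a configuration with ~100000 server blocks on every change is CPU-intensive:

package main

import "fmt"

// Hypothetical, heavily simplified stand-ins for the controller's
// ingress.Server and ingress.Configuration types (not the real ones).
type Server struct {
	Hostname  string
	Locations []string // paths served by this virtual host
}

type Configuration struct {
	Servers []*Server
}

// Equal walks every server, and every location inside it, of both
// configurations. With ~100000 hosts that is a huge number of string
// comparisons on every sync, which is why the comparison step is
// CPU-intensive at this scale.
func (c *Configuration) Equal(other *Configuration) bool {
	if len(c.Servers) != len(other.Servers) {
		return false
	}
	for i, s := range c.Servers {
		o := other.Servers[i]
		if s.Hostname != o.Hostname || len(s.Locations) != len(o.Locations) {
			return false
		}
		for j, loc := range s.Locations {
			if loc != o.Locations[j] {
				return false
			}
		}
	}
	return true
}

// reconcile reloads nginx only when the desired state differs from the
// running one; the Equal call above is where the CPU time goes.
func reconcile(running, desired *Configuration, reload func() error) error {
	if running.Equal(desired) {
		return nil
	}
	return reload()
}

func main() {
	running := &Configuration{Servers: []*Server{{Hostname: "test1.testcluster.test.com", Locations: []string{"/", "/testecho"}}}}
	desired := &Configuration{Servers: []*Server{{Hostname: "test1.testcluster.test.com", Locations: []string{"/", "/testecho"}}}}
	err := reconcile(running, desired, func() error { fmt.Println("reloading nginx"); return nil })
	fmt.Println("configs equal, reload skipped, err =", err)
}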

@aledbf
Member

aledbf commented Sep 19, 2017

@Gregy another detail: please don't use just one ingress rule, use something like

COUNT=1
while [ $COUNT -le 12000 ]
do
echo "
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: test-$COUNT
spec:
  rules:
  - host: test$COUNT.testcluster.test.com
    http:
      paths:
      -
        path: /
        backend:
          serviceName: http-svc
          servicePort: 80
      -
        path: /testecho
        backend:
          serviceName: http-svc
          servicePort: 80
" | kubectl create -f -
COUNT=$(( $COUNT + 1 ))
done

@Gregy
Contributor Author

Gregy commented Sep 19, 2017

Thank you very much.

With your image the nginx config is generated and the load balancer is working as expected, even with 100000 hosts. The CPU usage problem remains, though. I am not talking about usage when regenerating the config; that is expected. One of the controller threads is spinning all the time, even when no changes are being made. I took a CPU profile:

(pprof) top
29.82s of 30.05s total (99.23%)
Dropped 31 nodes (cum <= 0.15s)
Showing top 10 nodes out of 12 (cum >= 29.91s)
      flat  flat%   sum%        cum   cum%
    22.38s 74.48% 74.48%     26.89s 89.48%  k8s.io/ingress/core/pkg/ingress.(*Server).Equal
     3.02s 10.05% 84.53%     29.91s 99.53%  k8s.io/ingress/core/pkg/ingress.(*Configuration).Equal
     2.97s  9.88% 94.41%      2.97s  9.88%  runtime.memeqbody
     1.45s  4.83% 99.23%      1.45s  4.83%  runtime.eqstring
         0     0% 99.23%     29.91s 99.53%  k8s.io/ingress/core/pkg/ingress/controller.(*GenericController).(k8s.io/ingress/core/pkg/ingress/controller.syncIngress)-fm
         0     0% 99.23%     29.91s 99.53%  k8s.io/ingress/core/pkg/ingress/controller.(*GenericController).syncIngress
         0     0% 99.23%     29.91s 99.53%  k8s.io/ingress/core/pkg/task.(*Queue).(k8s.io/ingress/core/pkg/task.worker)-fm
         0     0% 99.23%     29.91s 99.53%  k8s.io/ingress/core/pkg/task.(*Queue).Run
         0     0% 99.23%     29.91s 99.53%  k8s.io/ingress/core/pkg/task.(*Queue).worker
         0     0% 99.23%     29.91s 99.53%  k8s.io/ingress/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil

My guess would be that it is this goroutine:

goroutine 103 [runnable]:
k8s.io/ingress/core/pkg/ingress.(*Server).Equal(0xc4282b2000, 0xc47c80f790, 0x18600)
	/home/aledbf/go/src/k8s.io/ingress/core/pkg/ingress/types_equals.go:276 +0x54a
k8s.io/ingress/core/pkg/ingress.(*Configuration).Equal(0xc420104880, 0xc42020b700, 0x0)
	/home/aledbf/go/src/k8s.io/ingress/core/pkg/ingress/types_equals.go:76 +0x205
k8s.io/ingress/core/pkg/ingress/controller.(*GenericController).syncIngress(0xc4202e7590, 0x133e3e0, 0xc4203422d0, 0xc4203422d0, 0xc420342200)
	/home/aledbf/go/src/k8s.io/ingress/core/pkg/ingress/controller/controller.go:284 +0xa35
k8s.io/ingress/core/pkg/ingress/controller.(*GenericController).(k8s.io/ingress/core/pkg/ingress/controller.syncIngress)-fm(0x133e3e0, 0xc4203422d0, 0xa, 0xc4247dfdf8)
	/home/aledbf/go/src/k8s.io/ingress/core/pkg/ingress/controller/controller.go:169 +0x3e
k8s.io/ingress/core/pkg/task.(*Queue).worker(0xc4202de570)
	/home/aledbf/go/src/k8s.io/ingress/core/pkg/task/queue.go:89 +0x11e
k8s.io/ingress/core/pkg/task.(*Queue).(k8s.io/ingress/core/pkg/task.worker)-fm()
	/home/aledbf/go/src/k8s.io/ingress/core/pkg/task/queue.go:49 +0x2a
k8s.io/ingress/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc4205f7fa8)
	/home/aledbf/go/src/k8s.io/ingress/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x5e
k8s.io/ingress/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc47ea9bfa8, 0x3b9aca00, 0x0, 0x1210e01, 0xc42022a480)
	/home/aledbf/go/src/k8s.io/ingress/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbd
k8s.io/ingress/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc4205f7fa8, 0x3b9aca00, 0xc42022a480)
	/home/aledbf/go/src/k8s.io/ingress/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
k8s.io/ingress/core/pkg/task.(*Queue).Run(0xc4202de570, 0x3b9aca00, 0xc42022a480)
	/home/aledbf/go/src/k8s.io/ingress/core/pkg/task/queue.go:49 +0x55
created by k8s.io/ingress/core/pkg/ingress/controller.GenericController.Start
	/home/aledbf/go/src/k8s.io/ingress/core/pkg/ingress/controller/controller.go:1216 +0x4cb
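
For readers who want to reproduce this kind of profile: any Go process that imports net/http/pprof and serves HTTP exposes the standard profiling endpoints, which go tool pprof can sample. A minimal sketch follows; where the ingress controller itself exposes those endpoints (the port used below) is an assumption, not something taken from this thread:

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Any Go service that serves the default mux exposes the pprof
	// endpoints; 10254 here mirrors the controller's health port and is
	// only an assumption about where profiling is reachable.
	log.Fatal(http.ListenAndServe(":10254", nil))
}

// A 30-second CPU profile like the one above can then be collected with:
//   go tool pprof http://<pod-ip>:10254/debug/pprof/profile?seconds=30
// and inspected at the interactive (pprof) prompt, e.g. with "top".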

Also one error shows up in the logs:

E0919 08:32:17.843102       6 queue.go:62] could not get key for object force-sync: object has no meta: object does not implement the Object interfaces

Regarding nginx-lb's suitability for the task:
I think any system capable of handling this kind of use case (lots and lots of subdomains, each with its own config) would need to regenerate and reload the config on each change. I think the resources taken would be comparable.

@Gregy
Contributor Author

Gregy commented Sep 19, 2017

I thought the problem could be caused by a too-low sync period, so I tried to increase it by setting --sync-period=10h. The behavior remained the same.

@aledbf
Member

aledbf commented Sep 19, 2017

@Gregy please use quay.io/aledbf/nginx-ingress-controller:0.226

@aledbf
Member

aledbf commented Sep 19, 2017

Closing. The refactoring is already present in master.

@aledbf aledbf closed this as completed Sep 19, 2017
@aledbf
Member

aledbf commented Sep 19, 2017

@Gregy please reopen if you still have issues after using the image quay.io/aledbf/nginx-ingress-controller:0.226

@Gregy
Contributor Author

Gregy commented Sep 19, 2017

Thanks. I investigated the cpu usage again and this is the timeline:

T+0 - init
T+5 - log message "backend reload required"
T+95 - ingress backend successfully reloaded; nginx is correctly serving requests, but the ingress controller Go code is still spinning its wheels in the goroutine I posted earlier
T+2130 - the Go code stops spinning

I guess I am missing something. What is the code doing for that half an hour if the nginx config is finished and working in the first 2 minutes?

I tried with both 0.225 and 0.226. The only difference seems to be that the error is gone from the logs in 0.226. Unfortunately, adding a new ingress after start doesn't work in 0.226: I get the event message in the logs, but there is no "backend reload required" message and the nginx config doesn't reload.

Unfortunately I cannot reopen the issue :(

@aledbf aledbf reopened this Sep 19, 2017
@aledbf
Member

aledbf commented Sep 19, 2017

@Gregy some context about how we process the ingress rules: when the ingress controller starts, it receives the list of ingresses from the API server and an event handler is called (once per ingress rule). Each call adds the ingress to a worker queue, and in a different goroutine we check whether the running configuration is equal to the new one. If the configurations are not equal, we trigger an nginx reload.
In 0.226 I changed the initial sync to avoid queuing all the rules, which avoids the CPU usage you saw earlier.
Even with this change the RAM usage in the ingress controller is too high, and the CPU usage you see now is just the Go GC, not the equals comparison.

Please keep in mind that you need to disable the update of the status field using --update-status=false

Is this explanation clear enough? (please tell me)
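
A rough sketch of that flow, with hypothetical names standing in for the controller's real queue and sync code: event handlers enqueue keys, a single worker drains the queue, rebuilds the desired configuration, and reloads nginx only when it differs from the running one.

package main

import (
	"fmt"
	"time"
)

// syncKey identifies an object that needs to be reconciled
// (e.g. "namespace/ingress-name").
type syncKey string

// controller is a hypothetical, stripped-down version of the flow
// described above, not the real GenericController.
type controller struct {
	queue   chan syncKey
	running string // currently loaded nginx configuration (rendered text)
}

// onIngressEvent is what an informer event handler would call for
// every add/update/delete of an Ingress.
func (c *controller) onIngressEvent(key syncKey) {
	c.queue <- key
}

// worker performs the compare-and-reload in its own goroutine.
func (c *controller) worker(render func() string, reload func(cfg string) error) {
	for range c.queue {
		desired := render() // rebuild configuration from the full cached cluster state
		if desired == c.running {
			continue // running config already matches, skip the reload
		}
		fmt.Println("backend reload required")
		if err := reload(desired); err != nil {
			fmt.Println("reload failed:", err)
			continue
		}
		c.running = desired
		fmt.Println("ingress backend successfully reloaded...")
	}
}

func main() {
	c := &controller{queue: make(chan syncKey, 1024)}
	go c.worker(
		func() string { return "server { server_name test1.testcluster.test.com; }" },
		func(string) error { return nil },
	)
	c.onIngressEvent("default/testingress1")
	time.Sleep(100 * time.Millisecond) // let the worker run in this toy example
}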

@Gregy
Contributor Author

Gregy commented Sep 19, 2017

Thanks!
I understand the RAM usage is high; that is to be expected. My nginx-lb uses 20G for 100000 hosts (the majority is used by nginx itself) and that is fine with me.

The log message "backend reload required" is printed after it compares the config and decides it needs to reload, right? If that is the case, the comparison is very quick and is not the cause of my problem.

I cannot see how the high CPU usage for 30 minutes after start is the garbage collector's fault. It would be collecting something, right? The RAM usage during those 30 minutes is constant. Also, it only seems to happen after start: when I edit the config, nginx reloads but the high CPU usage doesn't come back. Can I somehow disable the GC to be sure?
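
For reference, Go's garbage collector can be switched off for an experiment like this, either by setting the GOGC=off environment variable on the controller container or programmatically from inside the process. A minimal illustration (not controller code):

package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Equivalent to starting the process with GOGC=off: a negative
	// percentage disables garbage collection until it is reset.
	previous := debug.SetGCPercent(-1)
	fmt.Println("previous GOGC percentage:", previous)

	// ... run the workload here; if the CPU usage stays high with the
	// collector disabled, the time is being spent somewhere else.
}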

Can you replicate my problem where the load balancer doesn't trigger a reload after adding a new ingress on 0.226? It works for me on 0.225 but not on 0.226.


I0919 19:20:05.845399       7 event.go:218] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"default", Name:"testingress1", UID:"8a60d824-9c77-11e7-808b-42010a800091", APIVersion:"extensions", ResourceVersion:"588646", FieldPath:""}): type: 'Normal' reason: 'UPDATE' Ingress default/testingress1
I0919 19:20:10.213901       7 controller.go:291] backend reload required
I0919 19:21:45.384423       7 controller.go:300] ingress backend successfully reloaded...
I0919 19:28:42.574588       7 event.go:218] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"default", Name:"testingress201000", UID:"bf149b1c-9d70-11e7-ac86-42010a800091", APIVersion:"extensions", ResourceVersion:"589224", FieldPath:""}): type: 'Normal' reason: 'CREATE' Ingress default/testingress201000
<nothing happens here>

As you can see from the log, nginx reloads fine on an update event but not on a create event.

@aledbf
Member

aledbf commented Sep 19, 2017

@Gregy please use quay.io/aledbf/nginx-ingress-controller:0.227

@aledbf
Member

aledbf commented Sep 19, 2017

@Gregy are you connected to the kubernetes slack?

@cu12

cu12 commented Sep 19, 2017

Tested 225 and 226. I was eager to test, as we also experienced high CPU usage, although on a smaller scale. We also saw, and still see, reloads taking ~12 seconds.

Our setup:

  • 2 ingress controllers (limits are 1.5 cores and 10GB)
  • --sort-backends=true and --update-status=false
  • a couple of thousand ingresses
  • frequent changes in ingresses

The changes you introduced here are indeed making the CPU usage lighter, although reloads are still taking a lot of time. I also noticed that it took around 4 minutes for 225 to actually be able to serve requests, while with 226 it was 12 minutes.

Stats from 225: [graph screenshot]

Stats from 226: [graph screenshot]

Stats from 0.9.0-beta.13: [graph screenshot]

The memory usage above can be ignored; we have a faulty socket.io app ;)

@aledbf
Member

aledbf commented Sep 19, 2017

@cu12 please update to 0.227

@cu12

cu12 commented Sep 19, 2017

I tested with 227; it took ~18 minutes to serve the first requests, so I missed some of them, as readinessProbe.initialDelaySeconds was set to 900.

Reload still takes ~12 seconds:

I0919 22:34:37.135251       7 controller.go:301] ingress backend successfully reloaded...
I0919 22:35:01.751290       7 controller.go:292] backend reload required
I0919 22:35:13.914218       7 controller.go:301] ingress backend successfully reloaded...
I0919 22:35:21.918488       7 controller.go:292] backend reload required
I0919 22:35:34.714913       7 controller.go:301] ingress backend successfully reloaded...
I0919 22:35:41.564061       7 controller.go:292] backend reload required
I0919 22:35:55.310296       7 controller.go:301] ingress backend successfully reloaded...
I0919 22:35:58.941131       7 controller.go:292] backend reload required
I0919 22:36:12.816486       7 controller.go:301] ingress backend successfully reloaded...
I0919 22:36:17.984489       7 controller.go:292] backend reload required
I0919 22:36:30.978228       7 controller.go:301] ingress backend successfully reloaded...
I0919 22:37:03.339946       7 controller.go:292] backend reload required
I0919 22:37:18.724050       7 controller.go:301] ingress backend successfully reloaded...
I0919 22:37:22.534802       7 controller.go:292] backend reload required
I0919 22:37:36.557198       7 controller.go:301] ingress backend successfully reloaded...
I0919 22:37:40.178868       7 controller.go:292] backend reload required
I0919 22:37:54.467715       7 controller.go:301] ingress backend successfully reloaded...
I0919 22:37:58.044858       7 controller.go:292] backend reload required
I0919 22:38:11.107285       7 controller.go:301] ingress backend successfully reloaded...
I0919 22:38:15.235995       7 controller.go:292] backend reload required
I0919 22:38:28.638520       7 controller.go:301] ingress backend successfully reloaded...
I0919 22:38:32.136863       7 controller.go:292] backend reload required
I0919 22:38:45.727476       7 controller.go:301] ingress backend successfully reloaded...
I0919 22:38:49.537017       7 controller.go:292] backend reload required
I0919 22:39:02.925242       7 controller.go:301] ingress backend successfully reloaded...
I0919 22:39:06.572189       7 controller.go:292] backend reload required
I0919 22:39:20.810804       7 controller.go:301] ingress backend successfully reloaded...
I0919 22:39:24.469368       7 controller.go:292] backend reload required
I0919 22:39:37.479885       7 controller.go:301] ingress backend successfully reloaded...

@aledbf
Member

aledbf commented Sep 20, 2017

@Gregy @cu12 please use quay.io/aledbf/nginx-ingress-controller:0.229

@cu12

cu12 commented Sep 21, 2017

I have been running 229 for the past 24 hours; here is how it looks now. I haven't experienced any significant improvement over 227; reloads are still taking ~12 seconds.

[graph screenshot]

@biwwy-zz

@aledbf also, about v0.229: was the resyncPeriod in listers.go changed inside the image from the default 600s to something else? When running this particular version, it took about 2h for a new certificate to get added to the local store after secret creation.

@aledbf
Copy link
Member

aledbf commented Sep 25, 2017

haven't experienced any significant improvements from 227, reloads are still taking ~12 seconds.

There's nothing we can do about the reload time. This is the time it takes nginx to reload with your running workload.

@Gregy
Contributor Author

Gregy commented Sep 25, 2017

With 0.229 I get no significant CPU usage from pod start until the first ingress change. After I change the ingress configuration and nginx reloads, I get high CPU usage again.

Profile top:

    2390ms  6.94%  6.94%     7510ms 21.80%  runtime.mallocgc
    2260ms  6.56% 13.50%     4850ms 14.08%  runtime.scanobject
    1440ms  4.18% 17.68%     5980ms 17.36%  runtime.mapassign_fast64
    1230ms  3.57% 21.25%     1270ms  3.69%  runtime.heapBitsForObject
    1220ms  3.54% 24.79%     1220ms  3.54%  runtime.memclrNoHeapPointers
    1090ms  3.16% 27.95%    11360ms 32.98%  k8s.io/ingress/vendor/github.com/imdario/mergo.deepMerge
    1080ms  3.13% 31.09%    27180ms 78.90%  k8s.io/ingress/core/pkg/ingress/controller.(*GenericController).getBackendServers
    1060ms  3.08% 34.17%     1060ms  3.08%  runtime.nextFreeFast
     980ms  2.84% 37.01%     1110ms  3.22%  runtime.heapBitsSetType
     870ms  2.53% 39.54%      870ms  2.53%  runtime.duffcopy

My guess would be k8s.io/ingress/core/pkg/ingress/controller.(*GenericController).getBackendServers ?

Probably this goroutine?

goroutine 100578 [runnable]:
k8s.io/ingress/core/pkg/ingress/annotations/parser.GetIntAnnotation(0x155fd2f, 0x28, 0xc42fa72870, 0x1e401e0, 0x0, 0x0)
	/home/aledbf/go/src/k8s.io/ingress/core/pkg/ingress/annotations/parser/main.go:96 +0x105
k8s.io/ingress/core/pkg/ingress/annotations/healthcheck.healthCheck.Parse(0x7f558b195988, 0xc42044a2d0, 0xc42fa72870, 0xc4202b0a38, 0xc420227b60, 0x1e117f0, 0xc4202b0a10)
	/home/aledbf/go/src/k8s.io/ingress/core/pkg/ingress/annotations/healthcheck/main.go:55 +0xa8
k8s.io/ingress/core/pkg/ingress/annotations/healthcheck.(*healthCheck).Parse(0xc42022a590, 0xc42fa72870, 0xc4202b41c0, 0xc4235e3201, 0x4, 0x4)
	<autogenerated>:1 +0x4f
k8s.io/ingress/core/pkg/ingress/controller.(*annotationExtractor).Extract(0xc42044a250, 0xc42fa72870, 0x5)
	/home/aledbf/go/src/k8s.io/ingress/core/pkg/ingress/controller/annotations.go:93 +0x112
k8s.io/ingress/core/pkg/ingress/controller.(*GenericController).getBackendServers(0xc42044a1e0, 0xc42bc4dc20, 0x18, 0x0, 0x0, 0x0, 0x0)
	/home/aledbf/go/src/k8s.io/ingress/core/pkg/ingress/controller/controller.go:486 +0x30d
k8s.io/ingress/core/pkg/ingress/controller.(*GenericController).syncIngress(0xc42044a1e0, 0x1340ba0, 0xc452f32f90, 0xc452f32f90, 0xc452f32e00)
	/home/aledbf/go/src/k8s.io/ingress/core/pkg/ingress/controller/controller.go:256 +0xb5
k8s.io/ingress/core/pkg/ingress/controller.(*GenericController).(k8s.io/ingress/core/pkg/ingress/controller.syncIngress)-fm(0x1340ba0, 0xc452f32f90, 0xa, 0xc4235e3df8)
	/home/aledbf/go/src/k8s.io/ingress/core/pkg/ingress/controller/controller.go:172 +0x3e
k8s.io/ingress/core/pkg/task.(*Queue).worker(0xc4201e8360)
	/home/aledbf/go/src/k8s.io/ingress/core/pkg/task/queue.go:89 +0x11e
k8s.io/ingress/core/pkg/task.(*Queue).(k8s.io/ingress/core/pkg/task.worker)-fm()
	/home/aledbf/go/src/k8s.io/ingress/core/pkg/task/queue.go:49 +0x2a
k8s.io/ingress/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc42058ffa8)
	/home/aledbf/go/src/k8s.io/ingress/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x5e
k8s.io/ingress/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc4235e3fa8, 0x3b9aca00, 0x0, 0x1210e01, 0xc4204c0360)
	/home/aledbf/go/src/k8s.io/ingress/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbd
k8s.io/ingress/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc42058ffa8, 0x3b9aca00, 0xc4204c0360)
	/home/aledbf/go/src/k8s.io/ingress/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
k8s.io/ingress/core/pkg/task.(*Queue).Run(0xc4201e8360, 0x3b9aca00, 0xc4204c0360)
	/home/aledbf/go/src/k8s.io/ingress/core/pkg/task/queue.go:49 +0x55
created by k8s.io/ingress/core/pkg/ingress/controller.(*GenericController).Start
	/home/aledbf/go/src/k8s.io/ingress/core/pkg/ingress/controller/controller.go:1230 +0x6b6

@aledbf you mentioned earlier that each change event is added to a queue and then processed. What happens when multiple ingresses are changed at the same time? Is the configuration reloaded for all changed ingresses at once and the queue cleared, or is nginx supposed to be reloaded for each ingress change one by one?
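
For illustration only, and not as a statement of what this controller actually does: a common way to handle a burst of changes is a deduplicating queue, where simultaneous events collapse into a single pending sync and the next sync rebuilds the configuration from the full cached state, so nginx is reloaded once per batch rather than once per ingress. A sketch with hypothetical types:

package main

import (
	"fmt"
	"sync"
	"time"
)

// dedupQueue collapses repeated enqueues of the same key while a sync
// is still pending, so a burst of ingress changes leads to one rebuild.
type dedupQueue struct {
	mu      sync.Mutex
	pending map[string]struct{}
	wake    chan struct{}
}

func newDedupQueue() *dedupQueue {
	return &dedupQueue{
		pending: map[string]struct{}{},
		wake:    make(chan struct{}, 1),
	}
}

// Add records a changed ingress and wakes the worker if it is idle.
func (q *dedupQueue) Add(key string) {
	q.mu.Lock()
	q.pending[key] = struct{}{}
	q.mu.Unlock()
	select {
	case q.wake <- struct{}{}:
	default: // a wake-up is already pending
	}
}

// run drains everything that accumulated since the last sync and calls
// syncAll once for the whole batch.
func (q *dedupQueue) run(syncAll func(keys []string)) {
	for range q.wake {
		q.mu.Lock()
		keys := make([]string, 0, len(q.pending))
		for k := range q.pending {
			keys = append(keys, k)
		}
		q.pending = map[string]struct{}{}
		q.mu.Unlock()
		if len(keys) > 0 {
			syncAll(keys)
		}
	}
}

func main() {
	q := newDedupQueue()
	q.Add("default/testingress1")
	q.Add("default/testingress2")
	q.Add("default/testingress1") // coalesced with the first add
	go q.run(func(keys []string) {
		fmt.Println("single reload covering", len(keys), "changed ingresses")
	})
	time.Sleep(100 * time.Millisecond) // let the worker drain in this toy example
}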

@cu12

cu12 commented Sep 26, 2017

getBackendServers also bites us.

@aledbf
Member

aledbf commented Sep 26, 2017

@cu12 please use quay.io/aledbf/nginx-ingress-controller:0.232
