Router makes zombies permanently #331

Open
Bregor opened this issue Mar 10, 2017 · 9 comments · Fixed by teamhephy/router#6 · May be fixed by #356

@Bregor
Contributor

Bregor commented Mar 10, 2017

Kubernetes:

$ kubectl version --short
Client Version: v1.5.4
Server Version: v1.5.4

Deis:

$ helm list
NAME            	REVISION	UPDATED                 	STATUS  	CHART                    	NAMESPACE
deis-workflow   	6       	Thu Mar  9 12:41:00 2017	DEPLOYED	workflow-v2.12.0         	deis

Router:

$ kubectl get deployment -n deis deis-router -o jsonpath='{.spec.template.spec.containers[0].image}'
quay.io/deis/router:v2.11.0

Zombies (from ps auxffww):

_apt     30939  0.1  0.0 566916 18104 ?        Ssl  Mar09   2:39      |   \_ /opt/router/sbin/router
_apt     30959  0.0  0.0   4540    80 ?        S    Mar09   0:00      |       \_ cat
_apt     30972  0.0  0.0  28580  5128 ?        S    Mar09   0:00      |       \_ nginx: master process /opt/router/sbin/nginx
_apt     21911  0.0  0.0  28580  3136 ?        S    14:14   0:00      |       |   \_ nginx: worker process
_apt     21912  0.0  0.0  28580  3136 ?        S    14:14   0:00      |       |   \_ nginx: worker process
_apt     21913  0.0  0.0  28580  3136 ?        S    14:14   0:00      |       |   \_ nginx: worker process
_apt     21914  0.0  0.0  28580  3136 ?        S    14:14   0:00      |       |   \_ nginx: worker process
_apt     21915  0.0  0.0  28580  3136 ?        S    14:14   0:00      |       |   \_ nginx: worker process
_apt     21916  0.0  0.0  28580  3136 ?        S    14:14   0:00      |       |   \_ nginx: worker process
_apt     21917  0.0  0.0  28580  3136 ?        S    14:14   0:00      |       |   \_ nginx: worker process
_apt     21918  0.0  0.0  28580  3136 ?        S    14:14   0:00      |       |   \_ nginx: worker process
_apt     21919  0.0  0.0  28580  3136 ?        S    14:14   0:00      |       |   \_ nginx: worker process
_apt     21920  0.0  0.0  28580  3136 ?        S    14:14   0:00      |       |   \_ nginx: worker process
_apt     21921  0.0  0.0  28580  3136 ?        S    14:14   0:00      |       |   \_ nginx: worker process
_apt     21922  0.0  0.0  28580  3136 ?        S    14:14   0:00      |       |   \_ nginx: worker process
_apt     30986  0.0  0.0      0     0 ?        Z    Mar09   0:00      |       \_ [nginx] <defunct>
_apt       557  0.0  0.0      0     0 ?        Z    Mar09   0:00      |       \_ [nginx] <defunct>
_apt      3734  0.0  0.0      0     0 ?        Z    Mar09   0:00      |       \_ [nginx] <defunct>
_apt      7037  0.0  0.0      0     0 ?        Z    Mar09   0:00      |       \_ [nginx] <defunct>
_apt      9379  0.0  0.0      0     0 ?        Z    Mar09   0:00      |       \_ [nginx] <defunct>
_apt      5466  0.0  0.0      0     0 ?        Z    Mar09   0:00      |       \_ [nginx] <defunct>
_apt     12015  0.0  0.0      0     0 ?        Z    Mar09   0:00      |       \_ [nginx] <defunct>
_apt     18298  0.0  0.0      0     0 ?        Z    10:46   0:00      |       \_ [nginx] <defunct>
_apt     19393  0.0  0.0      0     0 ?        Z    10:47   0:00      |       \_ [nginx] <defunct>
_apt     22430  0.0  0.0      0     0 ?        Z    14:02   0:00      |       \_ [nginx] <defunct>
_apt     24104  0.0  0.0      0     0 ?        Z    14:03   0:00      |       \_ [nginx] <defunct>
_apt     24564  0.0  0.0      0     0 ?        Z    14:03   0:00      |       \_ [nginx] <defunct>
_apt     25887  0.0  0.0      0     0 ?        Z    14:04   0:00      |       \_ [nginx] <defunct>
_apt     21910  0.0  0.0      0     0 ?        Z    14:14   0:00      |       \_ [nginx] <defunct>
@vdice
Member

vdice commented Mar 10, 2017

@vdice vdice added this to the v2.13 milestone Mar 10, 2017
@mboersma mboersma added the bug label Mar 15, 2017
@bacongobbler
Member

I'm able to see this as well with v2.12.0 after running for a few minutes:

><> kd get po | grep router
deis-router-1001573613-2rc22             1/1       Running   0          16m
><> kd exec deis-router-1001573613-2rc22 ps faux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
router      25  0.0  0.1  34428  2952 ?        Rs   16:01   0:00 ps faux
router       1  0.0  1.1  97376 23152 ?        Ssl  15:46   0:00 /opt/router/sbin/router
router       7  0.0  0.0   4540   632 ?        S    15:46   0:00 cat
router      14  0.0  0.1  28332  3868 ?        S    15:46   0:00 nginx: master process /opt/router/sbin/nginx
router      23  0.0  0.2  28332  4156 ?        S    15:48   0:00  \_ nginx: worker process
router      24  0.0  0.1  28332  2480 ?        S    15:48   0:00  \_ nginx: worker process
router      16  0.0  0.0      0     0 ?        Z    15:46   0:00 [nginx] <defunct>
router      19  0.0  0.0      0     0 ?        Z    15:47   0:00 [nginx] <defunct>
router      22  0.0  0.0      0     0 ?        Z    15:48   0:00 [nginx] <defunct>

Going to try downgrading and see if I can diagnose when this started to occur.

@bacongobbler bacongobbler self-assigned this Mar 22, 2017
@bacongobbler
Member

bacongobbler commented Mar 22, 2017

I was able to reproduce this using router versions v2.9.0, v2.10.0, v2.11.0, and the canary release. All of them showed zombie processes.

><> kd get po deis-router-1097387089-f6r4n -o yaml | grep canary | head -n 1
    image: quay.io/deisci/router:canary
><> kd exec deis-router-1097387089-f6r4n ps faux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
router      31  0.0  0.1  34428  2892 ?        Rs   17:02   0:00 ps faux
router       1  0.1  1.4 162912 29220 ?        Ssl  17:00   0:00 /opt/router/sbin/router
router       8  0.0  0.0   4540   656 ?        S    17:00   0:00 cat
router      12  0.0  0.3  28312  6192 ?        S    17:00   0:00 nginx: master process /opt/router/sbin/nginx
router      22  0.0  0.1  28312  2592 ?        S    17:01   0:00  \_ nginx: worker process
router      23  0.0  0.1  28312  2592 ?        S    17:01   0:00  \_ nginx: worker process
router      15  0.0  0.0      0     0 ?        Z    17:00   0:00 [nginx] <defunct>
router      18  0.0  0.0      0     0 ?        Z    17:00   0:00 [nginx] <defunct>
router      21  0.0  0.0      0     0 ?        Z    17:01   0:00 [nginx] <defunct>

I wonder if this has to do with kubernetes/kubernetes#39334, as @vdice mentioned. If so, it should be resolved by upgrading to k8s v1.6.

@bacongobbler bacongobbler modified the milestones: v2.14, v2.13 Apr 3, 2017
@vdice
Member

vdice commented May 1, 2017

@Bregor are you still seeing behavior like this on k8s clusters >= 1.5?

@Bregor
Contributor Author

Bregor commented May 2, 2017

@vdice

$ kubectl version --short
Client Version: v1.6.2
Server Version: v1.5.7

...
_apt      1215  0.0  0.0      0     0 ?        Z    Apr14   0:00 [nginx] <defunct>
_apt      1241  0.0  0.0      0     0 ?        Z    Apr22   0:00 [nginx] <defunct>
_apt      2170  0.0  0.0      0     0 ?        Z    Apr14   0:00 [nginx] <defunct>
_apt      2550  0.0  0.0      0     0 ?        Z    Apr27   0:00 [nginx] <defunct>
_apt      3355  0.0  0.0      0     0 ?        Z    Apr28   0:00 [nginx] <defunct>
...

@Bregor
Contributor Author

Bregor commented May 2, 2017

$ helm list
NAME            	REVISION	UPDATED                 	STATUS  	CHART                    	NAMESPACE
deis-workflow   	7       	Fri Apr  7 18:59:53 2017	DEPLOYED	workflow-v2.13.0         	deis

@Bregor
Contributor Author

Bregor commented May 2, 2017

@vdice same here with kubernetes-1.6.2 (both client and server)

@vdice vdice removed this from the v2.14 milestone May 2, 2017
@felixbuenemann
Contributor

I am also seeing this, with around 1879 nginx zombies for the router pod. I also grepped the logs for "Router configuration has changed in k8s" and it was logged 1879 times, so one zombie process is produced on each config reload.
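For anyone who wants to repeat that comparison, here is a rough Go sketch that counts zombie entries in `ps aux`-style output and counts occurrences of the reload message in a log dump. The function names `countZombies` and `countReloads` are made up for this example, the `ps` sample is copied from an earlier comment in this thread, and the log lines are placeholders rather than real router output.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// samplePS is taken from the `ps faux` output posted earlier in this thread.
const samplePS = `USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
router       1  0.0  1.1  97376 23152 ?        Ssl  15:46   0:00 /opt/router/sbin/router
router      14  0.0  0.1  28332  3868 ?        S    15:46   0:00 nginx: master process /opt/router/sbin/nginx
router      16  0.0  0.0      0     0 ?        Z    15:46   0:00 [nginx] <defunct>
router      19  0.0  0.0      0     0 ?        Z    15:47   0:00 [nginx] <defunct>
router      22  0.0  0.0      0     0 ?        Z    15:48   0:00 [nginx] <defunct>`

// sampleLogs is a placeholder: only the marker text comes from the thread,
// the surrounding log format is invented for illustration.
const sampleLogs = `placeholder: Router configuration has changed in k8s
placeholder: some unrelated line
placeholder: Router configuration has changed in k8s
placeholder: Router configuration has changed in k8s`

// countZombies counts lines whose STAT column (8th field in `ps aux`
// layout) starts with "Z", which marks a defunct process.
func countZombies(psOutput string) int {
	n := 0
	sc := bufio.NewScanner(strings.NewReader(psOutput))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 8 && strings.HasPrefix(fields[7], "Z") {
			n++
		}
	}
	return n
}

// countReloads counts log lines that contain the given marker text.
func countReloads(logs, marker string) int {
	n := 0
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		if strings.Contains(sc.Text(), marker) {
			n++
		}
	}
	return n
}

func main() {
	z := countZombies(samplePS)
	r := countReloads(sampleLogs, "Router configuration has changed in k8s")
	fmt.Printf("zombies=%d reloads=%d\n", z, r)
}
```

If the two counts match on a real pod, that points at the reload path as the source of the zombies.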

If you look at nginx/commands.go, the nginx server is reloaded by running "nginx -s reload" via exec.Command() and cmd.Start() from os/exec, but there are no calls to cmd.Wait(). So when the "nginx -s reload" command finishes, its exit status is never collected and it stays behind as a zombie process.

This is literally a one-line fix, so I'll create a PR and maybe it still gets merged, even though Deis Workflow is EOL.
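To illustrate the difference described above, here is a minimal sketch, not the router's actual code. The function names `reloadLeaky` and `reloadReaped` are invented for this example, and `true` stands in for `nginx -s reload` so it runs on any POSIX system without nginx installed.

```go
package main

import (
	"fmt"
	"os/exec"
)

// reloadLeaky mirrors the pattern described above: the child is started but
// never waited on, so after it exits it stays in the process table as a
// zombie for as long as the parent keeps running.
func reloadLeaky(name string, args ...string) error {
	return exec.Command(name, args...).Start()
}

// reloadReaped starts the child and then waits for it. Wait collects the
// exit status, which lets the kernel drop the process table entry.
func reloadReaped(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	if err := cmd.Start(); err != nil {
		return err
	}
	return cmd.Wait()
}

func main() {
	// "true" is a stand-in for "nginx -s reload" (assumption for the demo).
	if err := reloadReaped("true"); err != nil {
		fmt.Println("reload failed:", err)
		return
	}
	fmt.Println("reload finished and child reaped")
}
```

Calling `cmd.Run()` instead is equivalent to `Start()` followed by `Wait()`. Whether the real fix blocks on the reload or waits in a goroutine is a design choice this sketch does not settle.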

@kingdonb

This issue was resolved in teamhephy/router#6

Deis team: someone found this issue while searching, and it was their problem. It turned out they are still using Deis Workflow. Our advice to them was to upgrade to the latest Hephy Workflow; guidance on how to do that is at www.teamhephy.com.

Do you think we should get someone to go through all of the open issues and mark them as closed, perhaps with a note to check with github.com/teamhephy/workflow for follow-up if help is needed?

I don't want to make extra work for anyone, but maybe there is a script for cleaning up EOL repos out there somewhere already...
