
zombie processes caused by health checks #2441

Closed
fmax opened this issue Apr 27, 2020 · 3 comments · Fixed by #2453 or #3559
fmax commented Apr 27, 2020

Which chart:
bitnami/redis chart, version 9.0.2 and also 10.5.7

Describe the bug
On the affected Docker nodes, you'll find redis-cli zombie processes:

# ps -ef | grep defunc
1001       920 29540  0 04:17 ?        00:00:00 [redis-cli] <defunct>
1001      1519 29540  0 Apr22 ?        00:00:00 [redis-cli] <defunct>
1001      9561 14677  0 Apr15 ?        00:00:00 [redis-cli] <defunct>
1001     12851 29540  0 Apr16 ?        00:00:00 [redis-cli] <defunct>
1001     16168 29540  0 Apr17 ?        00:00:00 [redis-cli] <defunct>
1001     22101 29540  0 Apr18 ?        00:00:00 [redis-cli] <defunct>
1001     24210 14677  0 07:58 ?        00:00:00 [redis-cli] <defunct>
1001     24285 14677  0 Apr22 ?        00:00:00 [redis-cli] <defunct>
1001     27874 14677  0 Apr21 ?        00:00:00 [redis-cli] <defunct>
1001     28288 14677  0 Apr16 ?        00:00:00 [redis-cli] <defunct>

These zombies appear to be caused by the readiness/liveness health checks when the slave or master cannot be reached within the configured timeout.

see also https://github.com/bitnami/bitnami-docker-redis/issues/165
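For reference, one way to confirm on the node that a defunct redis-cli really hangs off the container's main process (the PID below is a placeholder, take one from the ps output above):

# pick one defunct PID from the list above, e.g. 920, and inspect its parent
ps -o pid,ppid,stat,cmd -p $(ps -o ppid= -p 920)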

To Reproduce
Steps to reproduce the behavior:

  1. Simulate a connection problem with an iptables DROP rule: iptables -I INPUT -p tcp --dport 6379 -j DROP

  2. Log in to the redis-master or redis-slave container (pod) and execute the health check:

docker exec -ti redis-master  bash
 
I have no name!@af3cc23c7511:/$ timeout -s 9 1 redis-cli -a 1q2w3e4r --no-auth-warning -h 192.168.99.99 -p 6379 ping
Killed
  3. Look at the OS of the Docker node:
# ps -ef | grep defunc
karli     3681  2105  0 09:42 pts/0    00:00:00 [redis-cli] <defunct>
root      3683  2832  0 09:42 pts/1    00:00:00 grep --color=auto defunc
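To restore connectivity after the test, deleting the rule from step 1 should be enough; note that the defunct entry itself stays visible until its parent reaps it or exits:

# remove the test rule again, then re-check for zombies
iptables -D INPUT -p tcp --dport 6379 -j DROP
ps -ef | grep defunc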

Expected behavior
No zombie processes on the Docker nodes after the Redis service has been running for a few days.

Version of Helm and Kubernetes:

  • Output of helm version:
branch": "master",
"commit": "f5f5d10c5255216e757f8bec5651aa8a"
  • Output of kubectl version:
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.1", GitCommit:"d647ddbd755faf07169599a625faf302ffc34458", GitTreeState:"clean", BuildDate:"2019-10-02T17:01:15Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:07:13Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}

Additional context
After looking at the health check scripts mounted from the Helm chart via ConfigMap at /health and running some tests, I found a solution: instead of timeout -s 9, simply use timeout -s 3 in those scripts (see the sketch below).

Because the kill signal for the timeout command is hard-coded, please change it to 3, or replace it with an environment variable for more flexibility.
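For illustration only, the change inside the mounted scripts looks roughly like this (the exact redis-cli flags and how the timeout value is passed differ per script, so treat this as a sketch, not the actual chart content):

# current: signal 9 (SIGKILL) is hard coded; $1 = timeout in seconds passed by the probe (assumption)
timeout -s 9 $1 redis-cli -h localhost -p $REDIS_PORT ping
# proposed: signal 3 (SIGQUIT), which in my tests left no defunct redis-cli behind
timeout -s 3 $1 redis-cli -h localhost -p $REDIS_PORT ping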
thx

@javsalgar
Contributor

Hi,

Thank you for the input! We will work on fixing the issue. I will open a PR as soon as I have one.

@viceice
Contributor

viceice commented Jan 17, 2024

I still get those zombies with the current version; only setting shareProcessNamespace=true seems to solve it. Will investigate.
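In case it helps others: if your chart version exposes values for this (check with helm show values bitnami/redis first, the keys below are an assumption), enabling it looks roughly like:

# keys are assumptions, verify them against your chart's values.yaml
helm upgrade my-release bitnami/redis \
  --set master.shareProcessNamespace=true \
  --set replica.shareProcessNamespace=true

With a shared process namespace, the pod's pause container runs as PID 1 and reaps orphaned children, which is why this hides the zombies.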

@pascal-hofmann

See #10002 (comment). The same happens for the redis/valkey containers, with the exception that these do not reap children regularly. Thus we run out of PIDs in the cgroup and redis/valkey crashes at some point. We were not able to fix this using a different timeoutSeconds value. I think this can only be fixed by enabling shareProcessNamespace or removing the timeout command from the readiness/liveness probes.
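A quick way to see how close a pod is to that limit (sketch only; the pod name is a placeholder and the paths differ between cgroup v1 and v2):

# cgroup v2 path first, cgroup v1 fallback; pod name is a placeholder
kubectl exec my-redis-master-0 -- sh -c 'cat /sys/fs/cgroup/pids.current 2>/dev/null || cat /sys/fs/cgroup/pids/pids.current'
kubectl exec my-redis-master-0 -- sh -c 'cat /sys/fs/cgroup/pids.max 2>/dev/null || cat /sys/fs/cgroup/pids/pids.max'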
