zombie processes caused by health checks #2441

fmax · 2020-04-27T13:47:56Z

Which chart:
bitnami redis template version 9.0.2 and also 10.5.7

Describe the bug
On the affected Docker nodes, you'll find redis-cli zombie processes:

# ps -ef | grep defunc
1001       920 29540  0 04:17 ?        00:00:00 [redis-cli] <defunct>
1001      1519 29540  0 Apr22 ?        00:00:00 [redis-cli] <defunct>
1001      9561 14677  0 Apr15 ?        00:00:00 [redis-cli] <defunct>
1001     12851 29540  0 Apr16 ?        00:00:00 [redis-cli] <defunct>
1001     16168 29540  0 Apr17 ?        00:00:00 [redis-cli] <defunct>
1001     22101 29540  0 Apr18 ?        00:00:00 [redis-cli] <defunct>
1001     24210 14677  0 07:58 ?        00:00:00 [redis-cli] <defunct>
1001     24285 14677  0 Apr22 ?        00:00:00 [redis-cli] <defunct>
1001     27874 14677  0 Apr21 ?        00:00:00 [redis-cli] <defunct>
1001     28288 14677  0 Apr16 ?        00:00:00 [redis-cli] <defunct>

These zombies appear to be caused by the readiness/liveness health checks when the slave or master cannot be reached within the configured timeout.

See also https://github.com/bitnami/bitnami-docker-redis/issues/165

To Reproduce
Steps to reproduce the behavior:

Simulate a connection problem with an iptables DROP rule: iptables -I INPUT -p tcp --dport 6379 -j DROP
Log in to the redis-master or redis-slave container (pod) and execute the health check:

docker exec -ti redis-master  bash
 
I have no name!@af3cc23c7511:/$ timeout -s 9 1 redis-cli -a 1q2w3e4r --no-auth-warning -h 192.168.99.99 -p 6379 ping
Killed

Look at the process list on the Docker node:

# ps -ef | grep defunc
karli     3681  2105  0 09:42 pts/0    00:00:00 [redis-cli] <defunct>
root      3683  2832  0 09:42 pts/1    00:00:00 grep --color=auto defunc
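Note that `grep defunc` also matches the grep command itself, as the last line above shows. A minimal sketch of an alternative check, assuming a Linux host with /proc mounted; the function name `list_zombies` is made up for illustration:

```shell
#!/bin/sh
# Print "PID PPID NAME" for every process currently in zombie (Z) state.
# Reads /proc directly, so the search command never matches itself.
list_zombies() {
  for status in /proc/[0-9]*/status; do
    # A process may exit while we iterate, so read errors are ignored.
    if grep -q '^State:[[:space:]]*Z' "$status" 2>/dev/null; then
      pid=${status#/proc/}
      pid=${pid%/status}
      name=$(sed -n 's/^Name:[[:space:]]*//p' "$status" 2>/dev/null)
      ppid=$(sed -n 's/^PPid:[[:space:]]*//p' "$status" 2>/dev/null)
      printf '%s %s %s\n' "$pid" "$ppid" "$name"
    fi
  done
}

list_zombies
```

The PPID column shows which process is failing to reap its children, which corresponds to the third column of the `ps -ef` output above.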

Expected behavior
No zombie processes on the Docker nodes, even after the Docker Redis service has been running for several days.

Version of Helm and Kubernetes:

Output of helm version:

"branch": "master",
"commit": "f5f5d10c5255216e757f8bec5651aa8a"

Output of kubectl version:

Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.1", GitCommit:"d647ddbd755faf07169599a625faf302ffc34458", GitTreeState:"clean", BuildDate:"2019-10-02T17:01:15Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:07:13Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}

Additional context
After looking at the health check scripts that the Helm chart mounts from a config map at /health and running some tests, I found a solution:
use timeout -s 3 instead of timeout -s 9 in the following scripts:

ping_liveness_local.sh
ping_liveness_master.sh
ping_readiness_local.sh
ping_readiness_master.sh
These scripts are dynamically generated by https://github.com/bitnami/charts/blob/master/bitnami/redis/templates/health-configmap.yaml
With kill signal 3 instead of 9, no zombie processes are left behind.
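A minimal sketch of how the signal could be made configurable, assuming GNU coreutils `timeout`. The variable `PROBE_TIMEOUT_SIGNAL` and the function `run_with_timeout` are hypothetical names for illustration and are not part of the chart:

```shell
#!/bin/sh
# Hypothetical variable: signal that timeout sends when the probe runs too long.
# Defaults to 3 (SIGQUIT), the value reported above to avoid zombies.
PROBE_TIMEOUT_SIGNAL="${PROBE_TIMEOUT_SIGNAL:-3}"

# Run a command with a time limit in seconds.
# Usage: run_with_timeout SECONDS COMMAND [ARGS...]
run_with_timeout() {
  seconds="$1"
  shift
  timeout -s "$PROBE_TIMEOUT_SIGNAL" "$seconds" "$@"
}

# Example probe call (commented out because it needs a Redis server):
# run_with_timeout 1 redis-cli -h localhost -p 6379 ping
```

With GNU `timeout`, a command that exceeds the limit yields exit status 124, so a probe script can still tell a timeout apart from an ordinary failure.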

Because the kill signal for the timeout command is hard-coded, please change it to 3 or replace it with an environment variable for more flexibility.
Thanks


javsalgar · 2020-04-28T09:20:21Z

Hi,

Thank you for the input! We will work on fixing the issue. I will post the PR as soon as I have one.

viceice · 2024-01-17T13:34:08Z

I still get those zombies with the current version; only setting shareProcessNamespace=true seems to solve it. Will investigate.

pascal-hofmann · 2024-11-18T08:34:57Z

See #10002 (comment). The same happens for the redis/valkey containers, except that these do not reap children regularly. As a result, we run out of PIDs in the cgroup and redis/valkey eventually crashes. We were not able to fix this by using a different timeoutSeconds value. I think this can only be fixed by enabling shareProcessNamespace or removing the timeout command from the readiness/liveness probes.
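For reference, `shareProcessNamespace` is a field of the Kubernetes pod spec. A minimal sketch of enabling it with a merge patch, assuming the workload is a StatefulSet; the name `redis-master` and the function `share_pid_patch` are placeholders, and a patch applied this way may be overwritten by the next Helm upgrade:

```shell
#!/bin/sh
# Emit a JSON merge patch that turns on the shared PID namespace
# in the pod template of a workload.
share_pid_patch() {
  printf '%s\n' '{"spec":{"template":{"spec":{"shareProcessNamespace":true}}}}'
}

# Example invocation (commented out because it needs cluster access;
# "redis-master" is a placeholder StatefulSet name):
# kubectl patch statefulset redis-master --type merge -p "$(share_pid_patch)"
```

With a shared process namespace, the pod's infrastructure container runs as PID 1 and reaps orphaned processes, which is why this setting can help when the redis or valkey process does not reap them itself.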

javsalgar mentioned this issue Apr 28, 2020

[bitnami/redis] Fix zombie processes in readiness/liveness check #2453

Merged

4 tasks

carrodher closed this as completed in #2453 May 1, 2020

kabakaev mentioned this issue Aug 30, 2020

[bitnami/redis] prevent zombie PIDs in redis health checks #3559

Merged

4 tasks

viceice mentioned this issue Jan 28, 2021

[redis-cluster] zombie processes caused by health checks #5328

Closed

giskou mentioned this issue Mar 3, 2021

fix(argo-cd): Upgrade redis-ha to v4.10.4 argoproj/argo-helm#608

Merged

5 tasks

carrodher added the redis label Jan 17, 2024
