Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add health check probe for k8s upgrade containers. #15223

Merged
merged 8 commits into from
Jul 11, 2023
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions dockers/docker-config-engine-bullseye/Dockerfile.j2
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,7 @@ RUN pip3 install redis==4.5.4

# Copy files
COPY ["files/swss_vars.j2", "/usr/share/sonic/templates/"]
COPY ["files/readiness_probe.sh", "/usr/bin/"]

## Clean up
RUN apt-get purge -y \
Expand Down
1 change: 1 addition & 0 deletions dockers/docker-config-engine-buster/Dockerfile.j2
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,7 @@ RUN pip3 install redis==4.5.4

# Copy files
COPY ["files/swss_vars.j2", "/usr/share/sonic/templates/"]
COPY ["files/readiness_probe.sh", "/usr/bin/"]

## Clean up
RUN apt-get purge -y \
Expand Down
1 change: 1 addition & 0 deletions rules/docker-config-engine-bullseye.mk
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@ $(DOCKER_CONFIG_ENGINE_BULLSEYE)_LOAD_DOCKERS += $(DOCKER_BASE_BULLSEYE)
$(DOCKER_CONFIG_ENGINE_BULLSEYE)_FILES += $(SWSS_VARS_TEMPLATE)
$(DOCKER_CONFIG_ENGINE_BULLSEYE)_FILES += $(RSYSLOG_PLUGIN_CONF_J2)
$(DOCKER_CONFIG_ENGINE_BULLSEYE)_FILES += $($(SONIC_CTRMGRD)_CONTAINER_SCRIPT)
$(DOCKER_CONFIG_ENGINE_BULLSEYE)_FILES += $($(SONIC_CTRMGRD)_HEALTH_PROBE)

$(DOCKER_CONFIG_ENGINE_BULLSEYE)_DBG_DEPENDS = $($(DOCKER_BASE_BULLSEYE)_DBG_DEPENDS) \
$(LIBSWSSCOMMON_DBG) \
Expand Down
1 change: 1 addition & 0 deletions rules/docker-config-engine-buster.mk
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@ $(DOCKER_CONFIG_ENGINE_BUSTER)_LOAD_DOCKERS += $(DOCKER_BASE_BUSTER)
$(DOCKER_CONFIG_ENGINE_BUSTER)_FILES += $(SWSS_VARS_TEMPLATE)
$(DOCKER_CONFIG_ENGINE_BUSTER)_FILES += $(RSYSLOG_PLUGIN_CONF_J2)
$(DOCKER_CONFIG_ENGINE_BUSTER)_FILES += $($(SONIC_CTRMGRD)_CONTAINER_SCRIPT)
$(DOCKER_CONFIG_ENGINE_BUSTER)_FILES += $($(SONIC_CTRMGRD)_HEALTH_PROBE)

$(DOCKER_CONFIG_ENGINE_BUSTER)_DBG_DEPENDS = $($(DOCKER_BASE_BUSTER)_DBG_DEPENDS) \
$(LIBSWSSCOMMON_DBG) \
Expand Down
4 changes: 4 additions & 0 deletions rules/sonic-ctrmgrd.mk
Original file line number Diff line number Diff line change
Expand Up @@ -20,12 +20,16 @@ $($(SONIC_CTRMGRD)_CFG_JSON)_PATH = $($(SONIC_CTRMGRD)_FILES_PATH)
$(SONIC_CTRMGRD)_SERVICE = ctrmgrd.service
$($(SONIC_CTRMGRD)_SERVICE)_PATH = $($(SONIC_CTRMGRD)_FILES_PATH)

$(SONIC_CTRMGRD)_HEALTH_PROBE = readiness_probe.sh
$($(SONIC_CTRMGRD)_HEALTH_PROBE)_PATH = $($(SONIC_CTRMGRD)_FILES_PATH)

SONIC_PYTHON_WHEELS += $(SONIC_CTRMGRD)

$(SONIC_CTRMGRD)_FILES = $($(SONIC_CTRMGRD)_CONTAINER_SCRIPT)
$(SONIC_CTRMGRD)_FILES += $($(SONIC_CTRMGRD)_STARTUP_SCRIPT)
$(SONIC_CTRMGRD)_FILES += $($(SONIC_CTRMGRD)_CFG_JSON)
$(SONIC_CTRMGRD)_FILES += $($(SONIC_CTRMGRD)_SERVICE)
$(SONIC_CTRMGRD)_FILES += $($(SONIC_CTRMGRD)_HEALTH_PROBE)

SONIC_COPY_FILES += $($(SONIC_CTRMGRD)_FILES)

34 changes: 34 additions & 0 deletions src/sonic-ctrmgrd/ctrmgr/readiness_probe.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,34 @@
#!/bin/bash
# This script is used by k8s to check the readiness of containers
# Check if the container is readiness or not, exit code 0 means readiness, others mean not readiness

#### exit code contract, k8s only cares zero or not none-zero, but we want to use none-zero code to indicate different error
# 0: readiness
# 1: python script crach exit code
Copy link
Collaborator

@qiluo-msft qiluo-msft Jun 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

crach

typo #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

# 2: supervisor start service doesn't exit normally
# other exit code: returned by post_check_script, define in the post_check_script, should not include 1,2

# check if the start service exists
# if the start service exists, check if it exits normally
# if the start service doesn't exist normally, exit with code 2
pre_check_service_name="start"
Copy link
Collaborator

@qiluo-msft qiluo-msft Jun 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

start

Will you check all the critical processes? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The critical processes unexpected event will be handled by the supervisord exit-listener for now, the listener will kill the container, I don't think we need to check them here. Is this correct?

supervisorctl status |awk '{print $1}' |grep -w $pre_check_service_name > /dev/null
Copy link
Collaborator

@qiluo-msft qiluo-msft Jun 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

supervisorctl status

You can use one command

supervisorctl status start
``` #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only do "supervisorctl status start", We can't do judgement by exit code, because start not existing and some failed state exit codes are the same. If only do "supervisorctl status start", need to judge by the outputs "start: ERROR (no such process)", "start EXITED Jun 21 05:28 PM". I do checking whether start exists in advance, I think code logic is more easy to understand here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an example

root@sonic:/# supervisorctl status start
start                            EXITED    Jul 04 12:38 AM
root@sonic:/# supervisorctl status
dependent-startup                EXITED    Jul 04 12:38 AM
lldp-syncd                       RUNNING   pid 26, uptime 0:03:54
lldpd                            RUNNING   pid 20, uptime 0:03:57
lldpmgrd                         RUNNING   pid 30, uptime 0:03:52
rsyslogd                         RUNNING   pid 11, uptime 0:04:02
start                            EXITED    Jul 04 12:38 AM
supervisor-proc-exit-listener    RUNNING   pid 10, uptime 0:04:04
waitfor_lldp_ready               EXITED    Jul 04 12:38 AM

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

start_check_result=$?
if [ $start_check_result = 0 ] && [ $(supervisorctl status $pre_check_service_name |awk '{print $2}') != 'EXITED' ]; then
exit 2
fi

# feature owner can add their own readiness check script
# check if the post_check_script exists
# if the post_check_script exists, run it
# if the post_check_script exits with non-zero code, exit with the code
post_check_script="/usr/bin/readiness_probe.py"
Copy link
Collaborator

@qiluo-msft qiluo-msft Jun 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/usr/bin/readiness_probe.py

Do not assume python3.
How about /usr/bin/readiness_probe_hook.

if [ -x $post_check_script ]; then
    $post_check_script

#Closed

Copy link
Contributor Author

@lixiaoyuner lixiaoyuner Jun 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

if [ -f $post_check_script ]; then
python3 $post_check_script
post_check_result=$?
if [ $post_check_result != 0 ]; then
exit $post_check_result
fi
fi

exit 0