
[healthd] fix healthd shutdown race #19504

Merged
merged 2 commits into sonic-net:master
Dec 2, 2024

Conversation

stepanblyschak
Collaborator

@stepanblyschak stepanblyschak commented Jul 8, 2024

Why I did it

To fix errors that happen when reading from or writing to the queue:

Jun  5 23:04:41.798613 r-leopard-56 NOTICE healthd: Caught SIGTERM - exiting...
Jun  5 23:04:41.798985 r-leopard-56 NOTICE healthd: Caught SIGTERM - exiting...
Jun  5 23:04:41.799535 r-leopard-56 NOTICE healthd: Caught SIGTERM - exiting...
Jun  5 23:04:41.806010 r-leopard-56 NOTICE healthd: Caught SIGTERM - exiting...
Jun  5 23:04:41.814075 r-leopard-56 ERR healthd: system_service[Errno 104] Connection reset by peer
Jun  5 23:04:41.824135 r-leopard-56 ERR healthd: Traceback (most recent call last):#012  File "/usr/local/lib/python3.9/dist-packages/health_checker/sysmonitor.py", line 484, in system_service#012    msg = self.myQ.get(timeout=QUEUE_TIMEOUT)#012  File "<string>", line 2, in get#012  File "/usr/lib/python3.9/multiprocessing/managers.py", line 809, in _callmethod#012    kind, result = conn.recv()#012  File "/usr/lib/python3.9/multiprocessing/connection.py", line 255, in recv#012    buf = self._recv_bytes()#012  File "/usr/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes#012    buf = self._recv(4)#012  File "/usr/lib/python3.9/multiprocessing/connection.py", line 384, in _recv#012    chunk = read(handle, remaining)#012ConnectionResetError: [Errno 104] Connection reset by peer
Jun  5 23:04:41.826489 r-leopard-56 INFO healthd[8494]: ERROR:dbus.connection:Exception in handler for D-Bus signal:
Jun  5 23:04:41.826591 r-leopard-56 INFO healthd[8494]: Traceback (most recent call last):
Jun  5 23:04:41.826640 r-leopard-56 INFO healthd[8494]:   File "/usr/lib/python3/dist-packages/dbus/connection.py", line 232, in maybe_handle_message
Jun  5 23:04:41.826686 r-leopard-56 INFO healthd[8494]:     self._handler(*args, **kwargs)
Jun  5 23:04:41.826738 r-leopard-56 INFO healthd[8494]:   File "/usr/local/lib/python3.9/dist-packages/health_checker/sysmonitor.py", line 82, in on_job_removed
Jun  5 23:04:41.826785 r-leopard-56 INFO healthd[8494]:     self.task_notify(msg)
Jun  5 23:04:41.826831 r-leopard-56 INFO healthd[8494]:   File "/usr/local/lib/python3.9/dist-packages/health_checker/sysmonitor.py", line 110, in task_notify
Jun  5 23:04:41.826877 r-leopard-56 INFO healthd[8494]:     self.task_queue.put(msg)
Jun  5 23:04:41.826923 r-leopard-56 INFO healthd[8494]:   File "<string>", line 2, in put
Jun  5 23:04:41.826973 r-leopard-56 INFO healthd[8494]:   File "/usr/lib/python3.9/multiprocessing/managers.py", line 808, in _callmethod
Jun  5 23:04:41.827018 r-leopard-56 INFO healthd[8494]:     conn.send((self._id, methodname, args, kwds))
Jun  5 23:04:41.827065 r-leopard-56 INFO healthd[8494]:   File "/usr/lib/python3.9/multiprocessing/connection.py", line 211, in send
Jun  5 23:04:41.827115 r-leopard-56 INFO healthd[8494]:     self._send_bytes(_ForkingPickler.dumps(obj))
Jun  5 23:04:41.827158 r-leopard-56 INFO healthd[8494]:   File "/usr/lib/python3.9/multiprocessing/connection.py", line 416, in _send_bytes
Jun  5 23:04:41.827199 r-leopard-56 INFO healthd[8494]:     self._send(header + buf)
Jun  5 23:04:41.827254 r-leopard-56 INFO healthd[8494]:   File "/usr/lib/python3.9/multiprocessing/connection.py", line 373, in _send
Jun  5 23:04:41.827322 r-leopard-56 INFO healthd[8494]:     n = write(self._handle, buf)
Jun  5 23:04:41.827368 r-leopard-56 INFO healthd[8494]: BrokenPipeError: [Errno 32] Broken pipe
Jun  5 23:04:42.800216 r-leopard-56 NOTICE healthd: Caught SIGTERM - exiting...

When the multiprocessing.Manager is shut down, the queue raises the errors above. This happens during shutdown, for example on fast-reboot and warm-reboot.
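
As a minimal repro sketch of the race (illustrative code, not the healthd sources): a child process that is still blocked on a Manager queue when the parent calls shutdown() hits exactly these ConnectionResetError / BrokenPipeError tracebacks.

    import multiprocessing
    import time

    def worker(q):
        try:
            # The child blocks waiting for a message, like sysmonitor's system_service loop.
            q.get(timeout=15)
        except (ConnectionResetError, BrokenPipeError, EOFError) as exc:
            print("queue error after manager shutdown:", exc)

    if __name__ == "__main__":
        mgr = multiprocessing.Manager()
        q = mgr.Queue()
        child = multiprocessing.Process(target=worker, args=(q,))
        child.start()
        time.sleep(1)    # let the child block inside q.get()
        mgr.shutdown()   # the manager process dies while the child still uses the queue
        child.join()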

With the fix, the system-health service does not hang:

root@sonic:/home/admin# sudo systemctl start system-health ; sleep 10; echo "$(date): Stopping..."; sudo systemctl stop system-health; echo "$(date): Stopped"
Thu Oct 17 01:07:56 PM IDT 2024: Stopping...
Thu Oct 17 01:07:58 PM IDT 2024: Stopped
root@sonic:/home/admin# sudo systemctl start system-health ; sleep 10; echo "$(date): Stopping..."; sudo systemctl stop system-health; echo "$(date): Stopped"
Thu Oct 17 01:08:13 PM IDT 2024: Stopping...
Thu Oct 17 01:08:14 PM IDT 2024: Stopped
root@sonic:/home/admin# sudo systemctl start system-health ; sleep 10; echo "$(date): Stopping..."; sudo systemctl stop system-health; echo "$(date): Stopped"
Thu Oct 17 01:09:05 PM IDT 2024: Stopping...
Thu Oct 17 01:09:06 PM IDT 2024: Stopped
Work item tracking
  • Microsoft ADO (number only):

How I did it

Remove the call to shutdown(); cleanup happens automatically when the Manager is garbage collected, as per the documentation: https://docs.python.org/3/library/multiprocessing.html
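
A minimal sketch of the resulting behaviour, reusing the mpmgr and myQ attribute names from sysmonitor.py (illustrative, not the actual patch):

    import multiprocessing

    class SysmonitorSketch:
        def __init__(self):
            self.mpmgr = multiprocessing.Manager()
            self.myQ = self.mpmgr.Queue()

        def task_stop(self):
            # Before the fix: self.mpmgr.shutdown() was called here, killing the
            # manager process while child processes could still be blocked on myQ.
            # After the fix: no explicit shutdown; per the documentation linked
            # above, the manager process is shut down as soon as the Manager is
            # garbage collected or the parent process exits.
            pass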

How to verify it

Run warm-reboot, fast-reboot multiple times and verify no errors in the log.

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202311
  • 202405

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
@stepanblyschak stepanblyschak requested a review from lguohan as a code owner July 8, 2024 07:41
@liat-grozovik
Collaborator

@Junchao-Mellanox could you please help to review?
@stepanblyschak please follow up on the PR checker failure

@liat-grozovik
Collaborator

@lguohan @yxieca please help to review, and let's make sure to take this to 202311 and 202405

@stepanblyschak
Collaborator Author

/azpw run Azure.sonic-buildimage

@mssonicbld
Collaborator

/AzurePipelines run Azure.sonic-buildimage


Azure Pipelines successfully started running 1 pipeline(s).

@vivekrnv
Contributor

This change is causing some other side effects.

During the shutdown flow, system-health takes > 90 sec to shut down.

root@sonic:/home/admin# journalctl -f -u system-health
Aug 28 19:56:20 r-bobcat-01 healthd[14113]: checked_unit_status for [ multi-user.target ]
Aug 28 19:56:20 r-bobcat-01 healthd[14113]: checked_unit_status for [ getty.target ]
Aug 28 19:56:20 r-bobcat-01 systemd[1]: Stopping system-health.service - SONiC system health monitor...
Aug 28 19:56:20 r-bobcat-01 healthd[14113]: Caught SIGTERM - exiting...
Aug 28 19:56:20 r-bobcat-01 healthd[13967]: Caught SIGTERM - exiting...
Aug 28 19:56:20 r-bobcat-01 healthd[14104]: Caught SIGTERM - exiting...
Aug 28 19:56:21 r-bobcat-01 healthd[14123]: Caught SIGTERM - exiting...
Aug 28 19:56:31 r-bobcat-01 healthd[13967]: Attempting to kill sysmon main process with pid 14113
Aug 28 19:56:31 r-bobcat-01 healthd[13967]: Sysmon main process with pid 14113 could not be killed
Aug 28 19:56:32 r-bobcat-01 healthd[14123]: Caught SIGTERM - exiting...
Aug 28 19:58:01 r-bobcat-01 systemd[1]: system-health.service: State 'final-sigterm' timed out. Killing.
Aug 28 19:58:01 r-bobcat-01 systemd[1]: system-health.service: Killing process 14123 (healthd) with signal SIGKILL.
Aug 28 19:58:01 r-bobcat-01 systemd[1]: system-health.service: Failed with result 'timeout'.
Aug 28 19:58:01 r-bobcat-01 systemd[1]: Stopped system-health.service - SONiC system health monitor.

I think that, because of the shutdown call removal, msg = self.myQ.get(timeout=QUEUE_TIMEOUT) blocks for up to 15 sec. Within 10 sec after the SIGTERM, the SIGKILL is issued (

os.kill(self._task_process.pid, signal.SIGKILL)

), so the sysmon process gets terminated without ever issuing task_stop for its child processes, causing them to block the stop flow for 90 sec until systemd finally issues SIGKILL.

Please note that this is not seen every time; when the queue has items during this window, the get call returns and shutdown continues without any problem. The probability of reproduction is about 25%.
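
A rough sketch of this timing problem, using the constants mentioned above (illustrative code, not the healthd sources):

    import queue

    QUEUE_TIMEOUT = 15   # the worker may block in get() for this long
    STOP_TIMEOUT = 10    # roughly how long the stop path waits before escalating to SIGKILL

    def worker_loop(my_q, stopping_event):
        while not stopping_event.is_set():
            try:
                # An empty queue holds the worker here for the full QUEUE_TIMEOUT;
                # the stopping event is only re-checked after get() returns.
                msg = my_q.get(timeout=QUEUE_TIMEOUT)
            except queue.Empty:
                continue
            # ... handle msg ...

    # If the stop path sets stopping_event and waits only STOP_TIMEOUT before sending
    # SIGKILL, the worker can still be stuck inside get(); it then dies without ever
    # calling task_stop() for its children, and systemd later force-kills the service.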

liat-grozovik previously approved these changes Sep 3, 2024
@liat-grozovik
Collaborator

/azp run Azure.sonic-buildimage


Azure Pipelines successfully started running 1 pipeline(s).

@vivekrnv vivekrnv closed this Sep 3, 2024
@vivekrnv vivekrnv reopened this Sep 3, 2024
@prgeor
Contributor

prgeor commented Sep 3, 2024

@stepanblyschak please address the above issue from Vivek

@prgeor
Contributor

prgeor commented Oct 15, 2024

@stepanblyschak PR still in draft?

@prgeor
Contributor

prgeor commented Oct 15, 2024

@stepanblyschak fix build errors?

prgeor previously approved these changes Oct 15, 2024
@stepanblyschak
Collaborator Author

@prgeor Please see the bug report by @vivekrnv. This PR introduces another problem, so the fix must handle both: the error that is logged during shutdown and the occasional hang of healthd during shutdown.

@stepanblyschak
Collaborator Author

stepanblyschak commented Oct 16, 2024

I investigated the issue a bit more, and there are a few problems:

  1. Shutting down the queue causes errors (see the PR description) in the other sub-processes:
        #Clear the resources of mpmgr- Queue
        self.mpmgr.shutdown()
  2. The shutdown logic waits at most QUEUE_TIMEOUT = 15 sec before it is able to exit the task loop:
        while not self.task_stopping_event.is_set():
            try:
                msg = self.myQ.get(timeout=QUEUE_TIMEOUT)
                event = msg["unit"]

The queue self.myQ blocks for QUEUE_TIMEOUT, so we cannot detect self.task_stopping_event in time.

  3. Joining the process exits due to the timeout, since QUEUE_TIMEOUT > self._stop_timeout_secs:
        # Wait for the process to exit
        self._task_process.join(self._stop_timeout_secs)
  4. When attempting to kill the process:
        # If the process didn't exit, attempt to kill it
        if self._task_process.is_alive():
            logger.log_notice("Attempting to kill sysmon main process with pid {}".format(self._task_process.pid))
            os.kill(self._task_process.pid, signal.SIGKILL)

        if self._task_process.is_alive():
            logger.log_error("Sysmon main process with pid {} could not be killed".format(self._task_process.pid))
            return False

os.kill leaves a zombie process, since the child has not been waited on yet, so is_alive() then returns True. The fix could be to use self._task_process.kill() instead (see the sketch after this list).

  5. The system bus monitor task could not be stopped:
        monitor_system_bus.task_stop()

since it is blocked in MonitorSystemBusTask in loop.run():

        loop = GLib.MainLoop()
        loop.run()
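
A hedged sketch of the handling suggested for points 2 and 4 (the helper name and the sentinel message are illustrative, not the actual healthd code):

    def stop_task_process(task_process, task_queue, stopping_event, stop_timeout_secs):
        # task_process is a multiprocessing.Process, task_queue a Manager queue.
        stopping_event.set()
        task_queue.put({"unit": None})        # sentinel so a blocked get() returns right away
        task_process.join(stop_timeout_secs)  # give the task loop a chance to exit cleanly
        if task_process.is_alive():
            # Process.kill() sends SIGKILL; the join() right after it reaps the
            # child, so is_alive() reflects the real state instead of a zombie.
            task_process.kill()
            task_process.join()
        return not task_process.is_alive()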

Command line to reproduce:

sudo systemctl start system-health ; sleep 10; echo "$(date): Stopping..."; sudo systemctl stop system-health; echo "$(date): Stopped"

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
@stepanblyschak stepanblyschak marked this pull request as ready for review October 17, 2024 10:14
@stepanblyschak
Collaborator Author

@prgeor Please review

@stepanblyschak
Collaborator Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run


Azure Pipelines successfully started running 1 pipeline(s).

@stepanblyschak
Collaborator Author

Seemingly unrelated tests failed, restarting:

FAIL: pceplib/test/pcep_msg_tests_valgrind.sh
FAIL: pceplib/test/pcep_pcc_api_tests_valgrind.sh
FAIL: pceplib/test/pcep_session_logic_tests_valgrind.sh
FAIL: pceplib/test/pcep_socket_comm_tests_valgrind.sh
FAIL: pceplib/test/pcep_timers_tests_valgrind.sh
FAIL: pceplib/test/pcep_utils_tests_valgrind.sh

@stepanblyschak
Collaborator Author

/azpw run Azure.sonic-buildimage

@mssonicbld
Collaborator

/AzurePipelines run Azure.sonic-buildimage


Azure Pipelines successfully started running 1 pipeline(s).

@stepanblyschak
Collaborator Author

Broadcom build failed, restarting:

Step 22/32 : COPY ["files/dsserve", "/usr/bin/"]
COPY failed: file not found in build context or excluded by .dockerignore: stat files/dsserve: file does not exist
[  FAIL LOG END  ] [ target/docker-gbsyncd-broncos.gz ]

@stepanblyschak
Collaborator Author

/azpw run Azure.sonic-buildimage

@mssonicbld
Collaborator

/AzurePipelines run Azure.sonic-buildimage


Azure Pipelines successfully started running 1 pipeline(s).

@stepanblyschak
Collaborator Author

/azpw run Azure.sonic-buildimage

@mssonicbld
Collaborator

/AzurePipelines run Azure.sonic-buildimage


Azure Pipelines successfully started running 1 pipeline(s).

@stepanblyschak
Collaborator Author

Seems like a network issue:

curl: (22) The requested URL returned error: 404
Try 5: /usr/bin/curl --retry-connrefused --retry 5 failed to get: -f
dget: curl ipmitool_1.8.19-4+deb12u1.debian.tar.xz https://deb.debian.org/debian/pool/main/i/ipmitool/ipmitool_1.8.19-4+deb12u1.debian.tar.xz failed
ipmitool_1.8.19-4+deb12u1.dsc:
      Good signature found
   validating ipmitool_1.8.19.orig.tar.gz
   skipping  ipmitool_1.8.19-4+deb12u1.debian.tar.xz (not present)
All files validated successfully.
dpkg-source: error: cannot fstat file ./ipmitool_1.8.19-4+deb12u1.debian.tar.xz: No such file or directory
make[1]: *** [Makefile:9: /sonic/target/debs/bookworm/ipmitool_1.8.19-4+deb12u1_amd64.deb] Error 25

@stepanblyschak
Collaborator Author

/azpw run Azure.sonic-buildimage

@mssonicbld
Collaborator

/AzurePipelines run Azure.sonic-buildimage


Azure Pipelines successfully started running 1 pipeline(s).

@qiluo-msft qiluo-msft merged commit 56c07d2 into sonic-net:master Dec 2, 2024
22 checks passed