
[healthd] fix healthd shutdown race #19504

Merged
merged 2 commits into sonic-net:master
Dec 2, 2024

Conversation

stepanblyschak
Collaborator

@stepanblyschak stepanblyschak commented Jul 8, 2024

Why I did it

To fix errors that happen when reading from or writing to the queue:

Jun  5 23:04:41.798613 r-leopard-56 NOTICE healthd: Caught SIGTERM - exiting...
Jun  5 23:04:41.798985 r-leopard-56 NOTICE healthd: Caught SIGTERM - exiting...
Jun  5 23:04:41.799535 r-leopard-56 NOTICE healthd: Caught SIGTERM - exiting...
Jun  5 23:04:41.806010 r-leopard-56 NOTICE healthd: Caught SIGTERM - exiting...
Jun  5 23:04:41.814075 r-leopard-56 ERR healthd: system_service[Errno 104] Connection reset by peer
Jun  5 23:04:41.824135 r-leopard-56 ERR healthd: Traceback (most recent call last):#012  File "/usr/local/lib/python3.9/dist-packages/health_checker/sysmonitor.py", line 484, in system_service#012    msg = self.myQ.get(timeout=QUEUE_TIMEOUT)#012  File "<string>", line 2, in get#012  File "/usr/lib/python3.9/multiprocessing/managers.py", line 809, in _callmethod#012    kind, result = conn.recv()#012  File "/usr/lib/python3.9/multiprocessing/connection.py", line 255, in recv#012    buf = self._recv_bytes()#012  File "/usr/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes#012    buf = self._recv(4)#012  File "/usr/lib/python3.9/multiprocessing/connection.py", line 384, in _recv#012    chunk = read(handle, remaining)#012ConnectionResetError: [Errno 104] Connection reset by peer
Jun  5 23:04:41.826489 r-leopard-56 INFO healthd[8494]: ERROR:dbus.connection:Exception in handler for D-Bus signal:
Jun  5 23:04:41.826591 r-leopard-56 INFO healthd[8494]: Traceback (most recent call last):
Jun  5 23:04:41.826640 r-leopard-56 INFO healthd[8494]:   File "/usr/lib/python3/dist-packages/dbus/connection.py", line 232, in maybe_handle_message
Jun  5 23:04:41.826686 r-leopard-56 INFO healthd[8494]:     self._handler(*args, **kwargs)
Jun  5 23:04:41.826738 r-leopard-56 INFO healthd[8494]:   File "/usr/local/lib/python3.9/dist-packages/health_checker/sysmonitor.py", line 82, in on_job_removed
Jun  5 23:04:41.826785 r-leopard-56 INFO healthd[8494]:     self.task_notify(msg)
Jun  5 23:04:41.826831 r-leopard-56 INFO healthd[8494]:   File "/usr/local/lib/python3.9/dist-packages/health_checker/sysmonitor.py", line 110, in task_notify
Jun  5 23:04:41.826877 r-leopard-56 INFO healthd[8494]:     self.task_queue.put(msg)
Jun  5 23:04:41.826923 r-leopard-56 INFO healthd[8494]:   File "<string>", line 2, in put
Jun  5 23:04:41.826973 r-leopard-56 INFO healthd[8494]:   File "/usr/lib/python3.9/multiprocessing/managers.py", line 808, in _callmethod
Jun  5 23:04:41.827018 r-leopard-56 INFO healthd[8494]:     conn.send((self._id, methodname, args, kwds))
Jun  5 23:04:41.827065 r-leopard-56 INFO healthd[8494]:   File "/usr/lib/python3.9/multiprocessing/connection.py", line 211, in send
Jun  5 23:04:41.827115 r-leopard-56 INFO healthd[8494]:     self._send_bytes(_ForkingPickler.dumps(obj))
Jun  5 23:04:41.827158 r-leopard-56 INFO healthd[8494]:   File "/usr/lib/python3.9/multiprocessing/connection.py", line 416, in _send_bytes
Jun  5 23:04:41.827199 r-leopard-56 INFO healthd[8494]:     self._send(header + buf)
Jun  5 23:04:41.827254 r-leopard-56 INFO healthd[8494]:   File "/usr/lib/python3.9/multiprocessing/connection.py", line 373, in _send
Jun  5 23:04:41.827322 r-leopard-56 INFO healthd[8494]:     n = write(self._handle, buf)
Jun  5 23:04:41.827368 r-leopard-56 INFO healthd[8494]: BrokenPipeError: [Errno 32] Broken pipe
Jun  5 23:04:42.800216 r-leopard-56 NOTICE healthd: Caught SIGTERM - exiting...

When the multiprocessing.Manager is shut down, the queue raises the errors above. This happens during shutdown, for example on fast-reboot and warm-reboot.
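
As a minimal repro sketch of the race (illustrative code, not the healthd sources): a child process that is still blocked on a Manager queue when the parent calls shutdown() hits exactly these ConnectionResetError / BrokenPipeError tracebacks.

    import multiprocessing
    import time

    def worker(q):
        try:
            # The child blocks waiting for a message, like sysmonitor's system_service loop.
            q.get(timeout=15)
        except (ConnectionResetError, BrokenPipeError, EOFError) as exc:
            print("queue error after manager shutdown:", exc)

    if __name__ == "__main__":
        mgr = multiprocessing.Manager()
        q = mgr.Queue()
        child = multiprocessing.Process(target=worker, args=(q,))
        child.start()
        time.sleep(1)    # let the child block inside q.get()
        mgr.shutdown()   # the manager process dies while the child still uses the queue
        child.join()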

With the fix, the system-health service does not hang:

root@sonic:/home/admin# sudo systemctl start system-health ; sleep 10; echo "$(date): Stopping..."; sudo systemctl stop system-health; echo "$(date): Stopped"
Thu Oct 17 01:07:56 PM IDT 2024: Stopping...
Thu Oct 17 01:07:58 PM IDT 2024: Stopped
root@sonic:/home/admin# sudo systemctl start system-health ; sleep 10; echo "$(date): Stopping..."; sudo systemctl stop system-health; echo "$(date): Stopped"
Thu Oct 17 01:08:13 PM IDT 2024: Stopping...
Thu Oct 17 01:08:14 PM IDT 2024: Stopped
root@sonic:/home/admin# sudo systemctl start system-health ; sleep 10; echo "$(date): Stopping..."; sudo systemctl stop system-health; echo "$(date): Stopped"
Thu Oct 17 01:09:05 PM IDT 2024: Stopping...
Thu Oct 17 01:09:06 PM IDT 2024: Stopped
Work item tracking
  • Microsoft ADO (number only):

How I did it

Remove the call to shutdown(); cleanup happens automatically when the Manager is garbage collected, as per the documentation: https://docs.python.org/3/library/multiprocessing.html
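
A minimal sketch of the resulting behaviour, reusing the mpmgr and myQ attribute names from sysmonitor.py (illustrative, not the actual patch):

    import multiprocessing

    class SysmonitorSketch:
        def __init__(self):
            self.mpmgr = multiprocessing.Manager()
            self.myQ = self.mpmgr.Queue()

        def task_stop(self):
            # Before the fix: self.mpmgr.shutdown() was called here, killing the
            # manager process while child processes could still be blocked on myQ.
            # After the fix: no explicit shutdown; per the documentation linked
            # above, the manager process is shut down as soon as the Manager is
            # garbage collected or the parent process exits.
            pass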

How to verify it

Run warm-reboot, fast-reboot multiple times and verify no errors in the log.

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202311
  • 202405

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
@stepanblyschak stepanblyschak requested a review from lguohan as a code owner July 8, 2024 07:41
@liat-grozovik
Collaborator

@Junchao-Mellanox could you please help to review?
@stepanblyschak please follow up on the PR checker failure

@liat-grozovik
Collaborator

@lguohan @yxieca please help to review, and let's make sure to take this to 202311 and 202405

@stepanblyschak
Collaborator Author

/azpw run Azure.sonic-buildimage

@mssonicbld
Collaborator

/AzurePipelines run Azure.sonic-buildimage


Azure Pipelines successfully started running 1 pipeline(s).

@vivekrnv
Contributor

This change is causing some other side effects.

During the shutdown flow, system-health takes > 90 sec to shut down.

root@sonic:/home/admin# journalctl -f -u system-health
Aug 28 19:56:20 r-bobcat-01 healthd[14113]: checked_unit_status for [ multi-user.target ]
Aug 28 19:56:20 r-bobcat-01 healthd[14113]: checked_unit_status for [ getty.target ]
Aug 28 19:56:20 r-bobcat-01 systemd[1]: Stopping system-health.service - SONiC system health monitor...
Aug 28 19:56:20 r-bobcat-01 healthd[14113]: Caught SIGTERM - exiting...
Aug 28 19:56:20 r-bobcat-01 healthd[13967]: Caught SIGTERM - exiting...
Aug 28 19:56:20 r-bobcat-01 healthd[14104]: Caught SIGTERM - exiting...
Aug 28 19:56:21 r-bobcat-01 healthd[14123]: Caught SIGTERM - exiting...
Aug 28 19:56:31 r-bobcat-01 healthd[13967]: Attempting to kill sysmon main process with pid 14113
Aug 28 19:56:31 r-bobcat-01 healthd[13967]: Sysmon main process with pid 14113 could not be killed
Aug 28 19:56:32 r-bobcat-01 healthd[14123]: Caught SIGTERM - exiting...
Aug 28 19:58:01 r-bobcat-01 systemd[1]: system-health.service: State 'final-sigterm' timed out. Killing.
Aug 28 19:58:01 r-bobcat-01 systemd[1]: system-health.service: Killing process 14123 (healthd) with signal SIGKILL.
Aug 28 19:58:01 r-bobcat-01 systemd[1]: system-health.service: Failed with result 'timeout'.
Aug 28 19:58:01 r-bobcat-01 systemd[1]: Stopped system-health.service - SONiC system health monitor.

I think that, because of the shutdown call removal, msg = self.myQ.get(timeout=QUEUE_TIMEOUT) blocks for up to 15 sec. Within 10 sec after the SIGTERM, the SIGKILL is issued (

os.kill(self._task_process.pid, signal.SIGKILL)

), so the sysmon process gets terminated without ever issuing task_stop for its child processes, causing them to block the stop flow for 90 sec until systemd finally issues SIGKILL.

Please note that this is not seen every time; when the queue has items during this window, the get call returns and shutdown continues without any problem. The probability of reproduction is about 25%.
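
A rough sketch of this timing problem, using the constants mentioned above (illustrative code, not the healthd sources):

    import queue

    QUEUE_TIMEOUT = 15   # the worker may block in get() for this long
    STOP_TIMEOUT = 10    # roughly how long the stop path waits before escalating to SIGKILL

    def worker_loop(my_q, stopping_event):
        while not stopping_event.is_set():
            try:
                # An empty queue holds the worker here for the full QUEUE_TIMEOUT;
                # the stopping event is only re-checked after get() returns.
                msg = my_q.get(timeout=QUEUE_TIMEOUT)
            except queue.Empty:
                continue
            # ... handle msg ...

    # If the stop path sets stopping_event and waits only STOP_TIMEOUT before sending
    # SIGKILL, the worker can still be stuck inside get(); it then dies without ever
    # calling task_stop() for its children, and systemd later force-kills the service.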

liat-grozovik previously approved these changes Sep 3, 2024
@liat-grozovik
Collaborator

/azp run Azure.sonic-buildimage


Azure Pipelines successfully started running 1 pipeline(s).

@vivekrnv vivekrnv closed this Sep 3, 2024
@vivekrnv vivekrnv reopened this Sep 3, 2024
@prgeor
Contributor

prgeor commented Sep 3, 2024

@stepanblyschak please address the above issue from Vivek

@prgeor
Contributor

prgeor commented Oct 15, 2024

@stepanblyschak PR still in draft?

@prgeor
Contributor

prgeor commented Oct 15, 2024

@stepanblyschak fix build errors?

prgeor previously approved these changes Oct 15, 2024
@stepanblyschak
Collaborator Author

@prgeor Please see the bug report by @vivekrnv. This PR introduces another problem, so the fix must handle both: the error that is logged during shutdown and the occasional hang of healthd during shutdown.

@stepanblyschak
Collaborator Author

stepanblyschak commented Oct 16, 2024

I investigated the issue a bit more, and there are a few problems:

  1. Shutting down the queue causes errors (see the PR description) in the other sub-processes:
        #Clear the resources of mpmgr- Queue
        self.mpmgr.shutdown()
  2. The shutdown logic waits at most QUEUE_TIMEOUT = 15 sec before it is able to exit the task loop:
        while not self.task_stopping_event.is_set():
            try:
                msg = self.myQ.get(timeout=QUEUE_TIMEOUT)
                event = msg["unit"]

The queue self.myQ blocks for QUEUE_TIMEOUT, so we cannot detect self.task_stopping_event in time.

  3. Joining the process exits due to the timeout, since QUEUE_TIMEOUT > self._stop_timeout_secs:
        # Wait for the process to exit
        self._task_process.join(self._stop_timeout_secs)
  4. When attempting to kill the process:
        # If the process didn't exit, attempt to kill it
        if self._task_process.is_alive():
            logger.log_notice("Attempting to kill sysmon main process with pid {}".format(self._task_process.pid))
            os.kill(self._task_process.pid, signal.SIGKILL)

        if self._task_process.is_alive():
            logger.log_error("Sysmon main process with pid {} could not be killed".format(self._task_process.pid))
            return False

os.kill leaves a zombie process, since the child has not been waited on yet, so is_alive() then returns True. The fix could be to use self._task_process.kill() instead (see the sketch after this list).

  5. The system bus monitor task could not be stopped:
        monitor_system_bus.task_stop()

since it is blocked in MonitorSystemBusTask in loop.run():

        loop = GLib.MainLoop()
        loop.run()
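
A hedged sketch of the handling suggested for points 2 and 4 (the helper name and the sentinel message are illustrative, not the actual healthd code):

    def stop_task_process(task_process, task_queue, stopping_event, stop_timeout_secs):
        # task_process is a multiprocessing.Process, task_queue a Manager queue.
        stopping_event.set()
        task_queue.put({"unit": None})        # sentinel so a blocked get() returns right away
        task_process.join(stop_timeout_secs)  # give the task loop a chance to exit cleanly
        if task_process.is_alive():
            # Process.kill() sends SIGKILL; the join() right after it reaps the
            # child, so is_alive() reflects the real state instead of a zombie.
            task_process.kill()
            task_process.join()
        return not task_process.is_alive()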

Command line to reproduce:

sudo systemctl start system-health ; sleep 10; echo "$(date): Stopping..."; sudo systemctl stop system-health; echo "$(date): Stopped"

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
@stepanblyschak stepanblyschak marked this pull request as ready for review October 17, 2024 10:14
@stepanblyschak
Collaborator Author

@prgeor Please review

@stepanblyschak
Collaborator Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run


Azure Pipelines successfully started running 1 pipeline(s).

@stepanblyschak
Collaborator Author

Seemingly unrelated tests failed, restarting:

FAIL: pceplib/test/pcep_msg_tests_valgrind.sh
FAIL: pceplib/test/pcep_pcc_api_tests_valgrind.sh
FAIL: pceplib/test/pcep_session_logic_tests_valgrind.sh
FAIL: pceplib/test/pcep_socket_comm_tests_valgrind.sh
FAIL: pceplib/test/pcep_timers_tests_valgrind.sh
FAIL: pceplib/test/pcep_utils_tests_valgrind.sh

@stepanblyschak
Collaborator Author

/azpw run Azure.sonic-buildimage

@mssonicbld
Collaborator

/AzurePipelines run Azure.sonic-buildimage


Azure Pipelines successfully started running 1 pipeline(s).

@stepanblyschak
Collaborator Author

Broadcom build failed, restarting:

Step 22/32 : COPY ["files/dsserve", "/usr/bin/"]
COPY failed: file not found in build context or excluded by .dockerignore: stat files/dsserve: file does not exist
[  FAIL LOG END  ] [ target/docker-gbsyncd-broncos.gz ]

@stepanblyschak
Collaborator Author

/azpw run Azure.sonic-buildimage

@mssonicbld
Collaborator

/AzurePipelines run Azure.sonic-buildimage


Azure Pipelines successfully started running 1 pipeline(s).

@stepanblyschak
Collaborator Author

/azpw run Azure.sonic-buildimage

@mssonicbld
Collaborator

/AzurePipelines run Azure.sonic-buildimage


Azure Pipelines successfully started running 1 pipeline(s).

@stepanblyschak
Collaborator Author

Seems like a network issue:

curl: (22) The requested URL returned error: 404
Try 5: /usr/bin/curl --retry-connrefused --retry 5 failed to get: -f
dget: curl ipmitool_1.8.19-4+deb12u1.debian.tar.xz https://deb.debian.org/debian/pool/main/i/ipmitool/ipmitool_1.8.19-4+deb12u1.debian.tar.xz failed
ipmitool_1.8.19-4+deb12u1.dsc:
      Good signature found
   validating ipmitool_1.8.19.orig.tar.gz
   skipping  ipmitool_1.8.19-4+deb12u1.debian.tar.xz (not present)
All files validated successfully.
dpkg-source: error: cannot fstat file ./ipmitool_1.8.19-4+deb12u1.debian.tar.xz: No such file or directory
make[1]: *** [Makefile:9: /sonic/target/debs/bookworm/ipmitool_1.8.19-4+deb12u1_amd64.deb] Error 25

@stepanblyschak
Collaborator Author

/azpw run Azure.sonic-buildimage

@mssonicbld
Collaborator

/AzurePipelines run Azure.sonic-buildimage


Azure Pipelines successfully started running 1 pipeline(s).

@qiluo-msft qiluo-msft merged commit 56c07d2 into sonic-net:master Dec 2, 2024
22 checks passed