[healthd] fix healthd shutdown race #19504
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
@Junchao-Mellanox could you please help to review?
/azpw run Azure.sonic-buildimage
/AzurePipelines run Azure.sonic-buildimage
Azure Pipelines successfully started running 1 pipeline(s).
This change causes a side effect: during the shutdown flow, system-health sometimes takes more than 90 seconds to shut down.
I think, because of the shutdown call removal, the
Please note that this is not seen every time. When the queue has some items during this time, the get call returns and shutdown continues without any problem. The probability of reproducing it is 25%.
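To illustrate the behaviour described above, here is a minimal sketch of a consumer loop that polls a queue with a timeout and re-checks a stop flag between polls. It is not the system-health code: it uses threading and queue.Queue from the standard library instead of a multiprocessing.Manager queue so that it stays self-contained, and the QUEUE_TIMEOUT value, the consume and run_demo helpers, and the "example.service" message are all assumptions made for the example.

```python
import queue
import threading
import time

QUEUE_TIMEOUT = 0.2  # hypothetical value; the real constant is not shown in this thread


def consume(q, stopping, handled):
    """Poll q until `stopping` is set, recording the "unit" of each message."""
    while not stopping.is_set():
        try:
            msg = q.get(timeout=QUEUE_TIMEOUT)
        except queue.Empty:
            # Nothing arrived within the timeout: loop back and re-check the stop flag.
            continue
        handled.append(msg["unit"])


def run_demo():
    q = queue.Queue()
    stopping = threading.Event()
    handled = []
    worker = threading.Thread(target=consume, args=(q, stopping, handled))
    worker.start()

    # With an item in the queue, get() returns immediately.
    q.put({"unit": "example.service"})
    deadline = time.monotonic() + 2
    while not handled and time.monotonic() < deadline:
        time.sleep(0.01)

    # With an empty queue, the worker notices the stop flag only after
    # the current get() call times out.
    stopping.set()
    worker.join(timeout=5)
    return handled, worker.is_alive()


if __name__ == "__main__":
    print(run_demo())
```

In this sketch the time the worker needs to exit is bounded by QUEUE_TIMEOUT, because the stop flag is only checked between get calls.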
/azp run Azure.sonic-buildimage
Azure Pipelines successfully started running 1 pipeline(s).
@stepanblyschak please address the above issue from Vivek
@stepanblyschak PR still in draft?
@stepanblyschak fix build errors?
I investigated the issue a bit more and found a couple of problems:
    # Clear the resources of mpmgr- Queue
    self.mpmgr.shutdown()

    while not self.task_stopping_event.is_set():
        try:
            msg = self.myQ.get(timeout=QUEUE_TIMEOUT)
            event = msg["unit"]

The queue

    # Wait for the process to exit
    self._task_process.join(self._stop_timeout_secs)
    # If the process didn't exit, attempt to kill it
    if self._task_process.is_alive():
        logger.log_notice("Attempting to kill sysmon main process with pid {}".format(self._task_process.pid))
        os.kill(self._task_process.pid, signal.SIGKILL)
    if self._task_process.is_alive():
        logger.log_error("Sysmon main process with pid {} could not be killed".format(self._task_process.pid))
        return False

The
since it is blocked in MonitorSystemBusTask in

    loop = GLib.MainLoop()
    loop.run()

Command line to reproduce:
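As an illustration of the join-then-kill escalation quoted in the comment above (not a reproduction command), here is a minimal sketch of the same pattern. It swaps in subprocess.Popen for the multiprocessing process used by the real code so that it runs on its own. The CHILD_CODE script, the stop_with_escalation and run_demo helpers, and the 0.5 second timeout are hypothetical, and the child simply ignores SIGTERM to stand in for a task that never reacts to it. It assumes a POSIX system.

```python
import signal
import subprocess
import sys

# Hypothetical child: ignores SIGTERM, standing in for a task that never reacts to it.
CHILD_CODE = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def stop_with_escalation(proc, stop_timeout_secs):
    """Return True if SIGTERM was enough, False if SIGKILL was required."""
    proc.terminate()  # sends SIGTERM on POSIX
    try:
        proc.wait(timeout=stop_timeout_secs)
        return True
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL on POSIX
        proc.wait()
        return False


def run_demo():
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE],
        stdout=subprocess.PIPE,
        text=True,
    )
    # Wait until the child has installed its SIGTERM handler.
    proc.stdout.readline()
    graceful = stop_with_escalation(proc, stop_timeout_secs=0.5)
    proc.stdout.close()
    killed_by_sigkill = proc.returncode == -signal.SIGKILL
    return graceful, killed_by_sigkill


if __name__ == "__main__":
    print(run_demo())
```

In this sketch the stop call always costs at least the full timeout when the child does not exit on its own, which is the kind of delay discussed in this thread.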
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
@prgeor Please review
/azpw run
/AzurePipelines run
Azure Pipelines successfully started running 1 pipeline(s).
Seems like unrelated tests failed, restarting:
/azpw run Azure.sonic-buildimage
/AzurePipelines run Azure.sonic-buildimage
Azure Pipelines successfully started running 1 pipeline(s).
Broadcom build failed, restarting:
/azpw run Azure.sonic-buildimage
/AzurePipelines run Azure.sonic-buildimage
Azure Pipelines successfully started running 1 pipeline(s).
/azpw run Azure.sonic-buildimage
/AzurePipelines run Azure.sonic-buildimage
Azure Pipelines successfully started running 1 pipeline(s).
Seems like a network issue:
/azpw run Azure.sonic-buildimage
/AzurePipelines run Azure.sonic-buildimage
Azure Pipelines successfully started running 1 pipeline(s).
Why I did it
To fix errors that happen when writing to the queue:
When the multiprocessing.Manager is shut down, the queue raises the above errors. This happens during shutdown (fast-reboot, warm-reboot).
With the fix, system-health service does not hang:
How I did it
Remove the call to shutdown. The cleanup happens automatically when the manager is garbage collected, as per the documentation: https://docs.python.org/3/library/multiprocessing.html
How to verify it
Run warm-reboot, fast-reboot multiple times and verify no errors in the log.