-
Notifications
You must be signed in to change notification settings - Fork 165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Xcvrd should restart if any child thread crashes #326
Conversation
Signed-off-by: Mihir Patel <patelmi@microsoft.com>
@keboliu can you review |
self.exc = e | ||
self.main_thread_stop_event.set() | ||
|
||
def raise_exception(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add comment here why this exception is raised here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added a comment now.
@mihirpat1 can you check the behavior if you kill the threads with |
@jaganbal-a @shyam77git please review |
Signed-off-by: Mihir Patel <patelmi@microsoft.com>
It seems we won't be able to kill threads using |
What if "CmisManagerTask " is disabled with intention? On the Mellanox platforms "CmisManagerTask" is disabled by setting the "skip_xcvrd_cmis_mgr" to "true". |
If "CmisManagerTask" is disabled with intention, we will not spawn thread for it. Also, I have attached the UT for it ( |
@@ -2199,9 +2247,10 @@ class DaemonXcvrd(daemon_base.DaemonBase): | |||
def __init__(self, log_identifier, skip_cmis_mgr=False): | |||
super(DaemonXcvrd, self).__init__(log_identifier) | |||
self.stop_event = threading.Event() | |||
self.sfp_error_event = multiprocessing.Event() | |||
self.sfp_error_event = threading.Event() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sfp_error_event event seems not handled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems we are using it at the below:
https://github.com/sonic-net/sonic-platform-daemons/blob/master/sonic-xcvrd/xcvrd/xcvrd.py#L2391-L2392
if self.sfp_error_event.is_set():
sys.exit(SFP_SYSTEM_ERROR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But that code will never get hit, when the event get set due to error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, since we are killing the parent process (xcvrd) if next_state is STATE_EXIT, the above code will never get hit. This seems like a day 1 behavior though.
threading.Thread.__init__(self) | ||
self.name = "CmisManagerTask" | ||
self.exc = None | ||
self.task_stopping_event = threading.Event() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please explain how this event created inside the thread will be used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This event will be used to stop the handling the task_worker routine (in case of exception handling or graceful shutdown of the thread). The task_worker routine will be running in a while loop waiting for task_stopping_event to be set in addition to performing necessary actions.
Hence, once we want the thread to exit, we must stop the execution of task_worker routine so that it comes out of the while loop and allows us to proceed with exception handling or graceful shutdown.
|
||
# Stop the sfp state info update thread | ||
if sfp_state_update.is_alive(): | ||
sfp_state_update.raise_exception() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will it be good to join(1) with timeout and raise exception if timeout expires and the thread is still alive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel it's better to raise exception right away since based on my current analysis on Arista testbed, it takes about ~5s for the vendor API to come out of sleep. This timeout I assume could be dynamic based on the transceivers present on the device so I think having 1s delay will not give us a much advantage for join function with 1s timeout.
…daemons into xcvr_thread_support
…sed to be skipped. Also, moidified join function to handle accordingly. Added logs for showing names of threads spawned Signed-off-by: Mihir Patel <patelmi@microsoft.com>
* Xcvrd should restart if any child thread crashes Signed-off-by: Mihir Patel <patelmi@microsoft.com> * Resolved test_SfpStateUpdateTask_task_run_stop test failure * Added comment for raise_exception Signed-off-by: Mihir Patel <patelmi@microsoft.com> * Added check for avoiding cmis_manager.start() if CMIS thread is supposed to be skipped. Also, moidified join function to handle accordingly. Added logs for showing names of threads spawned Signed-off-by: Mihir Patel <patelmi@microsoft.com> Signed-off-by: Mihir Patel <patelmi@microsoft.com>
@StormLiangMS please pick to 202211 |
* Xcvrd should restart if any child thread crashes Signed-off-by: Mihir Patel <patelmi@microsoft.com> * Resolved test_SfpStateUpdateTask_task_run_stop test failure * Added comment for raise_exception Signed-off-by: Mihir Patel <patelmi@microsoft.com> * Added check for avoiding cmis_manager.start() if CMIS thread is supposed to be skipped. Also, moidified join function to handle accordingly. Added logs for showing names of threads spawned Signed-off-by: Mihir Patel <patelmi@microsoft.com> Signed-off-by: Mihir Patel <patelmi@microsoft.com>
Description
Make all task workers as child thread of main process.
The main process should watch for any child thread crash and do suicide to restart itself. Also ensure the backtrace of crashing child thread is shown
Motivation and Context
Overall, the change set now adheres to a template for creating thread and handles exceptions followed by restarting xcvrd.
With this change set, 3 threads will be created:
All the above threads have now been named accordingly.
Once an exception is raised, the run() method will generate a traceback and will set the main_thread_stop event.
The xcvrd daemon keeps waiting until main_thread_stop event is generated. Once this event is generated, we first check if any thread is dead. If so, we capture the exception, evaluate if other threads have an exception generated as well and then force the xcvrd process to kill itself.
In case all threads are alive, we allow xcvrd daemon to gracefully exit.
Note - In order to kill SfpStateUpdateTask thread gracefully, we are manually raising an exception since SfpStateUpdateTask has a call to an API which could potentially sleep in the order of seconds and hence, could block the xcvrd daemon graceful shutdown process for a prolonged time.
How Has This Been Tested?
Please find the summary below.
1. Kill xcvrd
2. Restart xcvrd
3. Restart pmon
4. Kill 1 thread and ensure all threads are killed and xcvrd is respawned
5. Ensure backtrace is generated with a manual exception - CmisManagerTask (Please refer to the below note for details)
6. Ensure backtrace is generated with a manual exception - DomInfoUpdateTask (Please refer to the below note for details)
7. Ensure backtrace is generated with a manual exception - SfpStateUpdateTask (Please refer to the below note for details)
8. Ensure on Mellanox, no side effect is seen since they do not spawn CMIS thread
For a detailed UT, please refer to the below enclosure.
xcvrd_thread_support.txt
Additional Information (Optional)
Signed-off-by: Mihir Patel patelmi@microsoft.com