simple_action_server.py can deadlock #4
@miguelsdc Can you please provide a complete and reproducible example? The code from the referenced answers.ros.org post seems to be incomplete.
@dirk-thomas see the https://github.com/miguelsdc/deadlockTest repository. It has a complete ROS project (rosbuild; I haven't yet got the hang of catkin) which will make actionlib deadlock. I know the code seems too demanding on actionlib and not representative of a real workload (e.g. high frequency of requests, the remote task is not really useful). As it turns out, getting one thread to accept an already-existing goal at the same time as another thread is processing a newly arrived goal doesn't happen easily. The code in the repo deadlocks reliably (particularly with the brutal launch file). Moreover, for some reason, using naoqi + nao_controller.py also triggers the deadlock (which is how I originally came across this issue). Unfortunately, deadlockTest still deadlocks even after the proposed patch #5 is applied. There are three possible reasons: 1) deadlockTest is badly coded, 2) the patch is badly coded, 3) there's another potential deadlock in simple_action_server.py. I will investigate further tomorrow...
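For readers without the repository handy, here is a minimal sketch of the kind of high-frequency stress client described above. The node name, action name, and use of actionlib's bundled Test.action are illustrative assumptions, not taken from the deadlockTest repo:

```python
#!/usr/bin/env python
# Illustrative stress client: flood a SimpleActionServer with goals so
# that one thread is accepting an existing goal while another thread is
# processing a newly arrived one. Names here are hypothetical.
import rospy
import actionlib
from actionlib.msg import TestAction, TestGoal  # Test.action ships with actionlib


def main():
    rospy.init_node('deadlock_stress_client')
    client = actionlib.SimpleActionClient('deadlock_test', TestAction)
    client.wait_for_server()

    # Send goals as fast as possible; each new goal preempts the previous
    # one inside SimpleActionServer, exercising the goal/preempt callbacks
    # concurrently with the execute loop.
    while not rospy.is_shutdown():
        client.send_goal(TestGoal(goal=1))


if __name__ == '__main__':
    main()
```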
As it turns out, deadlockTest uncovered other race conditions. #5 has been updated to fix these race conditions as well. As an informal test, roslaunching deadlockTestBrutal results in a deadlock in less than 5 seconds on my machine. After applying the patch, my machine ran happily for 15 minutes... Hope that helps,
I ran your example and reviewed the changes, and I can confirm that it fixes the deadlock. But I am a bit worried about the removed lock in the …
I am actually not sure I agree with this statement in the patch. I think worse things can happen than just skipping a cycle. One example would result in error messages due to not locking around …
Okay, I've been re-reading the source code of simple_action_server.py and I still think there's no need to lock that area in …

There are two functions that can potentially cause race conditions: …

Regarding the first one: in polling implementations there's no problem because … As far as I can tell, the only calls that change the outcome of … Similarly, for …

Nevertheless, just needing that much text to explain that the code is okay (assuming I haven't missed anything) is clearly far from ideal, and could cause trouble should some of the assumptions stop being valid in the future. I see three options going forward: …
I reckon we should only take option 1 if you don't find any issues with the patch as it is now. Option 2 is the easiest to implement and, to be honest, I don't think the performance penalty is going to be that significant. Option 3 may seem stupid, but as the code stands, whenever you have … Which one should we go for? Or do you have any other ideas?
I wrapped my head around it one more time. I still think that there are potential race conditions related to … Therefore I think option 2 or 3 would be over-locking. I have created an updated pull request which mainly adds comments and removes the useless …
It would be very valuable to have your deadlock test as an actual unit test in actionlib. Could you convert it into a unittest using GTest to ensure that this will not break again in the future?
The updated pull request looks great! Indeed, the comments are more useful and the code in …

I think it may be a good idea to cherry-pick the patch into the fuerte, groovy and hydro branches (all distros currently share the same simple_action_server.py file, after all), even more so since fuerte's EOL is fast approaching. What do you think?

I'll convert deadlockTest into an actual unit test (but using Python's unittest, if that's okay) and create a new pull request for it shortly.
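For the record, a rough sketch of what such a Python unittest might look like. The action name, timeouts, and overall structure are illustrative assumptions; #8 contains the test that was actually proposed:

```python
#!/usr/bin/env python
# Rough sketch of a deadlock regression test using Python's unittest.
# Names and timeouts are illustrative, not the test merged via #8.
import unittest

import rospy
import actionlib
from actionlib.msg import TestAction, TestGoal


class TestSimpleActionServerDeadlock(unittest.TestCase):

    def execute_cb(self, goal):
        # Trivial callback: succeed immediately so goals churn quickly.
        self.server.set_succeeded()

    def test_deadlock(self):
        self.server = actionlib.SimpleActionServer(
            'deadlock_test', TestAction, execute_cb=self.execute_cb,
            auto_start=False)
        self.server.start()

        client = actionlib.SimpleActionClient('deadlock_test', TestAction)
        self.assertTrue(client.wait_for_server(rospy.Duration(5.0)))

        # Hammer the server with goals for a few seconds; if the server
        # deadlocks, the final wait_for_result() below never succeeds.
        deadline = rospy.Time.now() + rospy.Duration(5.0)
        while rospy.Time.now() < deadline:
            client.send_goal(TestGoal(goal=1))

        client.send_goal(TestGoal(goal=1))
        self.assertTrue(client.wait_for_result(rospy.Duration(10.0)))


if __name__ == '__main__':
    rospy.init_node('deadlock_test_node')
    unittest.main()
```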
Great, thank you.
Just added a new pull request with the proposed test (see description in #8).
Changes since 1.9.12:

1.11.2 (2014-05-20)
-------------------
* Update python publishers to define queue_size.
* Use the correct queue for processing MessageEvents
* Contributors: Esteve Fernandez, Michael Ferguson, Nican

1.11.1 (2014-05-08)
-------------------
* Fix uninitialised execute_thread_ member pointer
* Make rostest in CMakeLists optional
* Use catkin_install_python() to install Python scripts
* Contributors: Dirk Thomas, Esteve Fernandez, Jordi Pages, Lukas Bulwahn

1.11.0 (2014-02-13)
-------------------
* replace usage of __connection_header with MessageEvent (`#20 <https://github.com/ros/actionlib/issues/20>`_)

1.10.3 (2013-08-27)
-------------------
* Merged pull request `#15 <https://github.com/ros/actionlib/issues/15>`_: fixes a compile issue for actionlib headers on OS X

1.10.2 (2013-08-21)
-------------------
* separating ActionServer implementation into base class and ros-publisher-based class (`#11 <https://github.com/ros/actionlib/issues/11>`_)
* support CATKIN_ENABLE_TESTING
* add isValid to ServerGoalHandle (`#14 <https://github.com/ros/actionlib/issues/14>`_)
* make operators const (`#10 <https://github.com/ros/actionlib/issues/10>`_)
* add counting of connections to avoid reconnect problem when callbacks are invoked in different order (`#7 <https://github.com/ros/actionlib/issues/7>`_)
* fix deadlock in simple_action_server.py (`#4 <https://github.com/ros/actionlib/issues/4>`_)
* fix missing runtime destination for library (`#3 <https://github.com/ros/actionlib/issues/3>`_)

1.10.1 (2013-06-06)
-------------------
* fix location of library before installation (`#1 <https://github.com/ros/actionlib/issues/1>`_)

1.10.0 (2013-04-11)
-------------------
* define DEPRECATED only if not defined already
* modified dependency type of catkin to buildtool
From: http://answers.ros.org/question/50276/python-simpleactionserver-stops-unexpectedly-deadlock/
I wasn't able to get the Python debugger to produce stack traces that explained the problem. But after looking at the roslog debug messages and the source code, I came up with the following reconstruction of events:
Thread A (rospy's message dispatcher?)
Deadlocked in: self.execute_condition.acquire()
In function: SimpleActionServer.internal_goal_callback() [simple_action_server.py:211], which was called from ActionServer.internal_goal_callback() [action_server.py:293].
This thread holds ActionServer.lock and wants to acquire SimpleActionServer.lock (the condition variable was initialised with the latter lock).
Thread B (SimpleActionServer's executeLoop thread)
Deadlocked in: with self.action_server.lock
In function: ServerGoalHandle.set_accepted() [server_goal_handle.py:71], which was called from SimpleActionServer.accept_new_goal() [simple_action_server.py:131], which was called from SimpleActionServer.executeLoop() [simple_action_server.py:284], which at that point is holding SimpleActionServer.lock.
This thread holds SimpleActionServer.lock and wants ActionServer.lock.
In summary, if a new goal arrives at the same time that executeLoop is trying to accept a previous (but still new) goal, SimpleActionServer will deadlock.
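Stripped of the actionlib specifics, this is the classic ABBA lock-ordering deadlock. A self-contained sketch of the same pattern; the lock names are stand-ins for the actionlib ones, and the sleeps are added only to make the bad interleaving reliable:

```python
# Thread A holds ActionServer.lock and wants SimpleActionServer's lock
# (via the condition variable); thread B holds the latter and wants the
# former. Neither can proceed.
import threading
import time

action_server_lock = threading.Lock()       # stands in for ActionServer.lock
execute_condition = threading.Condition()   # wraps SimpleActionServer's lock


def thread_a():
    # Like rospy's dispatcher: internal_goal_callback holds ActionServer.lock,
    # then tries to take SimpleActionServer's lock via execute_condition.
    with action_server_lock:
        time.sleep(0.1)                      # widen the race window for the demo
        with execute_condition:
            execute_condition.notify()


def thread_b():
    # Like executeLoop: holds SimpleActionServer's lock around accept_new_goal(),
    # which internally needs ActionServer.lock.
    with execute_condition:
        time.sleep(0.1)
        with action_server_lock:
            pass


a = threading.Thread(target=thread_a)
b = threading.Thread(target=thread_b)
a.start(); b.start()
a.join(); b.join()   # never returns: A waits for B's lock and vice versa
```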
I suspect the solution involves calling accept_new_goal() [simple_action_server.py:284] without holding SimpleActionServer.lock. My intuition is that simply setting a flag will do, but I will have to study the code a bit more to make sure there are no side-effects.
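A sketch of that flag idea; it mirrors the shape of SimpleActionServer.executeLoop but is not the patch that was eventually merged in #5:

```python
import rospy

# Decide *inside* the lock, but call accept_new_goal() only *after*
# releasing it, so executeLoop never holds SimpleActionServer's lock
# while taking ActionServer.lock. This removes the A-B / B-A lock
# ordering described in the reconstruction above.
def executeLoop(self):
    while not rospy.is_shutdown():
        shall_run = False
        with self.lock:                      # SimpleActionServer's own lock
            if self.is_new_goal_available():
                shall_run = True

        if shall_run:
            # accept_new_goal() takes ActionServer.lock internally;
            # crucially, self.lock is no longer held at this point.
            goal = self.accept_new_goal()
            self.execute_callback(goal)
        else:
            rospy.sleep(0.01)  # illustrative polling; the real loop waits on a condition variable
```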