simple_action_server.py can deadlock #4
@miguelsdc Can you please provide a complete and reproducible example? The code from the referenced answers.ros.org post seems to be incomplete.
@dirk-thomas see the https://github.com/miguelsdc/deadlockTest repository. It has a complete ROS project (rosbuild; I haven't yet got the hang of catkin) which will make actionlib deadlock. I know the code seems too demanding on actionlib and not representative of a real workload (e.g. high frequency of requests, the remote task is not really useful). As it turns out, getting one thread to accept an already-existing goal at the same time as another thread is processing a newly arrived goal doesn't happen easily. The code in the repo deadlocks reliably (particularly with the brutal launch file). Moreover, for some reason, using naoqi + nao_controller.py also triggers the deadlock (which is how I originally came across this issue). Unfortunately, deadlockTest still deadlocks even after the proposed patch #5 is applied. There are three possible reasons: 1) deadlockTest is badly coded, 2) the patch is badly coded, 3) there's another potential deadlock in simple_action_server.py. I will investigate further tomorrow...
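For readers without the repository handy, here is a minimal sketch of the kind of high-frequency stress client described above. The node name, action name, and use of actionlib's bundled Test.action are illustrative assumptions, not taken from the deadlockTest repo:

```python
#!/usr/bin/env python
# Illustrative stress client: flood a SimpleActionServer with goals so
# that one thread is accepting an existing goal while another thread is
# processing a newly arrived one. Names here are hypothetical.
import rospy
import actionlib
from actionlib.msg import TestAction, TestGoal  # Test.action ships with actionlib


def main():
    rospy.init_node('deadlock_stress_client')
    client = actionlib.SimpleActionClient('deadlock_test', TestAction)
    client.wait_for_server()

    # Send goals as fast as possible; each new goal preempts the previous
    # one inside SimpleActionServer, exercising the goal/preempt callbacks
    # concurrently with the execute loop.
    while not rospy.is_shutdown():
        client.send_goal(TestGoal(goal=1))


if __name__ == '__main__':
    main()
```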
As it turns out, deadlockTest uncovered other race conditions. #5 has been updated to fix these race conditions as well. As an informal test, roslaunching deadlockTestBrutal results in a deadlock in less than 5 seconds on my machine. After applying the patch, my machine ran happily for 15 minutes... Hope that helps,
I ran your example and reviewed the changes, and I can confirm that it fixes the deadlock. But I am a bit worried about the removed lock in the …
I am actually not sure I agree with this statement in the patch. I think worse things can happen than just skipping a cycle. One example would result in error messages due to not locking around …
Okay, I've been re-reading the source code of simple_action_server.py and I still think there's no need to lock that area in …

There are two functions that can potentially cause race conditions: …

Regarding the first one: in polling implementations there's no problem because … As far as I can tell, the only calls that change the outcome of … Similarly, for …

Nevertheless, just needing that much text to explain that the code is okay (assuming I haven't missed anything) is clearly far from ideal, and could cause trouble should some of the assumptions stop being valid in the future. I see three options going forward: …
I reckon we should only take option 1 if you don't find any issues with the patch as it is now. Option 2 is the easiest to implement and, to be honest, I don't think the performance penalty is going to be that significant. Option 3 may seem stupid, but as the code stands, whenever you have … Which one should we go for? Or do you have any other ideas?
I wrapped my head around it one more time. I still think that there are potential race conditions related to … Therefore I think option 2 or 3 would be over-locking. I have created an updated pull request which mainly adds comments and removes the useless …
It would be very valuable to have your deadlock test as an actual unit test in actionlib. Could you convert it into a unittest using GTest to ensure that this will not break again in the future?
The updated pull request looks great! Indeed, the comments are more useful and the code in …

I think it may be a good idea to cherry-pick the patch into the fuerte, groovy and hydro branches (all distros currently share the same simple_action_server.py file, after all), even more so since fuerte's EOL is fast approaching. What do you think?

I'll convert deadlockTest into an actual unit test (but using Python's unittest, if that's okay) and create a new pull request for it shortly.
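For the record, a rough sketch of what such a Python unittest might look like. The action name, timeouts, and overall structure are illustrative assumptions; #8 contains the test that was actually proposed:

```python
#!/usr/bin/env python
# Rough sketch of a deadlock regression test using Python's unittest.
# Names and timeouts are illustrative, not the test merged via #8.
import unittest

import rospy
import actionlib
from actionlib.msg import TestAction, TestGoal


class TestSimpleActionServerDeadlock(unittest.TestCase):

    def execute_cb(self, goal):
        # Trivial callback: succeed immediately so goals churn quickly.
        self.server.set_succeeded()

    def test_deadlock(self):
        self.server = actionlib.SimpleActionServer(
            'deadlock_test', TestAction, execute_cb=self.execute_cb,
            auto_start=False)
        self.server.start()

        client = actionlib.SimpleActionClient('deadlock_test', TestAction)
        self.assertTrue(client.wait_for_server(rospy.Duration(5.0)))

        # Hammer the server with goals for a few seconds; if the server
        # deadlocks, the final wait_for_result() below never succeeds.
        deadline = rospy.Time.now() + rospy.Duration(5.0)
        while rospy.Time.now() < deadline:
            client.send_goal(TestGoal(goal=1))

        client.send_goal(TestGoal(goal=1))
        self.assertTrue(client.wait_for_result(rospy.Duration(10.0)))


if __name__ == '__main__':
    rospy.init_node('deadlock_test_node')
    unittest.main()
```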
Great, thank you.
Just added a new pull request with the proposed test (see description in #8).
Changes since 1.9.12:

1.11.2 (2014-05-20)
-------------------
* Update python publishers to define queue_size.
* Use the correct queue for processing MessageEvents
* Contributors: Esteve Fernandez, Michael Ferguson, Nican

1.11.1 (2014-05-08)
-------------------
* Fix uninitialised execute_thread_ member pointer
* Make rostest in CMakeLists optional
* Use catkin_install_python() to install Python scripts
* Contributors: Dirk Thomas, Esteve Fernandez, Jordi Pages, Lukas Bulwahn

1.11.0 (2014-02-13)
-------------------
* replace usage of __connection_header with MessageEvent (`#20 <https://github.com/ros/actionlib/issues/20>`_)

1.10.3 (2013-08-27)
-------------------
* Merged pull request `#15 <https://github.com/ros/actionlib/issues/15>`_: fixes a compile issue for actionlib headers on OS X

1.10.2 (2013-08-21)
-------------------
* separating ActionServer implementation into base class and ros-publisher-based class (`#11 <https://github.com/ros/actionlib/issues/11>`_)
* support CATKIN_ENABLE_TESTING
* add isValid to ServerGoalHandle (`#14 <https://github.com/ros/actionlib/issues/14>`_)
* make operators const (`#10 <https://github.com/ros/actionlib/issues/10>`_)
* add counting of connections to avoid reconnect problem when callbacks are invoked in different order (`#7 <https://github.com/ros/actionlib/issues/7>`_)
* fix deadlock in simple_action_server.py (`#4 <https://github.com/ros/actionlib/issues/4>`_)
* fix missing runtime destination for library (`#3 <https://github.com/ros/actionlib/issues/3>`_)

1.10.1 (2013-06-06)
-------------------
* fix location of library before installation (`#1 <https://github.com/ros/actionlib/issues/1>`_)

1.10.0 (2013-04-11)
-------------------
* define DEPRECATED only if not defined already
* modified dependency type of catkin to buildtool
From: http://answers.ros.org/question/50276/python-simpleactionserver-stops-unexpectedly-deadlock/
I wasn't able to get the Python debugger to produce stack traces that explained the problem. But after looking at the roslog debug messages and the source code, I came up with the following reconstruction of events:
Thread A (rospy's message dispatcher?)
Deadlocked in: self.execute_condition.acquire()
In function: SimpleActionServer.internal_goal_callback() [simple_action_server.py:211], which was called from ActionServer.internal_goal_callback() [action_server.py:293].
This thread holds ActionServer.lock and wants to acquire SimpleActionServer.lock (the condition variable was initialised with the latter lock).
Thread B (SimpleActionServer's executeLoop thread)
Deadlocked in: with self.action_server.lock
In function: ServerGoalHandle.set_accepted() [server_goal_handle.py:71], which was called from SimpleActionServer.accept_new_goal() [simple_action_server.py:131], which was called from SimpleActionServer.executeLoop() [simple_action_server.py:284], which at that point is holding SimpleActionServer.lock.
This thread holds SimpleActionServer.lock and wants ActionServer.lock.
In summary, if a new goal arrives at the same time that executeLoop is trying to accept a previous (but still new) goal, SimpleActionServer will deadlock.
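Stripped of the actionlib specifics, this is the classic ABBA lock-ordering deadlock. A self-contained sketch of the same pattern; the lock names are stand-ins for the actionlib ones, and the sleeps are added only to make the bad interleaving reliable:

```python
# Thread A holds ActionServer.lock and wants SimpleActionServer's lock
# (via the condition variable); thread B holds the latter and wants the
# former. Neither can proceed.
import threading
import time

action_server_lock = threading.Lock()       # stands in for ActionServer.lock
execute_condition = threading.Condition()   # wraps SimpleActionServer's lock


def thread_a():
    # Like rospy's dispatcher: internal_goal_callback holds ActionServer.lock,
    # then tries to take SimpleActionServer's lock via execute_condition.
    with action_server_lock:
        time.sleep(0.1)                      # widen the race window for the demo
        with execute_condition:
            execute_condition.notify()


def thread_b():
    # Like executeLoop: holds SimpleActionServer's lock around accept_new_goal(),
    # which internally needs ActionServer.lock.
    with execute_condition:
        time.sleep(0.1)
        with action_server_lock:
            pass


a = threading.Thread(target=thread_a)
b = threading.Thread(target=thread_b)
a.start(); b.start()
a.join(); b.join()   # never returns: A waits for B's lock and vice versa
```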
I suspect the solution involves calling accept_new_goal() [simple_action_server.py:284] without holding SimpleActionServer.lock. My intuition is that simply setting a flag will do, but I will have to study the code a bit more to make sure there are no side-effects.
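A sketch of that flag idea; it mirrors the shape of SimpleActionServer.executeLoop but is not the patch that was eventually merged in #5:

```python
import rospy

# Decide *inside* the lock, but call accept_new_goal() only *after*
# releasing it, so executeLoop never holds SimpleActionServer's lock
# while taking ActionServer.lock. This removes the A-B / B-A lock
# ordering described in the reconstruction above.
def executeLoop(self):
    while not rospy.is_shutdown():
        shall_run = False
        with self.lock:                      # SimpleActionServer's own lock
            if self.is_new_goal_available():
                shall_run = True

        if shall_run:
            # accept_new_goal() takes ActionServer.lock internally;
            # crucially, self.lock is no longer held at this point.
            goal = self.accept_new_goal()
            self.execute_callback(goal)
        else:
            rospy.sleep(0.01)  # illustrative polling; the real loop waits on a condition variable
```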