Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

simple_action_server.py can deadlock #4

Closed
tfoote opened this issue Jul 12, 2013 · 11 comments
Closed

simple_action_server.py can deadlock #4

tfoote opened this issue Jul 12, 2013 · 11 comments
Assignees
Labels

Comments

@tfoote
Copy link
Member

tfoote commented Jul 12, 2013

From: http://answers.ros.org/question/50276/python-simpleactionserver-stops-unexpectedly-deadlock/

I wasn't able to get the python debugger to produce the stack traces that explained the problem. But after looking at the roslog debug messages and the source code, I came up with the following reconstruction of the events:

Thread A (rospy's message dispatcher?)

Deadlocked in: self.execute_condition.acquire()
In function: SimpleActionServer.internal_goal_callback() [simple_action_server.py:211] which was called from: ActionServer.internal_goal_callback() [action_server.py:293]
This thread has ActionServer.lock and wants to acquire SimpleActionServer.lock (condition variable was initialised with the latter lock).
Thread B (SimpleActionServer's executeLoop thread)

Deadlocked in: with self.action_server.lock
In function ServerGoalHandle.set_accepted() [server_goal_handle.py:71] which was called from: SimpleActionServer.accept_new_goal()[simple_action_server.py:131] which was called from: SimpleActionServer.executeLoop()[simple_action_server.py:284] which at that point is holding SimpleActionServer.lock
This thread wants ActionServer.lock and has SimpleActionServer.lock
In summary, if if a new goal arrives at the same time executeLoop is trying to get a previous (but still new, SimpleActionServer will deadlock.

I suspect the solution involves calling accept_new_goal() [simple_action_server.py:284] without holding SimpleActionServer.lock. My intuition is that simply setting a flag will do, but I will have to study the code a bit more to make sure there no side-effects.

@dirk-thomas
Copy link
Member

@miguelsdc Can you please provide a complete and reproducible example? The code from the referenced answers.ros.org post seems to be incomplete.

@miguelsdc
Copy link
Contributor

@dirk-thomas see https://github.com/miguelsdc/deadlockTest repository. It has a complete ROS project (rosbuild, haven't yet got the hang of catkin) which will make actionlib deadlock.

I know the code seems too demanding on actionlib and not representative of a real workload (eg. high frequency of requests, remote task is not really useful). As it turns out, getting one thread to accept an already-exisiting goal at the same time as the other thread is processing a newly arrived goal doesn't happen easily. The code in the repo can do deadlock reliably (particularly with the brutal launch file). Moreover, for some reason, using naoqi+nao_controller.py, also triggers the deadlock (which is how I originally cam across this issue).

Unfortunately, deadlockTest also deadlocks even after the proposed patch #5 is applied. There are 3 possible reasons: 1) deadlockTest is badly coded, 2) the patch is badly coded, 3) there's another potential deadlock in simple_action_server.py.

I will investigate further tomorrow...

@miguelsdc
Copy link
Contributor

As it turns out, deadlockTest unconvered other race conditions. #5 has been updated to fix these race conditions.

As an informal test, roslaunching deadlockTestBrutal results in a deadlock in less than 5 seconds on my machine. After applying the patch, my machine was happily running for 15min...

Hope that helps,
Miguel S.

@dirk-thomas
Copy link
Member

I ran you example an reviewed the changes and can confirm that it fixes the deadlock.

But I am a bit worried about the removed lock in the executeLoop(). I do see why it was necessary to avoid locking in the wrong order but I think that enables new race conditions (but I could not modify your example yet to expose the cases I suspect exist).

These actions can be safely performed without locking; at worst, we wait for an iteration to process a new goal

I am actually not sure to agree to this statement in the patch. I think worse things can happen than just skipping a cycle. One example would result in error messages due to not locking around self.is_active() anymore.

@miguelsdc
Copy link
Contributor

Okay I've been re-reading the source code of simple_action_server.py and I still think there's no need to lock that area in executeLoop(). Let me explain why.

There are two functions that can potentially cause race conditions: self.is_active() and self.is_new_goal_available().

Regarding the first one: in polling implementations there's no problem because executeLoop() is never called. In callback implementations I'm assuming all self.set_aborted/preempted/succeeded() should be called by self.execute_callback(goal). If the user forgets to set the status, then the goal is set to aborted (and the user is warned). I've searched through the code and found no instance of any other functions setting the state of the current goal.

As far as I can tell, the only calls to change the outcome of set_active() are made by the executeLoop() thread. So no race conditions on the first one, unless I'm missing something?.

Similarly, for self.is_new_goal_available() there could be a serious error if self.new_goal was set to False between executeLoop()'s self.is_new_goal_available() and self.accept_new_goal(). However a quick search of the source code reveals that self.new_goal is only set to False by self.accept_new_goal(). Ergo, no race conditions :)

Nevertheless just needing that much text to explain that code is okay (assuming I haven't missed anything) is clearly far from ideal, and could cause trouble should some of the assumptions stop being valid in the future.

I see three options going forward:

  1. Update comment to contain a summarised version of what I said above and warn about implicit assumptions so that new race conditions do not appear down the road.
  2. Simply lock the critical piece of code with a with self.action_server.lock, self.lock: statement and leave it as it was before (I originally had it like this, but I was concerned it may be too long a time locked for action_server.py)
  3. Do completely away with self.lock and just use self.action_server.lock instead.

I reckon we should only take option 1. if you don't find any issues with the patch as it is now.

Option 2. is the easiest to implement and to be honest I don't think the performance penalty is going to be that significant.

Option 3. may seem stupid, but as the code stands whenever you have self.lock you end up having self.action_server.lock too (the only case where this is not explicit is on internal_goal_callback() and internal_preempt_callback() but I'm pretty sure those functions are called from within critical sections of action_server.py anyway). Option 3 has the advantage that deadlocks are no longer possible and makes the code more readable.

Which one should we go for? Or do you have any other ideas?

@dirk-thomas
Copy link
Member

I wrapped my head around it one more time. I still think that there are potential race conditions related to is_active and is_new_goal_available being called outside a lock. But the worst that should happen in this case should be a warning/error message on the console.

Therefore I think option 2 or 3 would be over-locking.

I have created an updated pull request which mainly adds comments and removes the useless shall_run variable`. Does that still look good to you?

@dirk-thomas
Copy link
Member

It would be very valuable to have your deadlock test as an actual unit test in actionlib. Could you convert it into a unittest using GTest to ensure that this will not break again in the future?

@miguelsdc
Copy link
Contributor

The updated pull request looks great! Indeed, the comments are more useful and the code in executeLoop() is clearer.

I think it may be a good idea to cherry pick the patch in the fuerte, groovy and hydro branches. (All distros currently share the same simple_action_server.py file after all). Even more so since fuerte's EOL is fast approaching. What do you think?

I'll convert deadlock test into an actual unit test (but using python's unittest if that's okay) as create a new pull request for it shortly.

@dirk-thomas
Copy link
Member

Great, thank you.

@miguelsdc
Copy link
Contributor

Just added a new pull request with the proposed test (see description in #8).

@ghost ghost assigned dirk-thomas Aug 5, 2013
@dirk-thomas
Copy link
Member

Resolved with #6 and covered by test in #9.

severin-lemaignan referenced this issue in severin-lemaignan/robotpkg Aug 18, 2014
Changes since 1.9.12:

1.11.2 (2014-05-20)
-------------------
* Update python publishers to define queue_size.
* Use the correct queue for processing MessageEvents
* Contributors: Esteve Fernandez, Michael Ferguson, Nican

1.11.1 (2014-05-08)
-------------------
* Fix uninitialised execute_thread_ member pointer
* Make rostest in CMakeLists optional
* Use catkin_install_python() to install Python scripts
* Contributors: Dirk Thomas, Esteve Fernandez, Jordi Pages, Lukas Bulwahn

1.11.0 (2014-02-13)
-------------------
* replace usage of __connection_header with MessageEvent (`#20
<https://github.com/ros/actionlib/issues/20>`_)

1.10.3 (2013-08-27)
-------------------
* Merged pull request `#15 <https://github.com/ros/actionlib/issues/15>`_
  Fixes a compile issue for actionlib headers on OS X

1.10.2 (2013-08-21)
-------------------
* separating ActionServer implementation into base class and
ros-publisher-based class (`#11 <https://github.com/ros/actionlib/issues/11>`_)
* support CATKIN_ENABLE_TESTING
* add isValid to ServerGoalHandle (`#14
<https://github.com/ros/actionlib/issues/14>`_)
* make operators const (`#10 <https://github.com/ros/actionlib/issues/10>`_)
* add counting of connections to avoid reconnect problem when callbacks are
invoked in different order (`#7 <https://github.com/ros/actionlib/issues/7>`_)
* fix deadlock in simple_action_server.py (`#4
<https://github.com/ros/actionlib/issues/4>`_)
* fix missing runtime destination for library (`#3
<https://github.com/ros/actionlib/issues/3>`_)

1.10.1 (2013-06-06)
-------------------
* fix location of library before installation (`#1
<https://github.com/ros/actionlib/issues/1>`_)

1.10.0 (2013-04-11)
-------------------
* define DEPRECATED only if not defined already
* modified dependency type of catkin to buildtool
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants