Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

executor could take more than once incorrectly #383

Merged
merged 7 commits into from
Jun 19, 2018

Conversation

wjwwood
Copy link
Member

@wjwwood wjwwood commented Oct 4, 2017

@sloretz pointed this issue out to me when he was working on the Python executor. This is an issue that only occurs when the multi-threaded executor and reentrant callback groups are used together.

If you imagine each thread in the multi-threaded executor doing this:

  • acquire lock
  • wait for work
  • release lock
  • take (or "claim") work (rcl_take for subscribers, call for timers)
  • pass work to user callback
  • loop

If a context switch happens directly after releasing the lock, it's possible another thread can take the same work. This is because of how the "wait for work" part works, which is to:

  • for each callback group, if it can be taken from, loop over its items
    • items include subscriptions, timers, and client/servers for services
  • add each item to the wait set
  • wait
  • return the next thing that is ready when wait wakes up

So if all of this happens in the other thread again, before the original thread can take or claim the work to be done, then a second thread would try to take or claim the same work.

For subscriptions and clients/servers of services this is protected by some form of rcl_take which can fail if called twice on the same subscription and only one piece of data is available. However timers could be called multiple times, since whether or not a timer should be called and calling it are decoupled.

In this pr I was able to expose this issue by adding an option for the multithreaded executor to yield after releasing the lock and using a test which has a reentrant callback group and timer. Reproducing it with a subscription would be a good bit harder I think due to the asynchronous nature of the process.

I'm opening this as a work in progress for visibility, but I'm on the right track to having a reproducible test failure.

For the fix, which has not been made yet, it will require taking or claiming the work to be done inside of the multithreading lock, which might require a change to the executor API.

@wjwwood wjwwood added in progress Actively being worked on (Kanban column) bug Something isn't working labels Oct 4, 2017
@mikaelarguedas mikaelarguedas added ready Work is about to start (Kanban column) and removed in progress Actively being worked on (Kanban column) labels Oct 26, 2017
@mjcarroll mjcarroll self-assigned this Mar 27, 2018
@mjcarroll mjcarroll added in progress Actively being worked on (Kanban column) and removed ready Work is about to start (Kanban column) labels Mar 27, 2018
@mjcarroll mjcarroll force-pushed the fix_executor_extra_take branch 2 times, most recently from a3a15ca to 1488317 Compare March 28, 2018 17:43
@mjcarroll mjcarroll changed the title executor could take more than once incorrectly wip: executor could take more than once incorrectly Mar 28, 2018
@mjcarroll mjcarroll changed the title wip: executor could take more than once incorrectly executor could take more than once incorrectly Mar 28, 2018
@mjcarroll mjcarroll force-pushed the fix_executor_extra_take branch from 1488317 to c94c152 Compare March 28, 2018 18:13
@mjcarroll
Copy link
Member

@wjwwood and @mikaelarguedas This is a fix, I don't know if it's the one that we want.

As described above, the bug occurs if Executor::get_next_timer gets called multiple times before the timer actually gets executed in Executor::execute_any_executable. To prevent this, I've added a set of timers to track if a timer is currently 'in-flight' or scheduled for execution.

The timer is added to the set in get_next_timer and removed in execute_any_executable.

The alternative would be to include this functionality in the timer itself, but I'm not sure if that would be a violation of encapsulation?

@wjwwood
Copy link
Member Author

wjwwood commented Mar 28, 2018

I spoke with @mjcarroll off-line and we discussed putting the "tracking" information in the MultiThreadedExecutor class since it is not needed in the SingleThreadedExecutor, and also because future executors might want to make similar, but different constraints on what should and should not be returned by the get_next_executable() function.

@mjcarroll mjcarroll force-pushed the fix_executor_extra_take branch from 73c408b to 071e658 Compare May 2, 2018 15:00
@mjcarroll mjcarroll force-pushed the fix_executor_extra_take branch from 071e658 to 38c48d1 Compare June 19, 2018 00:49
@mjcarroll
Copy link
Member

Shallow CI:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

Copy link
Contributor

@sloretz sloretz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One cmake think I think needs to be fixed, and me not understanding some code in another comment.

@@ -0,0 +1,91 @@
// Copyright 2017 Open Source Robotics Foundation, Inc.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2018

@@ -387,6 +387,14 @@ if(BUILD_TESTING)
"rcl")
target_link_libraries(test_utilities ${PROJECT_NAME})
endif()

ament_add_gtest(test_multi_threaded_executor test/executors/test_multi_threaded_executor.cpp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be wrapped in if(UNIX) due to sched_setscheduler() in the test?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added an #ifdef around the corresponding sched_setscheduler(), because the bug originally manifested on OSX, so we wanted to test against that. That trick additionally isn't necessary on OSX because the bug shows up with the default scheduler.

double diff = labs((now - last).nanoseconds()) / 1.0e9;
last = now;

if (diff < 0.009 || diff > 0.011) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommend replacing numbers with something likediff < PERIOD - TOLERANCE || diff > PERIOD + TOLERANCE to make it clear where they come from.

}

rclcpp::executors::MultiThreadedExecutor executor(
rclcpp::executor::create_default_executor_arguments(), 0, true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personal preference is yield_before_execute = true; then passing that in. I'll approve either way.

{
std::lock_guard<std::mutex> lock(scheduled_mutex_);
auto it = scheduled_.find(any_exec);
if (it != scheduled_.end()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When is an executable not found in scheduled_? Is it a case worth logging a warning?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should never happen, afaict.

continue;
}
{
std::lock_guard<std::mutex> lock(scheduled_mutex_);
if (scheduled_.count(any_exec) != 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused by this line. AFAIK count() uses operator==() of any_exec which is std::shared_ptr<executor::AnyExecutable>. The docs say that compares the address of the pointer held by the shared_ptr (link here). The address is of a newly default constructed object in std::make_shared() above. Won't this count() always return 0?

@mjcarroll
Copy link
Member

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@mjcarroll
Copy link
Member

Now with expanded test tolerance:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

Copy link
Member Author

@wjwwood wjwwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@wjwwood
Copy link
Member Author

wjwwood commented Jun 19, 2018

@dhood is going to give a review as well.

@mjcarroll mjcarroll added in review Waiting for review (Kanban column) and removed in progress Actively being worked on (Kanban column) labels Jun 19, 2018
Otherwise it seemed to me like it would yield twice.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants