executor could take more than once incorrectly #383
Conversation
@wjwwood and @mikaelarguedas This is a fix, I don't know if it's the one that we want. As described above, the bug occurs if [...]. The timer is added to the set in [...]. The alternative would be to include this functionality in the timer itself, but I'm not sure if that would be a violation of encapsulation?
I spoke with @mjcarroll off-line and we discussed putting the "tracking" information in the [...].
One CMake thing I think needs to be fixed, and me not understanding some code in another comment.
```
@@ -0,0 +1,91 @@
// Copyright 2017 Open Source Robotics Foundation, Inc.
```
2018
```
@@ -387,6 +387,14 @@ if(BUILD_TESTING)
    "rcl")
  target_link_libraries(test_utilities ${PROJECT_NAME})
endif()

ament_add_gtest(test_multi_threaded_executor test/executors/test_multi_threaded_executor.cpp
```
Does this need to be wrapped in `if(UNIX)` due to `sched_setscheduler()` in the test?
I added an `#ifdef` around the corresponding `sched_setscheduler()` call, because the bug originally manifested on OSX and we wanted to be able to test there. The scheduler trick also isn't necessary on OSX, because the bug shows up with the default scheduler.
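For illustration, a minimal sketch of such a guard (an assumption, not the exact test code; the `__linux__` check and the `SCHED_BATCH` policy are illustrative choices):

```cpp
#ifdef __linux__
#include <sched.h>
#include <cstdio>
#endif

// Try to switch to a batch scheduling policy on Linux; elsewhere this is a
// no-op and the default scheduler is used.
void use_batch_scheduler_if_available()
{
#ifdef __linux__
  struct sched_param param;
  param.sched_priority = 0;  // SCHED_BATCH requires a static priority of 0
  if (sched_setscheduler(0, SCHED_BATCH, &param) != 0) {
    std::perror("sched_setscheduler");  // non-fatal: keep the default scheduler
  }
#endif
}
```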
```
double diff = labs((now - last).nanoseconds()) / 1.0e9;
last = now;

if (diff < 0.009 || diff > 0.011) {
```
Recommend replacing the numbers with something like `diff < PERIOD - TOLERANCE || diff > PERIOD + TOLERANCE` to make it clear where they come from.
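A sketch of that suggestion, mirroring the check in the test snippet above (`PERIOD` and `TOLERANCE` are illustrative names and values, not taken from the PR):

```cpp
constexpr double PERIOD = 0.01;      // expected timer period in seconds (10 ms)
constexpr double TOLERANCE = 0.001;  // allowed jitter in seconds (1 ms)

// True when the measured interval between two timer callbacks falls inside
// the tolerated window around the expected period.
bool fired_on_schedule(double diff_seconds)
{
  return diff_seconds >= PERIOD - TOLERANCE && diff_seconds <= PERIOD + TOLERANCE;
}
```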
```
}

rclcpp::executors::MultiThreadedExecutor executor(
  rclcpp::executor::create_default_executor_arguments(), 0, true);
```
Personal preference is `yield_before_execute = true;` and then passing that in, but I'll approve either way.
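That is, something along these lines (only the named variable is new; the constructor call mirrors the diff above):

```cpp
bool yield_before_execute = true;  // name the flag so the call site documents itself
rclcpp::executors::MultiThreadedExecutor executor(
  rclcpp::executor::create_default_executor_arguments(), 0, yield_before_execute);
```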
```
{
  std::lock_guard<std::mutex> lock(scheduled_mutex_);
  auto it = scheduled_.find(any_exec);
  if (it != scheduled_.end()) {
```
When is an executable not found in `scheduled_`? Is it a case worth logging a warning?
It should never happen, afaict.
```
  continue;
}
{
  std::lock_guard<std::mutex> lock(scheduled_mutex_);
  if (scheduled_.count(any_exec) != 0) {
```
I'm a bit confused by this line. AFAIK `count()` uses `operator==()` of `any_exec`, which is `std::shared_ptr<executor::AnyExecutable>`. The docs say that compares the address of the pointer held by the `shared_ptr`. The address is of a newly default-constructed object in the `std::make_shared()` above. Won't this `count()` always return 0?
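For what it's worth, a small self-contained sketch of how `std::set` lookups of a `shared_ptr` behave (`AnyExecutable` here is a stand-in struct, not the rclcpp class): the same `shared_ptr` instance that was inserted is found again, while a different `shared_ptr` to a distinct object is not.

```cpp
#include <cassert>
#include <memory>
#include <set>

struct AnyExecutable {};  // stand-in for executor::AnyExecutable

int main()
{
  std::set<std::shared_ptr<AnyExecutable>> scheduled_;

  auto any_exec = std::make_shared<AnyExecutable>();
  scheduled_.insert(any_exec);

  // Looking up the very same shared_ptr compares the stored pointer values,
  // so the element is found.
  assert(scheduled_.count(any_exec) == 1);

  // A different shared_ptr owning a distinct object is not found, even
  // though the pointee type is the same.
  auto other = std::make_shared<AnyExecutable>();
  assert(scheduled_.count(other) == 0);
  return 0;
}
```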
lgtm
@dhood is going to give a review as well.
Otherwise it seemed to me like it would yield twice.
@sloretz pointed this issue out to me when he was working on the Python executor. This is an issue that only occurs when the multi-threaded executor and reentrant callback groups are used together.
If you imagine each thread in the multi-threaded executor doing this:

- wait for work to be ready
- take or claim the work to be done (`rcl_take` for subscribers, call for timers)
- execute the work

If a context switch happens directly after releasing the lock, it's possible another thread can take the same work. This is because of how the "wait for work" part works, which is to take a lock, figure out the next piece of ready work, and release the lock before that work is actually taken or claimed.

So if all of this happens in the other thread again, before the original thread can take or claim the work to be done, then a second thread would try to take or claim the same work.
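To make the window concrete, here is a schematic, self-contained sketch (a simplified model, not rclcpp code; the names and the explicit yield are only there to make the bad interleaving easy to hit):

```cpp
#include <atomic>
#include <cstdio>
#include <mutex>
#include <thread>

std::mutex wait_mutex;           // stands in for the executor's multi-threading lock
bool timer_ready = true;         // "wait for work" result; stays true until the timer is called
std::atomic<int> call_count{0};

void worker()
{
  bool should_call = false;
  {
    std::lock_guard<std::mutex> lock(wait_mutex);
    should_call = timer_ready;   // readiness is checked under the lock ...
  }
  // ... but the lock is released before the timer is actually called, so a
  // context switch here lets another thread observe timer_ready == true too.
  std::this_thread::yield();
  if (should_call) {
    std::lock_guard<std::mutex> lock(wait_mutex);
    timer_ready = false;         // "call" the timer
    ++call_count;
  }
}

int main()
{
  std::thread t1(worker);
  std::thread t2(worker);
  t1.join();
  t2.join();
  // With an unlucky (or yielded) interleaving this prints 2: both threads
  // decided to call the same timer.
  std::printf("timer called %d times\n", call_count.load());
  return 0;
}
```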
For subscriptions and clients/servers of services this is protected by some form of `rcl_take`, which can fail if called twice on the same subscription when only one piece of data is available. However, timers could be called multiple times, since deciding whether a timer should be called and actually calling it are decoupled.

In this PR I was able to expose the issue by adding an option for the multi-threaded executor to yield after releasing the lock, and by using a test which has a reentrant callback group and a timer. Reproducing it with a subscription would be a good bit harder, I think, due to the asynchronous nature of the process.
I'm opening this as a work in progress for visibility, but I'm on the right track to having a reproducible test failure.
For the fix, which has not been made yet, it will require taking or claiming the work to be done inside of the multithreading lock, which might require a change to the executor API.
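For reference, a sketch of the claiming scheme the diff snippets above appear to implement. `scheduled_` and `scheduled_mutex_` come from the reviewed code; the loop skeleton, `get_next_executable()`, and `execute_any_executable()` are illustrative assumptions, not the actual executor API in this PR:

```cpp
// Illustrative worker loop: claim the executable under a lock before running
// it, skip it if another thread already claimed it, and drop the claim once
// the work is done.
while (spinning.load()) {
  auto any_exec = get_next_executable();   // assumed to return a shared_ptr
  if (!any_exec) {
    continue;
  }
  {
    std::lock_guard<std::mutex> lock(scheduled_mutex_);
    if (scheduled_.count(any_exec) != 0) {
      continue;                            // another thread already claimed this work
    }
    scheduled_.insert(any_exec);           // claim it before releasing the lock
  }
  execute_any_executable(any_exec);
  {
    std::lock_guard<std::mutex> lock(scheduled_mutex_);
    auto it = scheduled_.find(any_exec);
    if (it != scheduled_.end()) {
      scheduled_.erase(it);                // release the claim after execution
    }
  }
}
```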