Use recursive_mutex to allow callbacks to reenter #123
base: rolling
Signed-off-by: Emerson Knapp <emerson.b.knapp@gmail.com>
Thank you for this contribution @emersonknapp, looks good to me. It makes sense that the I reviewed the deadlock with your new test, and as I suspected it happens on Connext's receive thread. For "documentation purposes", here's the backtrace:
The deadlock happens because of the call to I think the PR can be merged with green CI (just to be extra careful for regressions, since the new test to exercise it hasn't been merged yet):
I have to unfortunately retract some of my previous comment after having investigated the changes a bit further. I tried to verify the fix using the new test from ros2/rcl#1081 but the test was still running into a similar deadlock inside Unfortunately this mutex cannot be made recursive because it's used to protect a condition variable. I tried to make that culprit The deadlock occurs while the The
#4 0x00007fb1b90d1139 in RTIOsapiSemaphore_take () from /opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0/libnddscore.so
#5 0x00007fb1b90b1ace in REDAWorker_enterExclusiveArea () from /opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0/libnddscore.so
#6 0x00007fb1b90b4465 in REDACursor_modifyReadWriteArea () from /opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0/libnddscore.so
#7 0x00007fb1b8df8d21 in PRESPsReader_readOrTakeInstanceI () from /opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0/libnddscore.so
#8 0x00007fb1b8df982c in PRESPsReader_takeInstance () from /opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0/libnddscore.so
#9 0x00007fb1b984c4d3 in DDS_DataReader_read_or_take_instance_untypedI () from /opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0/libnddsc.so
#10 0x00007fb1b9fd549d in rmw_connextdds_take_samples(RMW_Connext_Subscriber*) () from /workspace/install/rmw_connextdds_common/lib/librmw_connextdds_common_pro.so
#11 0x00007fb1b9f9f4f8 in RMW_Connext_Subscriber::loan_messages(bool) () from /workspace/install/rmw_connextdds_common/lib/librmw_connextdds_common_pro.so
#12 0x00007fb1b9faba17 in RMW_Connext_WaitSet::detach(rmw_subscriptions_s*, rmw_guard_conditions_s*, rmw_services_s*, rmw_clients_s*, rmw_events_s*, unsigned long&) ()
from /workspace/install/rmw_connextdds_common/lib/librmw_connextdds_common_pro.so
#13 0x00007fb1b9fac22f in RMW_Connext_WaitSet::wait(rmw_subscriptions_s*, rmw_guard_conditions_s*, rmw_services_s*, rmw_clients_s*, rmw_events_s*, rmw_time_s const*) ()
from /workspace/install/rmw_connextdds_common/lib/librmw_connextdds_common_pro.so
#14 0x00007fb1ba65d94f in rcl_wait () from /workspace-test/install/rcl/lib/librcl.so
Meanwhile, the Connext receive thread owns the DataReader's critical section (to notify of new data available), and it is blocked inside
#4 0x00007fb1b9fa06db in RMW_Connext_Subscriber::take_next(void**, rmw_message_info_s*, unsigned long, unsigned long*, bool, PRESInstanceHandle const*) ()
from /workspace/install/rmw_connextdds_common/lib/librmw_connextdds_common_pro.so
#5 0x00007fb1b9fa10aa in RMW_Connext_Subscriber::take_message(void*, rmw_message_info_s*, bool*, PRESInstanceHandle const*) ()
from /workspace/install/rmw_connextdds_common/lib/librmw_connextdds_common_pro.so
#6 0x00007fb1b9fa3fa3 in RMW_Connext_Service::take_request(rmw_service_info_s*, void*, bool*) () from /workspace/install/rmw_connextdds_common/lib/librmw_connextdds_common_pro.so
#7 0x00007fb1ba651dc2 in rcl_take_request_with_info () from /workspace-test/install/rcl/lib/librcl.so
#8 0x00007fb1ba652196 in rcl_take_request () from /workspace-test/install/rcl/lib/librcl.so
#9 0x000055a082a8c323 in service_callback(void const*, unsigned long) ()
#10 0x00007fb1b9fad7a7 in RMW_Connext_SubscriberStatusCondition::notify_new_data() () from /workspace/install/rmw_connextdds_common/lib/librmw_connextdds_common_pro.so
#11 0x00007fb1b983f945 in DDS_DataReaderListener_forward_onDataAvailable () from /opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0/libnddsc.so
#12 0x00007fb1b9848117 in DDS_DataReader_impl_forward_onDataAvailable () from /opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0/libnddsc.so
#13 0x00007fb1b8e16a22 in PRESPsService_readerNotifyOfReaderQueueChanges () from /opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0/libnddscore.so
#14 0x00007fb1b8e18751 in PRESPsService_readerSampleListenerOnNewData () from /opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0/libnddscore.so
#15 0x00007fb1b8f4da0f in COMMENDSrReaderService_onSubmessage () from /opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0/libnddscore.so
#16 0x00007fb1b8f9873d in MIGInterpreter_parse () from /opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0/libnddscore.so
#17 0x00007fb1b8f26973 in COMMENDActiveFacadeReceiver_loop () from /opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0/libnddscore.so
#18 0x00007fb1b90ce465 in RTIOsapiThreadChild_onSpawned () from /opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0/libnddscore.so
Because of this discovery, I think allowing the call to
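For readers following the two backtraces, the pattern they suggest is a lock-order inversion: one thread holds an RMW-side lock and waits for the DataReader's exclusive area, while the receive thread holds that exclusive area and waits on the RMW side. The sketch below is a generic C++ illustration of that cycle, not code from rmw_connextdds. The names `rmw_mutex` and `reader_ea` are hypothetical stand-ins, and the assumption that the wait thread holds an RMW-level lock at that point is mine. It uses `try_lock` so the cycle is reported rather than hanging.

```cpp
#include <atomic>
#include <functional>
#include <mutex>
#include <thread>

// Outcome of the demonstration: whether each thread managed to acquire the
// lock that the other thread was already holding.
struct InversionResult {
  bool wait_thread_got_second;
  bool recv_thread_got_second;
};

InversionResult demonstrate_lock_order_inversion() {
  std::mutex rmw_mutex;   // hypothetical: lock held on the rcl_wait() path
  std::mutex reader_ea;   // hypothetical: the DataReader's exclusive area
  std::atomic<int> holding{0};
  std::atomic<int> tried{0};
  InversionResult result{true, true};

  auto worker = [&](std::mutex & first, std::mutex & second, bool & got_second) {
    std::lock_guard<std::mutex> hold(first);
    ++holding;
    // Wait until both threads hold their first lock.
    while (holding.load() < 2) {
      std::this_thread::yield();
    }
    // With lock() this would block forever; try_lock() lets us observe it.
    got_second = second.try_lock();
    if (got_second) {
      second.unlock();
    }
    ++tried;
    // Keep holding `first` until the other thread has made its attempt.
    while (tried.load() < 2) {
      std::this_thread::yield();
    }
  };

  // Thread A: RMW lock first, then the reader's exclusive area.
  std::thread wait_thread(
    worker, std::ref(rmw_mutex), std::ref(reader_ea),
    std::ref(result.wait_thread_got_second));
  // Thread B: the reader's exclusive area first, then the RMW lock.
  std::thread recv_thread(
    worker, std::ref(reader_ea), std::ref(rmw_mutex),
    std::ref(result.recv_thread_got_second));
  wait_thread.join();
  recv_thread.join();
  return result;
}
```

Neither thread can acquire its second lock while the other holds it, which is why making only one of the two locks recursive would not break a cycle of this shape.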
@emersonknapp I'm not familiar with direct usage of the RCL API: how does this affect or restrict typical user interactions with services when using rclcpp? Is the failure mode described here something that rcl, rclcpp, or some RMW layer encounters in its internal code paths (and thus an issue for anyone who generally uses services), or is it only an issue if you specifically exercise this sequence of calls within your user code?
From this it sounds like code like this would be safe, as long as no other RMW calls are made within the callback lambda?
trigger_service_ =
  this->create_service<my_msgs::srv::TriggerAction>(
  topics::TRIGGER_SERVICE,
  [this](const std::shared_ptr<my_msgs::srv::TriggerAction::Request> request,
  std::shared_ptr<my_msgs::srv::TriggerAction::Response> response)
  {
    // Safe if no RMW calls in this scope?
    // Set some response field and return...
    response->response.result.message = "hello";
  },
  1);
It means you can't, for example, publish messages within a service callback. That's not a recommended workflow for various reasons, but it should probably not deadlock forever. I had been working on an internal service for It just brought this up as a potential concern. Maybe
The test added in ros2/rcl#1081 hangs without this change because the mutex is held while the user callback is called, so if the user makes just about any RMW API call within the callback, there is a reentrant deadlock. This workflow is probably not a great idea, but the API documentation does not discourage it, and it probably shouldn't cause a deadlock. It should either work or return an error. This change makes it work.
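As a rough illustration of the reentrancy this change is meant to allow, here is a minimal, self-contained sketch. `Notifier`, `notify_new_data`, `take`, and `reentrant_take_succeeds` are hypothetical names chosen for the example, not the actual rmw_connextdds classes or functions. It only assumes the pattern described above: a lock is held while the user callback runs, and the callback calls back into an API that takes the same lock.

```cpp
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical stand-in for an object that invokes a user callback while
// holding its internal lock.
class Notifier {
public:
  void set_callback(std::function<void()> cb) {
    std::lock_guard<std::recursive_mutex> lock(mutex_);
    callback_ = std::move(cb);
  }

  // Called when new data arrives; the lock stays held during the callback.
  void notify_new_data() {
    std::lock_guard<std::recursive_mutex> lock(mutex_);
    ++unread_;
    if (callback_) {
      callback_();
    }
  }

  // An API call the user might make from inside the callback. It locks the
  // same mutex; with std::mutex this would be a self-deadlock (formally
  // undefined behavior), with std::recursive_mutex the owning thread may
  // lock again.
  bool take() {
    std::lock_guard<std::recursive_mutex> lock(mutex_);
    if (unread_ == 0) {
      return false;
    }
    --unread_;
    return true;
  }

private:
  std::recursive_mutex mutex_;
  std::function<void()> callback_;
  int unread_ = 0;
};

// Returns true if a take() issued from inside the callback completed.
bool reentrant_take_succeeds() {
  Notifier notifier;
  bool took = false;
  notifier.set_callback([&] { took = notifier.take(); });
  notifier.notify_new_data();
  return took;
}
```

A recursive mutex only helps when the same thread re-locks; it does nothing for a lock held by a different thread.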