Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use recursive_mutex to allow callbacks to reenter #123

Open
wants to merge 1 commit into
base: rolling
Choose a base branch
from

Conversation

emersonknapp
Copy link
Contributor

Test added in ros2/rcl#1081 hangs without this change, because the mutex is held to call the user callback, so if the user then tries to do just about any RMW API call within the callback, there is a reentrant deadlock. It's probably not a great idea to do this workflow, but the API documentation does not discourage it, and it probably shouldn't cause a deadlock. It should either work, or return an error. This change makes it work.

Signed-off-by: Emerson Knapp <emerson.b.knapp@gmail.com>
@emersonknapp emersonknapp changed the title Use recursive_mutex to allow callbacks to reenter [WIP] Use recursive_mutex to allow callbacks to reenter Jun 22, 2023
@emersonknapp emersonknapp marked this pull request as ready for review July 10, 2023 21:05
@emersonknapp emersonknapp changed the title [WIP] Use recursive_mutex to allow callbacks to reenter Use recursive_mutex to allow callbacks to reenter Jul 10, 2023
@asorbini
Copy link
Collaborator

Thank you for this contribution @emersonknapp, looks good to me.

It makes sense that the service_callback would be allowed to invoke (at least some) APIs on the service. As you point out, it's a bit of a "risky workflow" to enable (because of potential future synchronization nightmares...), and we might have to think through some of the corner cases that could arise from allowing these re-entrant API calls.

I reviewed the deadlock with your new test, and as I suspected it happens on Connext's receive thread. For "documentation purposes", here's the backtrace:

0  futex_wait (private=0, expected=2, futex_word=0x555e16a67cb8) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x555e16a67cb8, private=0) at ./nptl/lowlevellock.c:49
#2  0x00007f65ae3a7082 in lll_mutex_lock_optimized (mutex=0x555e16a67cb8) at ./nptl/pthread_mutex_lock.c:48
#3  ___pthread_mutex_lock (mutex=0x555e16a67cb8) at ./nptl/pthread_mutex_lock.c:93
#4  0x00007f65ae2843e7 in RMW_Connext_SubscriberStatusCondition::set_data_available(bool) () from /workspace/install/rmw_connextdds_common/lib/librmw_connextdds_common_pro.so
#5  0x00007f65ae27a6f9 in RMW_Connext_Subscriber::take_next(void**, rmw_message_info_s*, unsigned long, unsigned long*, bool, PRESInstanceHandle const*) ()
   from /workspace/install/rmw_connextdds_common/lib/librmw_connextdds_common_pro.so
#6  0x00007f65ae27b06a in RMW_Connext_Subscriber::take_message(void*, rmw_message_info_s*, bool*, PRESInstanceHandle const*) ()
   from /workspace/install/rmw_connextdds_common/lib/librmw_connextdds_common_pro.so
#7  0x00007f65ae27df63 in RMW_Connext_Service::take_request(rmw_service_info_s*, void*, bool*) () from /workspace/install/rmw_connextdds_common/lib/librmw_connextdds_common_pro.so
#8  0x00007f65ae92bdc2 in rcl_take_request_with_info () from /workspace-test/install/rcl/lib/librcl.so
#9  0x00007f65ae92c196 in rcl_take_request () from /workspace-test/install/rcl/lib/librcl.so
#10 0x0000555e15ba4323 in service_callback(void const*, unsigned long) ()
#11 0x00007f65ae287666 in RMW_Connext_SubscriberStatusCondition::notify_new_data() () from /workspace/install/rmw_connextdds_common/lib/librmw_connextdds_common_pro.so
#12 0x00007f65ae28772d in RMW_Connext_DataReaderListener_on_data_available(void*, DDS_DataReaderImpl*) ()
   from /workspace/install/rmw_connextdds_common/lib/librmw_connextdds_common_pro.so
#13 0x00007f65ada3f945 in DDS_DataReaderListener_forward_onDataAvailable () from /opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0/libnddsc.so
#14 0x00007f65ada48117 in DDS_DataReader_impl_forward_onDataAvailable () from /opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0/libnddsc.so
#15 0x00007f65ad016a22 in PRESPsService_readerNotifyOfReaderQueueChanges () from /opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0/libnddscore.so
#16 0x00007f65ad018751 in PRESPsService_readerSampleListenerOnNewData () from /opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0/libnddscore.so
#17 0x00007f65ad14da0f in COMMENDSrReaderService_onSubmessage () from /opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0/libnddscore.so
#18 0x00007f65ad19873d in MIGInterpreter_parse () from /opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0/libnddscore.so
#19 0x00007f65ad126973 in COMMENDActiveFacadeReceiver_loop () from /opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0/libnddscore.so
#20 0x00007f65ad2ce465 in RTIOsapiThreadChild_onSpawned () from /opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0/libnddscore.so
#21 0x00007f65ae3a3b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#22 0x00007f65ae434bb4 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100

The deadlock happens because of the call to RMW_Connext_SubscriberStatusCondition::set_data_available() inside a call to RMW_Connext_SubscriberStatusCondition::notify_new_data().

I think the PR can be merged with green CI (just to be extra careful for regressions, since the new test to exercise it hasn't been merged yet):

  • Linux: Build Status
  • Windows: Build Status

@asorbini
Copy link
Collaborator

I have to unfortunately retract some of my previous comment after having investigated the changes a bit further.

I tried to verify the fix using the new test from ros2/rcl#1081 but the test was still running into a similar deadlock inside RMW_Connext_SubscriberStatusCondition::set_data_available(), but this time caused by RMW_Connext_WaitSet::mutex_internal (inside RMW_Connext_Condition::update_state called from from here).

Unfortunately this mutex cannot be made recursive because it's used to protect a condition variable.

I tried to make that culprit lock_guard optional by detecting a "recursive" lock from within a "notification", but I ended up discovering another, more fundamental and much harder to work around deadlock because it pertains the Connext DataReader's internal critical section.

The deadlock occurs while the main() thread is inside wait_for_service_to_be_ready() and a Connext receive thread is inside service_callback(), calling rcl_take_request().

The main() thread attaches the service to a waitset, which eventually leads to the following stack trace, where the thread owns RMW_Connext_Subscriber::loan_mutex, and it is blocked waiting to enter the DataReader's critical section:

#4  0x00007fb1b90d1139 in RTIOsapiSemaphore_take () from /opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0/libnddscore.so
#5  0x00007fb1b90b1ace in REDAWorker_enterExclusiveArea () from /opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0/libnddscore.so
#6  0x00007fb1b90b4465 in REDACursor_modifyReadWriteArea () from /opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0/libnddscore.so
#7  0x00007fb1b8df8d21 in PRESPsReader_readOrTakeInstanceI () from /opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0/libnddscore.so
#8  0x00007fb1b8df982c in PRESPsReader_takeInstance () from /opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0/libnddscore.so
#9  0x00007fb1b984c4d3 in DDS_DataReader_read_or_take_instance_untypedI () from /opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0/libnddsc.so
#10 0x00007fb1b9fd549d in rmw_connextdds_take_samples(RMW_Connext_Subscriber*) () from /workspace/install/rmw_connextdds_common/lib/librmw_connextdds_common_pro.so
#11 0x00007fb1b9f9f4f8 in RMW_Connext_Subscriber::loan_messages(bool) () from /workspace/install/rmw_connextdds_common/lib/librmw_connextdds_common_pro.so
#12 0x00007fb1b9faba17 in RMW_Connext_WaitSet::detach(rmw_subscriptions_s*, rmw_guard_conditions_s*, rmw_services_s*, rmw_clients_s*, rmw_events_s*, unsigned long&) ()
   from /workspace/install/rmw_connextdds_common/lib/librmw_connextdds_common_pro.so
#13 0x00007fb1b9fac22f in RMW_Connext_WaitSet::wait(rmw_subscriptions_s*, rmw_guard_conditions_s*, rmw_services_s*, rmw_clients_s*, rmw_events_s*, rmw_time_s const*) ()
   from /workspace/install/rmw_connextdds_common/lib/librmw_connextdds_common_pro.so
#14 0x00007fb1ba65d94f in rcl_wait () from /workspace-test/install/rcl/lib/librcl.so

Meanwhile, the Connext receive thread own the DataReader's critical section (to notify of new data available), and it is blocked inside service_callback, trying to acquire RMW_Connext_Subscriber::loan_mutex:

#4  0x00007fb1b9fa06db in RMW_Connext_Subscriber::take_next(void**, rmw_message_info_s*, unsigned long, unsigned long*, bool, PRESInstanceHandle const*) ()
   from /workspace/install/rmw_connextdds_common/lib/librmw_connextdds_common_pro.so
#5  0x00007fb1b9fa10aa in RMW_Connext_Subscriber::take_message(void*, rmw_message_info_s*, bool*, PRESInstanceHandle const*) ()
   from /workspace/install/rmw_connextdds_common/lib/librmw_connextdds_common_pro.so
#6  0x00007fb1b9fa3fa3 in RMW_Connext_Service::take_request(rmw_service_info_s*, void*, bool*) () from /workspace/install/rmw_connextdds_common/lib/librmw_connextdds_common_pro.so
#7  0x00007fb1ba651dc2 in rcl_take_request_with_info () from /workspace-test/install/rcl/lib/librcl.so
#8  0x00007fb1ba652196 in rcl_take_request () from /workspace-test/install/rcl/lib/librcl.so
#9  0x000055a082a8c323 in service_callback(void const*, unsigned long) ()
#10 0x00007fb1b9fad7a7 in RMW_Connext_SubscriberStatusCondition::notify_new_data() () from /workspace/install/rmw_connextdds_common/lib/librmw_connextdds_common_pro.so
#11 0x00007fb1b983f945 in DDS_DataReaderListener_forward_onDataAvailable () from /opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0/libnddsc.so
#12 0x00007fb1b9848117 in DDS_DataReader_impl_forward_onDataAvailable () from /opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0/libnddsc.so
#13 0x00007fb1b8e16a22 in PRESPsService_readerNotifyOfReaderQueueChanges () from /opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0/libnddscore.so
#14 0x00007fb1b8e18751 in PRESPsService_readerSampleListenerOnNewData () from /opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0/libnddscore.so
#15 0x00007fb1b8f4da0f in COMMENDSrReaderService_onSubmessage () from /opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0/libnddscore.so
#16 0x00007fb1b8f9873d in MIGInterpreter_parse () from /opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0/libnddscore.so
#17 0x00007fb1b8f26973 in COMMENDActiveFacadeReceiver_loop () from /opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0/libnddscore.so
#18 0x00007fb1b90ce465 in RTIOsapiThreadChild_onSpawned () from /opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0/libnddscore.so

Because of this discovery, I think allowing the call to rcl_take_request to occur from within a callback cannot be easily supported without some major refactoring of the current implementation (specifically, but likely not limited to, how samples are pulled from the DDS DataReader's history and notified to the upper layers).

@spiderkeys
Copy link
Collaborator

spiderkeys commented Oct 6, 2023

@emersonknapp not being familiar with direct usage of the RCL API, how does this affect/restrict typical user interactions with services when using rclcpp? Is the failure mode described here something that is encountered by rcl/rclcpp/some RMW layer in their internal code paths (and thus an issue for anyone who generally uses services), or is this only an issue if you specifically exercise the sequence of calls within your usercode?

so if the user then tries to do just about any RMW API call within the callback, there is a reentrant deadlock.

From this it sounds like code like this would be safe, as long as no other RMW calls are made within the callback lambda?

  trigger_service_ =
    this->create_service<my_msgs::srv::TriggerAction>(
      topics::TRIGGER_SERVICE, 
      [this](const std::shared_ptr<my_msgs::srv::TriggerAction::Request> request,
            std::shared_ptr<my_msgs::srv::TriggerAction::Response> response) 
      {
           // Safe if no RMW calls in this scope?
           
           // Set some response field and return...
           response->response.result.message = "hello";
      },
      1);

@emersonknapp
Copy link
Contributor Author

It means you can't publish messages within a service callback, or so on. That's not a recommended workflow for various reasons, but it should probably not deadlock forever.

I had been working on an internal service for rclcpp::Node (around here that had to do some unusual mixed-rcl/rclcpp calls, but ended up implementing a different way so this stopped blocking my development there.

It just brought this up as a potential concern. Maybe recursive_mutex isn't the fully correct answer, it just definitely isn't correct deadlocking.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants