This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Connext is very slow to shutdown #325

Closed
clalancette opened this issue Dec 14, 2018 · 7 comments · Fixed by #473
Assignees
Labels
enhancement New feature or request

Comments

@clalancette
Contributor

Bug report

Required Info:

  • Operating System:
    • Ubuntu 16.04 AMD64
  • Installation type:
    • Source
  • Version or commit hash:
  • DDS implementation:
    • RTI Connext
  • Client library (if applicable):
    • rclpy-ish

Steps to reproduce issue

RMW_IMPLEMENTATION=rmw_connext_cpp ros2 run demo_nodes_cpp talker
(hit Ctrl-C here)

Expected behavior

Talker starts up, publishes some data, then quickly goes away when the user hits Ctrl-C.

Actual behavior

Talker starts up, publishes some data, then takes at least 3 seconds (sometimes longer) to go away after the user hits Ctrl-C.

Additional information

This seems to get worse with the number of nodes in the process. For the composition demos, for instance, it almost seems like it takes 3-5 seconds for each node loaded into the process.
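For anyone who wants to quantify this, here is a minimal sketch (not part of the original report) that creates several nodes in one process and times the destruction of each one; it should make the per-node shutdown cost visible when run with RMW_IMPLEMENTATION=rmw_connext_cpp. The node names and the node count are arbitrary.

```cpp
#include <chrono>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

#include "rclcpp/rclcpp.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);

  // Create a handful of nodes in the same process, mirroring the composition
  // scenario described above.
  std::vector<std::shared_ptr<rclcpp::Node>> nodes;
  for (int i = 0; i < 5; ++i) {
    nodes.push_back(std::make_shared<rclcpp::Node>("timing_node_" + std::to_string(i)));
  }

  // Time how long destroying each node takes.
  while (!nodes.empty()) {
    const auto start = std::chrono::steady_clock::now();
    nodes.pop_back();  // drops the last shared_ptr, destroying the node
    const auto stop = std::chrono::steady_clock::now();
    std::cout << "node destruction took "
              << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()
              << " ms" << std::endl;
  }

  rclcpp::shutdown();
  return 0;
}
```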

@jwillemsen
Contributor

We have also seen this when testing with AXCIOMA; deleting a domain participant can take some time with RTI Connext DDS.

@clalancette clalancette added the enhancement New feature or request label Dec 19, 2018
@clalancette
Contributor Author

From @brawner:

The rclcpp unit tests take significantly longer for rmw_connext_cpp than for other middleware implementations.

Compare the Connext DDS rclcpp unit tests (11 minutes):
http://build.ros2.org/view/Rci/job/Rci__nightly-connext_ubuntu_focal_amd64/32/testReport/rclcpp/

vs FastRTPS (1 min 30 s):
http://build.ros2.org/view/Rci/job/Rci__nightly-fastrtps_ubuntu_focal_amd64/34/testReport/rclcpp/

I've narrowed this down to the destruction of the rclcpp::Node; a call that destroys a waitable takes the vast majority of this time. I believe this is happening in AllocatorMemoryStrategy, but since it uses shared_ptrs, the actual destruction may happen elsewhere.
If I put steady_clock measurements around this line, or even if I iterate through the vector and call reset explicitly, there is one waitable that takes a significant amount of time to delete. It isn't all waitables, just one default waitable whose origin I can't pin down.
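As a hedged illustration of the measurement described above: the vector name `waitable_handles` below is hypothetical and stands in for whatever container AllocatorMemoryStrategy uses to hold its waitables; the point is only the steady_clock timing around each explicit reset.

```cpp
#include <chrono>
#include <iostream>
#include <memory>
#include <vector>

#include "rclcpp/waitable.hpp"

// Times the destruction triggered by resetting each waitable shared_ptr.
void time_waitable_destruction(std::vector<std::shared_ptr<rclcpp::Waitable>> & waitable_handles)
{
  for (auto & waitable : waitable_handles) {
    const auto start = std::chrono::steady_clock::now();
    waitable.reset();  // if this shared_ptr is the last owner, the waitable is destroyed here
    const auto stop = std::chrono::steady_clock::now();
    std::cout << "waitable reset took "
              << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()
              << " ms" << std::endl;
  }
}
```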

@clalancette
Contributor Author

@brawner I've assigned you to this for any follow-up work to determine exactly where the destruction slowness is coming from. If it turns out to be in Connext code, and there is nothing we can do about it, then I'll suggest opening a Known Issue on https://index.ros.org/doc/ros2/Releases/Release-Rolling-Ridley/ (and maybe the other distributions as well).

@brawner

brawner commented Oct 21, 2020

It appears deleting the participant is the main hangup here. This line accounts for all 3.18 seconds of node destruction:

DDS::ReturnCode_t local_ret = dpf_->delete_participant(participant);

Here are a couple of forum posts, but I'm not sure they help me understand the issue any better:
https://community.rti.com/forum-topic/ddstheparticipantfactory-deleteparticipant-hangs
https://community.rti.com/forum-topic/deleteparticipant-never-returns

@clalancette
Contributor Author

Here are a couple of forum posts, but I'm not sure they help me understand the issue any better:
https://community.rti.com/forum-topic/ddstheparticipantfactory-deleteparticipant-hangs
https://community.rti.com/forum-topic/deleteparticipant-never-returns

There's one thing in there that is potentially helpful: it seems the delays occur because delete_participant is waiting for internal threads to complete. If we can figure out how to make those threads finish faster, shutdown could be sped up.

However, since we don't have the source to Connext, figuring that out may be tricky.

@neil-rti
Contributor

Hi Chris - I'm not certain whether this has already been explored, but there is a QoS setting for the shutdown period. The thread at
https://community.rti.com/forum-topic/3-seconds-shutdown-domainparticipant
has the Q&A and links to the QoS documentation.
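For anyone who wants to experiment, a rough sketch of what lowering that setting might look like with the classic Connext C++ API used by rmw_connext_cpp follows. The `database.shutdown_cleanup_period` field name comes from the linked forum thread and should be verified against the RTI QoS documentation; the 10 ms value is arbitrary.

```cpp
#include <ndds/ndds_namespace_cpp.h>  // classic Connext C++ API (namespaced); header name assumed

// Sketch only: lowers the default participant shutdown cleanup period before
// any participants are created, so delete_participant() does not spend several
// cleanup cycles waiting. Field names taken from the linked RTI forum thread.
bool shorten_shutdown_period()
{
  DDS::DomainParticipantFactory * dpf = DDS::DomainParticipantFactory::get_instance();
  if (dpf == nullptr) {
    return false;
  }

  DDS::DomainParticipantQos participant_qos;
  if (dpf->get_default_participant_qos(participant_qos) != DDS::RETCODE_OK) {
    return false;
  }

  participant_qos.database.shutdown_cleanup_period.sec = 0;
  participant_qos.database.shutdown_cleanup_period.nanosec = 10 * 1000 * 1000;  // 10 ms

  return dpf->set_default_participant_qos(participant_qos) == DDS::RETCODE_OK;
}
```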

jacobperron added a commit to ros2/launch_ros that referenced this issue Oct 28, 2020
Resolves #184.

With some RMW implementations, like rmw_connext_cpp, it takes a bit longer to initialize or shut down ROS nodes.

Related issue: ros2/rmw_connext#325

Signed-off-by: Jacob Perron <jacob@openrobotics.org>
@clalancette
Contributor Author

Hi Chris - I'm not certain whether this has already been explored, but there is a QoS setting for the shutdown period. The thread at
https://community.rti.com/forum-topic/3-seconds-shutdown-domainparticipant
has the Q&A and links to the QoS documentation.

I didn't know about that one. Thanks for the heads-up, I'll give it a try and see if it makes a difference to this problem.

jacobperron added a commit to ros2/launch_ros that referenced this issue Nov 20, 2020
Resolves #184.

With some RMW implementations, like rmw_connext_cpp, it takes a bit longer to initialize or shut down ROS nodes.

Related issue: ros2/rmw_connext#325

Signed-off-by: Jacob Perron <jacob@openrobotics.org>