Isolate timer interface #447

gezp · 2021-08-19T05:17:18Z

Signed-off-by: zhenpeng ge <zhenpeng.ge@qq.com>

As described in #446, CreateTimerROS creates timers using the application's Node, so there are potential issues when the application has long-running callbacks (services, etc.).

In this PR, a dedicated callback group and an executor running in its own thread are used to isolate the TF timers, so that they won't be blocked by the application's callbacks.

ahcorde

There are some style issues and some tests are failing: https://build.ros2.org/job/Rpr__geometry2__ubuntu_focal_amd64/313/

SteveMacenski · 2021-08-19T16:22:23Z

tf2_ros_test_message_filter and test_message_filter tests failed (in addition to linting)

gezp · 2021-08-19T23:18:30Z

I have fixed these issues, please review again.

SteveMacenski · 2021-08-20T03:38:21Z

tf2_ros/src/create_timer_ros.cpp

-  rclcpp::node_interfaces::NodeTimersInterface::SharedPtr node_timers)
-: node_base_(node_base), node_timers_(node_timers), next_timer_handle_index_(0)
+  rclcpp::node_interfaces::NodeTimersInterface::SharedPtr node_timers,
+  bool spin_thread)


spin_thread doesn't seem quite right here. Since this has to do specifically with the timer processing, I think a different name would be more appropriate, so it's clearer to users what this is actually for (e.g. it allows the timer interface to be processed independently from the base_node application processing).

gezp · 2021-08-20

Could you give me some suggestions for naming? How about is_independent?

SteveMacenski · 2021-08-20

I'm not really sure, actually. I'd be interested in what the maintainers here think (or whether they feel it's even worth parameterizing, or whether it should just always be enabled).

SteveMacenski · 2021-08-20T03:39:13Z

tf2_ros/src/create_timer_ros.cpp

+      rclcpp::CallbackGroupType::MutuallyExclusive, false);
+    executor_ = std::make_shared<rclcpp::executors::SingleThreadedExecutor>();
+    executor_->add_callback_group(callback_group_, node_base_);
+    dedicated_thread_ = std::make_unique<std::thread>([&]() {executor_->spin();});


Question for maintainers: would it make more sense to you to only have this thread running when it is in use (e.g. create the thread in create_timer, and destroy it when we reset / cancel the timer)?

tfoote · 2021-09-20T08:03:47Z

Yes, we shouldn't start a thread if it's going to be unused. That will cause overhead and make debugging generally harder, especially if there end up being several instances.

SteveMacenski · 2021-08-25T00:56:16Z

@ahcorde @tfoote can you give this another look? Zhenpeng's summer project has a limited timeline, and getting this and #442 handled would let us get a lot more done in Nav2 with the remainder of his time in the summer program.

Beyond the comments I already left, I approve this approach.

gezp · 2021-09-04T03:22:57Z

Please review this PR; I need your suggestions. Thank you! @clalancette @tfoote

SteveMacenski · 2021-09-16T21:14:22Z

@ahcorde can you give this a look?

tfoote

Thinking about the threading: if you cancel the callback, the thread should also stop. While looking at that, I see that you've added cleanup for the thread on destruction. However, it doesn't appear that the timer gets cancelled on destruction, which I think is a potential lifecycle issue: the callback refers to `this`, so if `this` is destructed while the callback can still run, that will be a use-after-free bug.
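The lifecycle concern can be illustrated with a minimal standard-library sketch. It is not the PR's code and does not use rclcpp: `ToyTimerWorker` is a hypothetical class whose worker loop captures `this`, and its destructor stops the loop and joins the thread before any member is destroyed, which is the ordering that avoids the use-after-free described here.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>
#include <utility>

// Hypothetical helper (not part of tf2_ros): owns a thread whose loop uses `this`.
class ToyTimerWorker
{
public:
  explicit ToyTimerWorker(std::function<void()> callback)
  : callback_(std::move(callback))
  {
    assert(callback_ != nullptr);  // an empty callback would throw inside run()
    // Start the thread only after every other member is fully initialized.
    thread_ = std::thread([this]() {run();});
  }

  ~ToyTimerWorker()
  {
    // "Cancel" first, then join: once join() returns, the loop can no longer
    // touch callback_ or stop_, so destroying them afterwards is safe.
    stop_.store(true);
    if (thread_.joinable()) {
      thread_.join();
    }
  }

  ToyTimerWorker(const ToyTimerWorker &) = delete;
  ToyTimerWorker & operator=(const ToyTimerWorker &) = delete;

private:
  void run()
  {
    while (!stop_.load()) {
      callback_();
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
  }

  std::function<void()> callback_;
  std::atomic<bool> stop_{false};
  std::thread thread_;
};
```

Skipping the join (or detaching the thread) would leave `run()` free to call `callback_` after the object is gone, which is the bug class being flagged.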

At first I was thinking that this was just going to be for internal usage and then the internal thread used for subscribing should work. But this is for a user callback. And as such we should not create inaccessible callback groups and extra threads. We should expose to the user the ability to associate the timer callback with a callback group of their choice. If they do not associate the callback with a group, they will get the current behavior of using the default callback group. If the user cares about this not blocking they have the option then to use a different default callback group setting or to create a custom callback group. This leaves the user in charge of the threading and provides them with all the standard threading tools that are consistent with the rest of the system instead of a single dimensional boolean value.

gezp · 2021-09-20T09:33:47Z

@tfoote thanks for your suggestions. I have updated this PR; please review again.

ros-discourse · 2021-09-20T20:53:04Z

This pull request has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/ros-2-tsc-meeting-minutes-2021-9-16/22372/1

ahcorde

Can we add a test?

gezp · 2021-09-21T14:51:31Z

Sorry, I couldn't find an existing test file for CreateTimerROS, so I don't know how to write a test for it. Can you give me more details? @ahcorde

tfoote

Thanks, passing the callback groups through is much cleaner.

SteveMacenski · 2021-09-28T00:00:17Z

So @gezp, what would the implementation of this look like if we wanted to isolate this in Nav2? Would we need to make a new callback group to pass to it? That works, but isn't very clean, since I think this is the behavior that should be the default; otherwise it's a little buried to users that TF2's timer is being run on their node's executor.

> If they do not associate the callback with a group, they will get the current behavior of using the default callback group.

I believe the opposite is better, but agree that exposing it to the user is necessary. I think by default we should construct a callback group for use internally, but if a user does not want isolated behavior, then they can pass in their own default / other callback group.

That would still leave the user in charge of the threading if they wanted to but then not trip up TF2 with user application code execution models.

gezp · 2021-09-28T12:50:47Z

> what would the implementation of this look like if we wanted to isolate this in Nav2? Would we need to make a new callback group to pass to it?

Yes, if we want to isolate CreateTimerROS, we need to create a new callback group in the Nav2 node and pass it to CreateTimerROS.
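As a rough illustration of the shape being discussed (an optional callback group that falls back to the node's default one), here is a toy model using only the standard library. `ToyNode`, `ToyCallbackGroup`, and `ToyTimerInterface` are hypothetical stand-ins, not the rclcpp or tf2_ros types, and the real constructor signature may differ.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Toy stand-in for a callback group; only the name matters here.
struct ToyCallbackGroup
{
  std::string name;
};

// Toy stand-in for a node that always has a default callback group.
class ToyNode
{
public:
  ToyNode()
  : default_group_(std::make_shared<ToyCallbackGroup>(ToyCallbackGroup{"default"})) {}

  std::shared_ptr<ToyCallbackGroup> default_group() const {return default_group_;}

  std::shared_ptr<ToyCallbackGroup> create_group(const std::string & name) const
  {
    return std::make_shared<ToyCallbackGroup>(ToyCallbackGroup{name});
  }

private:
  std::shared_ptr<ToyCallbackGroup> default_group_;
};

// Toy stand-in for the timer interface: the caller may pass a group;
// passing nothing keeps the existing behavior (the node's default group).
class ToyTimerInterface
{
public:
  explicit ToyTimerInterface(
    const ToyNode & node,
    std::shared_ptr<ToyCallbackGroup> group = nullptr)
  : group_(group ? group : node.default_group())
  {
    assert(group_ != nullptr);
  }

  const std::string & group_name() const {return group_->name;}

private:
  std::shared_ptr<ToyCallbackGroup> group_;
};
```

With this shape, `ToyTimerInterface shared(node);` ends up on the "default" group, while `ToyTimerInterface isolated(node, node.create_group("tf_timers"));` ends up on a group the caller controls and can hand to whichever executor they choose.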

tfoote · 2021-09-28T23:23:55Z

> otherwise it's a little buried to users that TF2's timer is being run on their node's executor.

It's not buried. Every callback that you register uses the node's default executor unless you specify otherwise.

> I think by default we should construct a callback group for use internally, but if a user does not want isolated behavior, then they can pass in their own default / other callback group.

For consistency, we shouldn't create an extra thread just for TF timer callbacks (isolation requires a thread, not just a separate callback queue). That's not how we work with any other callbacks, and I don't see why this should by default appear in a different callback queue than any other timer callback that you register. There's no inherently blocking call within the library, or anything else that differentiates this tf2 timer callback from any other callback or timer on the system, all of which will be blocked by the long-running callback. The case that this is solving is that the user/developer has both registered a long running callback and then wants a quick firing callback to be run in parallel. To do that the standard answer is to change the executor to be multi threaded, or put the callback in different queues so they can be serviced simultaneously. The tf2 library is designed to be a library that can be accessed from any thread and doesn't change or add any complexity to the user's threading model. Adding a TF2 timer callback shouldn't automatically add a thread to your system.

Adding a thread by default will solve this case, but will make all the other cases more complex. In this case it's very clear: if you run a long-running callback, your timers won't fire until you return, unless you enable multi-threading in one way or another. If these timer callbacks were to fire by default in an isolated thread, for consistency I would want all timer callbacks to fire on isolated threads, at which point we're significantly increasing the complexity of the default, simplest system. We would then have the option to override the complex system to return it to a simple single-threadable system, but that's counter to our design goal of keeping the default situation simple.
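The blocking behavior described above can be reproduced in a small standard-library model. This is only a sketch: `ToyExecutor` is a hypothetical stand-in for a single-threaded executor (it is not rclcpp), and `timer_fires_during_long_callback` is an illustrative helper. A "long-running" callback waits on a future that only the "timer" callback can fulfil; when both share one queue the wait times out, and when the timer has its own queue serviced by another thread the wait succeeds.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <future>
#include <queue>
#include <thread>
#include <utility>

// Toy single-threaded executor: runs queued callbacks one after another.
class ToyExecutor
{
public:
  void add(std::function<void()> callback) {callbacks_.push(std::move(callback));}

  void spin_all()
  {
    while (!callbacks_.empty()) {
      auto callback = std::move(callbacks_.front());
      callbacks_.pop();
      callback();
    }
  }

private:
  std::queue<std::function<void()>> callbacks_;
};

// Returns true if the timer callback fired while the long-running
// application callback was still waiting for it.
bool timer_fires_during_long_callback(bool isolate_timer, std::chrono::milliseconds timeout)
{
  assert(timeout.count() > 0);
  std::promise<void> timer_fired;
  std::future<void> fired = timer_fired.get_future();
  bool seen_in_time = false;

  ToyExecutor app_executor;    // serviced by the calling thread
  ToyExecutor timer_executor;  // serviced by a second thread

  // Long-running application callback: blocks until the timer fires or the wait times out.
  app_executor.add([&]() {
    seen_in_time = fired.wait_for(timeout) == std::future_status::ready;
  });

  auto timer_callback = [&]() {timer_fired.set_value();};
  if (isolate_timer) {
    timer_executor.add(timer_callback);  // separate queue, separate thread
  } else {
    app_executor.add(timer_callback);    // queued behind the long-running callback
  }

  std::thread timer_thread([&]() {timer_executor.spin_all();});
  app_executor.spin_all();
  timer_thread.join();
  return seen_in_time;
}
```

Calling it with `isolate_timer = false` returns `false` after the timeout elapses, because the timer callback cannot run until the long callback returns; with `isolate_timer = true` it returns `true` as soon as the second thread runs the timer callback. Note that a second queue alone is not enough: something has to service it concurrently, which is the point made about needing a thread.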

SteveMacenski · 2021-09-29T18:00:23Z

I think it's a matter of preference, and I respect that viewpoint. My thought is that utilities like TF should, by default, fully encapsulate their execution (while remaining overridable) so that users don't have to think about it in their application code. Keeping application code that uses utility libraries as clear as possible is my usual aim.

> Every callback that you register uses the node's default executor unless you specify otherwise.

I intellectually don't disagree with that, but I'm not sure how many users, on average, understand what TF is doing behind the scenes well enough to have that appreciation when they're simply copy-pasting code around to get a TF Buffer / Listener to work in their project. They're not trying to use CreateTimerROS (nor are they even aware of the actual createTimer API, which registers the callback for them; that's all internal to TF); they're trying to use TF2. They're registering this timer interface for TF to use, but I doubt many people are really tracking where the timer interface object is being used or how.

For example, below is a typical snippet of code for using TF in a ROS 2 application. It's not immediately clear what the timer interface is doing, or how it might impact the system's execution. I would argue it is pretty hidden from a user that calling waitForTransform (or something that calls it) within a callback with the default executor could produce a future that times out for no other reason than that TF is sharing execution with the application. That's a pretty subtle issue to debug, especially if it doesn't always happen because the transforms are published at irregular or slow rates.

  tf_ = std::make_shared<tf2_ros::Buffer>(get_clock());
  auto timer_interface = std::make_shared<tf2_ros::CreateTimerROS>(
    get_node_base_interface(),
    get_node_timers_interface());
  tf_->setCreateTimerInterface(timer_interface);
  transform_listener_ = std::make_shared<tf2_ros::TransformListener>(*tf_);

> The case that this is solving is that the user/developer has both registered a long running callback and then wants a quick firing callback to be run in parallel. To do that the standard answer is to change the executor to be multi threaded, or put the callback in different queues so they can be serviced simultaneously.

No disagreement intellectually with that, but it requires a user of TF to get to the point where they're even aware that this is the issue at hand. Some design advice / exposure / documentation in that respect would go a long way (or hiding those details from the user by default). I don't believe the thread needs to be running at all times, just created "in-time" when a timer is created and destroyed when the timer is canceled.

But I respect your position -- just wanting to give you a full accounting. This can be solved by documentation (and internal warnings within TF if it detects that this is what is happening) as well. If we don't enable default behavior that hides this concern from them, I would recommend having a warning if we detect lock-up on that timer/future so that people can immediately know what's happening and what might be a recommended course of action to resolve.

SteveMacenski · 2021-10-12T20:15:55Z

So what's the word here, can we merge this?

ros-discourse · 2021-11-01T20:52:32Z

This pull request has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/ros-2-tsc-meeting-minutes-2021-10-28/22947/1

SteveMacenski · 2021-11-18T22:58:49Z

@ahcorde @clalancette what's the good word?

ros-discourse · 2021-11-22T22:12:08Z

This pull request has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/ros-2-tsc-meeting-minutes-2021-11-18/23209/1

tfoote · 2022-01-05T01:12:50Z

@ahcorde the changes are passing CI and have been significantly cleaned up. I've re-requested a review from you.

SteveMacenski · 2022-01-21T01:18:23Z

@ahcorde

What's the good word?

gezp · 2022-03-24T01:38:31Z

@ahcorde could you find time to review this PR?

ahcorde · 2022-03-24T18:16:45Z

Linux
Linux-aarch64
Windows

ahcorde requested changes Aug 19, 2021

gezp requested a review from ahcorde August 19, 2021 08:14

SteveMacenski suggested changes Aug 20, 2021

ahcorde requested a review from clalancette August 24, 2021 13:59

tfoote reviewed Sep 20, 2021

ahcorde requested changes Sep 21, 2021

tfoote approved these changes Sep 27, 2021

clalancette self-assigned this Nov 18, 2021

tfoote requested a review from ahcorde January 5, 2022 01:11

gezp mentioned this pull request Jan 21, 2022

Get rid of service client nodes ros-navigation/navigation2#816

Closed

gezp mentioned this pull request Mar 24, 2022

Reduce amcl_node nodes ros-navigation/navigation2#2483

Merged

6 tasks

use dedicated callback group and executor to isolate timer

dab507d

Signed-off-by: zhenpeng ge <zhenpeng.ge@qq.com>

fix some issues

Signed-off-by: zhenpeng ge <zhenpeng.ge@qq.com>

update

Signed-off-by: zhenpeng ge <zhenpeng.ge@qq.com>

ahcorde approved these changes Mar 24, 2022

ahcorde merged commit bbd5d56 into ros2:ros2 Mar 24, 2022

clalancette mentioned this pull request Apr 29, 2022

ROS 2 Humble needed release notes ros2/ros2_documentation#2446

Closed

87 tasks
