Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Isolate timer interface #447

Merged
merged 1 commit into from
Mar 24, 2022
Merged

Isolate timer interface #447

merged 1 commit into from
Mar 24, 2022

Conversation

gezp
Copy link
Contributor

@gezp gezp commented Aug 19, 2021

Signed-off-by: zhenpeng ge zhenpeng.ge@qq.com

As described in #446, CreateTimerROS create timers by using application's Node, there are some potential issue when application has some long running callback of services, etc.

In this PR, a dedicated callback group and executor with thread are used to isolate TF timer so that timers won't be blocked by application's callback.

Copy link
Contributor

@ahcorde ahcorde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some style issues and some tests are failing https://build.ros2.org/job/Rpr__geometry2__ubuntu_focal_amd64/313/

@gezp gezp requested a review from ahcorde August 19, 2021 08:14
@SteveMacenski
Copy link
Contributor

tf2_ros_test_message_filter and test_message_filter tests failed (in addition to linting)

@gezp
Copy link
Contributor Author

gezp commented Aug 19, 2021

I have fixed these issues, please review again.

rclcpp::node_interfaces::NodeTimersInterface::SharedPtr node_timers)
: node_base_(node_base), node_timers_(node_timers), next_timer_handle_index_(0)
rclcpp::node_interfaces::NodeTimersInterface::SharedPtr node_timers,
bool spin_thread)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spin_thread doesn't seem quite right here. Since this has to do specifically with the timer processing, I think a different name here would be more appropriate so its more clear to users what this actually is for (e.g. allows the timer interface to be processed independently from the base_node application processing).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you give me some suggestions for naming ? how about is_independent

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really sure actually. I'd be interested what the maintainers here think (or if they feel its even worth parameterizing or should just always be enabled)

rclcpp::CallbackGroupType::MutuallyExclusive, false);
executor_ = std::make_shared<rclcpp::executors::SingleThreadedExecutor>();
executor_->add_callback_group(callback_group_, node_base_);
dedicated_thread_ = std::make_unique<std::thread>([&]() {executor_->spin();});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question for maintainers: Would it make more sense to you to only have this thread running when in use (e.g. when we create_timer create this thread as well and when we reset / cancel it we destroy it then?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we we shouldn't start a thread if it's going to be unused. That will cause overhead and make debugging generally harder. Especially if there ends up potentially being several instances.

@SteveMacenski
Copy link
Contributor

SteveMacenski commented Aug 25, 2021

@ahcorde @tfoote can you give this another look? Zhenpeng's summer project has a limited timeline and getting this and #442 handled would let us get alot more done in Nav2 with the remainder of his time in the summer program.

Beyond the comments I already left, I approve this approach.

@gezp
Copy link
Contributor Author

gezp commented Sep 4, 2021

please review this PR, i need your suggestion, thank you! @clalancette @tfoote.

@SteveMacenski
Copy link
Contributor

@ahcorde can you give this a look?

Copy link
Contributor

@tfoote tfoote left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about the threading, if you cancel the callback the thread should also stop. And while looking at that. I see that you've added cleanup for the thread on destruction. However it doesn't appear that the timer gets cancelled on destruction which I think is a potential lifecycle issue as the callback refers to this but if this destructs that will be a use after free bug.

At first I was thinking that this was just going to be for internal usage and then the internal thread used for subscribing should work. But this is for a user callback. And as such we should not create inaccessible callback groups and extra threads. We should expose to the user the ability to associate the timer callback with a callback group of their choice. If they do not associate the callback with a group, they will get the current behavior of using the default callback group. If the user cares about this not blocking they have the option then to use a different default callback group setting or to create a custom callback group. This leaves the user in charge of the threading and provides them with all the standard threading tools that are consistent with the rest of the system instead of a single dimensional boolean value.

rclcpp::CallbackGroupType::MutuallyExclusive, false);
executor_ = std::make_shared<rclcpp::executors::SingleThreadedExecutor>();
executor_->add_callback_group(callback_group_, node_base_);
dedicated_thread_ = std::make_unique<std::thread>([&]() {executor_->spin();});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we we shouldn't start a thread if it's going to be unused. That will cause overhead and make debugging generally harder. Especially if there ends up potentially being several instances.

@gezp
Copy link
Contributor Author

gezp commented Sep 20, 2021

@tfoote thanks for your suggestions, i have updated this PR, please review again.

@ros-discourse
Copy link

This pull request has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/ros-2-tsc-meeting-minutes-2021-9-16/22372/1

Copy link
Contributor

@ahcorde ahcorde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add a test ?

@gezp
Copy link
Contributor Author

gezp commented Sep 21, 2021

sorry, i couldn't find a test file about CreateTimerROS , so i don't know how to do for test, can you give me more details about it? @ahcorde

Copy link
Contributor

@tfoote tfoote left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks that's much cleaner passing through the callback groups.

@SteveMacenski
Copy link
Contributor

SteveMacenski commented Sep 28, 2021

So @gezp, what would the implementation of this look like if we wanted to isolate this in Nav2? Would we need to make a new callback group to pass to it? That works, but isn't very clean, since I think this is the behavior that should be default, otherwise its a little buried to users that TF2's timer is being run on their node's executor.

If they do not associate the callback with a group, they will get the current behavior of using the default callback group.

I believe the opposite is better, but agree that exposing it to the user is necessary. I think by default we should construct a callback group for use internally, but if a user does not want isolated behavior, then they can pass in their own default / other callback group.

That would still leave the user in charge of the threading if they wanted to but then not trip up TF2 with user application code execution models.

@gezp
Copy link
Contributor Author

gezp commented Sep 28, 2021

what would the implementation of this look like if we wanted to isolate this in Nav2? Would we need to make a new callback group to pass to it?

yes, if we want to isolate CreateTimerROS , we need to create a new callback group in Nav2 node, and pass it to CreateTimerROS.

@tfoote
Copy link
Contributor

tfoote commented Sep 28, 2021

otherwise its a little buried to users that TF2's timer is being run on their node's executor.

It's not buried. Every callback that you register uses the nodes default executor unless you specify otherwise.

I think by default we should construct a callback group for use internally, but if a user does not want isolated behavior, then they can pass in their own default / other callback group.

For consistency we shouldn't create an extra thread (not just a callback queue is required for isolating) just for TF timer callbacks. That's not how we work with any other callbacks. And I don't see why this should appear by default in a different callback queue than any other time callback that you register. There's no inherently blocking call to within the library or anything else that differentiates this callback for a tf2 timer from any other callback or timer on the system all of which will be blocked by the long running callback. The case that this is solving is that the user/developer has both registered a long running callback and then wants a quick firing callback to be run in parallel. To do that the standard answer is to change the executor to be multi threaded, or put the callback in different queues so they can be serviced simultaneously. The tf2 library is designed to be a library that can be accessed from any thread and doesn't change or add any complexity to the users' threading model. Adding a TF2 timer callback shouldn't automatically add a thread to your system.

Adding a thread by default will solve this case, but will make all the other cases more complex. In this case it's very clear, if you run a long running callback your timers won't fire until you return, unless you enable multi-threading in one way or another. If these timer callback were to fire by default in an isolated thread, for consistency I would want all the timer callbacks to fire on isolated threads. At which point we're now significantly increasing the complexity of the default simplest system. And we would have the option to overide the complex system to return it to a simple single threadable system. But that's counter to our design goals of keeping the default situation simple.

@SteveMacenski
Copy link
Contributor

SteveMacenski commented Sep 29, 2021

I think its a matter of preference. I respect that viewpoint. My thoughts on it is that utilities like TF should fully encapsulate, by default, its execution (but be override-able) such that users don't have to think about it in their application code. Keeping the application code using utility libraries as clear as possible is my usual aim.

Every callback that you register uses the nodes default executor unless you specify otherwise.

I intellectually don't disagree with that, but I'm not sure on average how many users really understand what TF is doing behind the scenes to have that appreciation when they're simply copy-pasting code around to get a TF Buffer / Listener to work in their project. They're not trying to use CreateTimerROS (or the actual API of createTimer which registers the callback for them to even be aware of, that's all internal to TF), they're trying to use TF2. They're registering this timer interface for TF to use, but I doubt many people are really tracking where the timer interface object is being used or how.

For example, below is a typical snippet of code for using TF in a ROS 2 application. It's not immediately clear to anyone what the timer interface is doing, or how it might impact their system's execution. I would argue it is pretty hidden from a user that calling waitForTransform (or something that call it) within a callback with the default executor could have a future that times out for no other reason than TF is sharing execution with the application. That's a pretty subtle issue to debug, especially if it doesn't always happen due to the transforms being published at irregular or not-fast rates.

  tf_ = std::make_shared<tf2_ros::Buffer>(get_clock());
  auto timer_interface = std::make_shared<tf2_ros::CreateTimerROS>(
    get_node_base_interface(),
    get_node_timers_interface());
  tf_->setCreateTimerInterface(timer_interface);
  transform_listener_ = std::make_shared<tf2_ros::TransformListener>(*tf_);

The case that this is solving is that the user/developer has both registered a long running callback and then wants a quick firing callback to be run in parallel. To do that the standard answer is to change the executor to be multi threaded, or put the callback in different queues so they can be serviced simultaneously.

No disagreement intellectually to that, but it requires a user of TF to be able to get to a point that they're even aware that this is the issue at hand. Some design advice / exposure / documentation to that respect would go a long way (or by default hiding those details from a user). I don't believe the thread needs to be running at all times, just created "in-time" when a timer is created and destroyed when the timer is canceled.

But I respect your position -- just wanting to give you a full accounting. This can be solved by documentation (and internal warnings within TF if it detects that this is what is happening) as well. If we don't enable default behavior that hides this concern from them, I would recommend having a warning if we detect lock-up on that timer/future so that people can immediately know what's happening and what might be a recommended course of action to resolve.

@SteveMacenski
Copy link
Contributor

So what's the word here, can we merge this?

@ros-discourse
Copy link

This pull request has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/ros-2-tsc-meeting-minutes-2021-10-28/22947/1

@clalancette clalancette self-assigned this Nov 18, 2021
@SteveMacenski
Copy link
Contributor

SteveMacenski commented Nov 18, 2021

@ahcorde @clalancette what's the good word?

@ros-discourse
Copy link

This pull request has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/ros-2-tsc-meeting-minutes-2021-11-18/23209/1

@tfoote tfoote requested a review from ahcorde January 5, 2022 01:11
@tfoote
Copy link
Contributor

tfoote commented Jan 5, 2022

@ahcorde the changes are passing CI and significantly cleaned up. I've rerequested a review from you.

@SteveMacenski
Copy link
Contributor

@ahcorde

What's the good word?

@gezp
Copy link
Contributor Author

gezp commented Mar 24, 2022

@ahcorde could you have a time to review this PR?

Signed-off-by: zhenpeng ge <zhenpeng.ge@qq.com>

fix some issues

Signed-off-by: zhenpeng ge <zhenpeng.ge@qq.com>

update

Signed-off-by: zhenpeng ge <zhenpeng.ge@qq.com>
@ahcorde
Copy link
Contributor

ahcorde commented Mar 24, 2022

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants