Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Executor callbacks are no longer in a predictable order #2532

Open
thomasmoore-torc opened this issue May 14, 2024 · 25 comments
Open

Executor callbacks are no longer in a predictable order #2532

thomasmoore-torc opened this issue May 14, 2024 · 25 comments
Assignees

Comments

@thomasmoore-torc
Copy link

Bug report

Required Info:

Steps to reproduce issue

  1. Build the https://github.com/boschresearch/ros2_response_time_analysis project against ROS2 Jazzy (note that some minimal changes are required).
  2. Observe that the results of running the test show that the callbacks are executed in an arbitrary order as compared to the previous ordering which was with respect to the order in which the subscriptions/services were added to the node.

Expected behavior

With ROS2 Eloquent through Iron, the callbacks all executed in the following order, where the ordering was defined by the order in which the subscriptions/services were added to the node.

image

Actual behavior

With ROS2 Jazzy, the ordering is now arbitrary and not related to the order in which messages were received or the order in which subscriptions were added to the node. The execution order of timers also appears to be affected.

image

NOTE: While the low, medium, and high subscriptions are processed in the same order that they are transmitted, I believe this to be a fluke as the service and timer callbacks are not processed in the order they were registered or sent. For reference, the callbacks are registered in the following order:

  1. high
  2. medium
  3. low
  4. srv_high
  5. srv_medium
  6. srv_low
  7. timers (as directed by the test)

And the topics/services are sent in the following order:

  1. low
  2. medium
  3. high
  4. srv_high
  5. srv_low
  6. low
  7. medium
  8. high
  9. srv_high
  10. srv_low
  11. srv_medium
  12. srv_medium
  13. high

With timers set to expire at 200, 200, 2300, and 2300 milliseconds into the test.

Additional information

I believe this change is a result of #2142 due to the EntityCollection class being based on std::unordered_map where the key is the handle returned by rcl for the timer/subscription/etc. The entities are then added to the wait set in whatever order they are iterated through by the std::unordered_map and thus processed in that order. I haven't had the opportunity to test it, but I believe the ordering could be different from system to system or even run to run as the rcl handle values change.

The default AllocatorMemoryStrategy used in previous versions of rclcpp utilized std:vector for storing rcl handles, which maintained the order in which they were added to the node.

@alsora
Copy link
Collaborator

alsora commented May 14, 2024

Having a fixed order of callbacks is a design flaw of the SingleThreadedExecutor.
IMO users shouldn't rely on it.

Callbacks should be part of an event-driven pipeline and so they should be invoked in the order in which they are triggered, this is what happens with other executors such as the EventsExecutor.

@fujitatomoya
Copy link
Collaborator

@thomasmoore-torc thanks for pointing out the behavior difference.

although i am not sure if that is design flaw of SingleThreadedExecutor, i am with @alsora 's comment that the application should not rely on that.
however, some user application takes advantage of this fixed order behavior to prioritize the event in the same entity as described here. in that case, this is a big behavior change for user application... at least we should put this down on jazzy release note?

CC: @clalancette @mjcarroll @wjwwood @alsora

@thomasmoore-torc
Copy link
Author

For me, the concern is that the executor has behaved in the same manner likely since its inception and certainly across many releases of ROS2, and there is published work that describes this behavior. It's very likely that some users of the default executor have made use of this behavior and will break when migrating to Jazzy.

It's also concerning to me that, while the previous behavior was not necessarily defined, it was at least somewhat deterministic in that whatever callbacks were ready in the wait set were executed in the order that the callbacks were registered. The new approach is completely non-deterministic because the order is based on the hash of the memory addresses of the rcl handles for the waitables (e.g. the keys of the unordered maps). This seems like a step backwards to me since determinism is something that has been discussed as a need for users of ROS2.

This change certainly caught me off guard and caused me to spend a few hours today understanding what changed, which is why I thought it was important to raise it as a potential issue.

@mjcarroll
Copy link
Member

mjcarroll commented May 15, 2024

I fall somewhere in the middle on this one.

While I wouldn't go so far as to call it a design flaw, I think relying on the order that the callbacks are invoked based on how they are registered is probably a bit of a code smell. At the same time, ROS 2 has been around long enough for Hyrum's law to apply at this point.

With a sufficient number of users of an API,
it does not matter what you promise in the contract:
all observable behaviors of your system
will be depended on by somebody.

I think that the new implementation is still deterministic in that things that are ready from a particular instance of wait will be executed, though the order may now be different.

To understand this a bit more, can you expand on what the impact you are seeing is? It was clearly enough to get your attention, so I'm curious what caused you to notice.

@wjwwood
Copy link
Member

wjwwood commented May 15, 2024

Thanks for bringing this up @thomasmoore-torc, and for making such a thorough issue about it. We certainly didn't mean to disrupt folks with the change, but that was always a possibility.

If I understand it correctly, then this change in behavior is rooted in our use of the unordered map, now ordering based on hashing rather than registration order (as a "final" tie-breaker)? If that's the case, then I think we could try to get back to that behavior, as long as it doesn't significantly impact performance (which it isn't clear to me if that's the case or not) or break any of our new tests (it shouldn't as those tests shouldn't care about execution order).

For me, the concern is that the executor has behaved in the same manner likely since its inception and certainly across many releases of ROS2, and there is published work that describes this behavior.

At the same time, ROS 2 has been around long enough for Hyrum's law to apply at this point.

Hyrum's rule aside, we have major releases and versions specifically so we can change things over time, as well as API specs so that we can control what is depended on and what isn't. And users should not have been depending on this particular behavior in my opinion, even if that behavior made sense and/or was least surprising. I have run into similar assumptions on my part about how APIs should behave in other libraries, e.g. the C++ std library, but after understanding the reason for the behavior, I never felt like it was the APIs fault, merely a lack of understand of the fundamentals on my part. We can't operate under the assumption that any part of the API or implementation is "done" and will never change, published works or not.

We have to also consider how this impacts other users who want more performance and throughput but do not depend on the ordering as described above. Should we restore the ordering at a performance loss? For me that's problematic because it violates the "don't pay for what you don't use" principal.

That all being said, hopefully we can do both (keep the improvements of these changes and make the behavior more like the old behavior). And/or we can consider making this behavior explicit in the API contract, so we can test it and prevent regression against it, but I would say we could only do that for specific executors like the SingleThreadedExecutor perhaps. Putting that requirement about ordering on the generic Executor API is probably too restrictive to be applied to all implementations.

For now, we can add some documentation to the release notes and to the executor API docs to make clear that this ordering isn't guaranteed and that's changed in Jazzy. Anyone willing to take a stab at that?

In the meantime, I'll look at fixing the order of the events without breaking any other improvements we've made. I'll have a look at https://github.com/boschresearch/ros2_response_time_analysis, but if there were a simple regression test that passed on Iron and failed on Jazzy, that would be easiest to work with. Do you have anything like that @thomasmoore-torc?

@fujitatomoya
Copy link
Collaborator

For now, we can add some documentation to the release notes and to the executor API docs to make clear that this ordering isn't guaranteed and that's changed in Jazzy. Anyone willing to take a stab at that?

@wjwwood i can take care of this. CC: @thomasmoore-torc

@thomasmoore-torc
Copy link
Author

To understand this a bit more, can you expand on what the impact you are seeing is? It was clearly enough to get your attention, so I'm curious what caused you to notice.

@mjcarroll - I was experimenting with an idea to use the clock jump callback to allow for execution of timer callbacks immediately on receipt of the /clock message as opposed to after the timer's guard condition is set to allow for a more deterministic ordering of callbacks. I was basing my experiment on the observation that callbacks are executed in their registration order, which would allow my experiment to call the timer callbacks before the other subscription callbacks. While it was just an experiment, I noticed that the callback ordering was not what it had been in previous version of ROS, which is what prompted me to raise this issue.

We have to also consider how this impacts other users who want more performance and throughput but do not depend on the ordering as described above. Should we restore the ordering at a performance loss? For me that's problematic because it violates the "don't pay for what you don't use" principal.

@wjwwood - this is a fair statement. However, I believe that determinism in robotics is just as important as performance, and maybe even more so in some aspects of robotics where safety is a consideration. That said, it's really hard to make a general purpose framework that meets the needs of everyone, so I understand where you are coming from.

I'm still wrapping my head around how the WaitSet is incorporated into the executor. It seems to me that the MemoryStrategy approach previously used by the executor has now been distributed among the wait_set_policies and the ExecutorEntitiesCollection, neither of which are configurable in the current implementation of the executor in a similar manner to how the MemoryStrategy was configurable previously. Perhaps a potential solution to this issue is to find a way to make these configurable within the executor to allow users to provide their own implementations without having to fully create a new executor.

@alsora
Copy link
Collaborator

alsora commented May 15, 2024

Perhaps a potential solution to this issue is to find a way to make these configurable within the executor to allow users to provide their own implementations without having to fully create a new executor.

The EventsExecutor indeed allows to define custom "queue" implementations.
The queue class defines the ordering in which ready entities are processed in one iteration of "spin".

We currently only have a FIFO queue, based on the order in which entities are triggered.
It's possible to define queues with fixed order of entities (either arbitrary or user-defined).
Custom queues can live outside of rclcpp, in the users workspaces.

We should consider if/when we will deprecate the waitset-based executors like SingleThreadedExecutor due to their problems (mostly performance).

@thomasmoore-torc
Copy link
Author

@alsora - to be fair, the EventsExecutor does indeed work for the experiment I was performing. However, this issue is more about the fact that the order in which the waitset-based executors like SingleThreadedExecutor processes callbacks has changed.

@thomasmoore-torc
Copy link
Author

@fujitatomoya - it might be worth adding something to the docs page for the Executors in the Scheduling semantics section: https://docs.ros.org/en/jazzy/Concepts/Intermediate/About-Executors.html#scheduling-semantics

@alsora
Copy link
Collaborator

alsora commented May 15, 2024

this issue is more about the fact that the order in which the waitset-based executors like SingleThreadedExecutor processes callbacks has changed.

Yes, the reason why I brought up the EventsExecutor is mostly because if we want to make improvements to the executors, which is a critical part of ROS applications, we should be very clear about what's an expected behavior and part of the API contract or not.
For example, if we explicitly indicated that the SingleThreadedExecutor (which is the current default) must have a fixed order of execution, it may be a lot harder to eventually change the default to a different executor, since other implementation may not guarantee that (for whatever reason).

IMO, it's worth investigating if it's possible to keep the old behavior without affecting performance, but I wouldn't go as far as adding to the API description (nor I would restore the old behavior if it has performance implications).
We should rather warn users that they shouldn't rely on the implementation details of the executor.

@wjwwood
Copy link
Member

wjwwood commented May 16, 2024

We should consider if/when we will deprecate the waitset-based executors like SingleThreadedExecutor due to their problems (mostly performance).

I wouldn't be in favor of deprecating them. I think they provide an important approach in that they have a queue-less implementation (at least not re-implementing the queues of the middleware), which avoids the need for queue pruning strategies and/or unbounded growth, and doesn't require re-implementing queue management QoS settings that the middleware already provides.

It's possible that we could achieve similar without the rcl wait set interface (it would instead use the callback based interface in rcl), but it would still result in a flat "ready set" (for lack of a better term) as a result for the user, rather than a queue.

Perhaps a potential solution to this issue is to find a way to make these configurable within the executor to allow users to provide their own implementations without having to fully create a new executor.

I agree the new implementation doesn't expose as much for customization, which is a fair point. There is a task (for post Jazzy) to implement a better "Executor Interface" abstract base class, which would all also be a good time to consider a modular and extensible approach for scheduling and other parts of the executor behavior.

However, the interactions between these components is extremely fragile, just because this is a technically difficult issue that deals with state and thread-safety and reentrancy etc..., so any customization is likely to be similarly fragile, but it should still be possible by the interface design.


I haven't finished investigating it, but my hope is still to make a change to restore the old behavior without a significant performance impact. To set expectations, however, this will likely be a bugfix to Jazzy after the release as to not invalidate the other testing we've already done. Though that's also not set in stone. I'll discuss it further with the release team.

@wjwwood wjwwood self-assigned this May 16, 2024
@wjwwood wjwwood moved this to In Progress in Jazzy Jalisco May 16, 2024
@ros-discourse
Copy link

This issue has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/ros-2-client-library-working-group-update/37307/10

@jmachowinski
Copy link
Contributor

We should consider if/when we will deprecate the waitset-based executors like SingleThreadedExecutor due to their problems (mostly performance).

I wouldn't be in favor of deprecating them. I think they provide an important approach in that they have a queue-less implementation (at least not re-implementing the queues of the middleware), which avoids the need for queue pruning strategies and/or unbounded growth, and doesn't require re-implementing queue management QoS settings that the middleware already provides.

It's possible that we could achieve similar without the rcl wait set interface (it would instead use the callback based interface in rcl), but it would still result in a flat "ready set" (for lack of a better term) as a result for the user, rather than a queue.

I am with @alsora on this, and think that the waitset approach is a dead end.
FYI I have a POC on replacing this with an events approach, that results in a sort of a "ready set". I haven't had time yet to continue this work yet.

I haven't finished investigating it, but my hope is still to make a change to restore the old behavior without a significant performance impact.

I had a short look at the code, even if it would be possible, I don't see how we could patch anything in this code, without breaking ABI as it is all templates now.

@thomasmoore-torc I am a bit confused by the test, do the subscriptions receive all data at the same time ?

@thomasmoore-torc
Copy link
Author

@jmachowinski - the diagram below illustrates the issue I was looking into. The idea was that we were trying to see if we could get the callbacks to occur in the same order in which the associated were transmitted. More specifically, we wanted the timer callback to occur before the other callbacks. The thought was that since timers have higher priority than subscriptions, the timer callback should get called first, even when using simulation time. However, timer processing when using simulation time is a two step process (e.g. requires two waits of the wait set), so it's possible that subscription callbacks can get called before the timer callback when using simulation time.

image

The change we were experimenting with was to have the clock jump handler registered by the timer call the timer callback directly instead of triggering the guard condition.

image

The experiment above worked with the SingleThreadedExecutor in Iron (with the clock thread disabled) because of the fact that the internal /clock subscription was registered before all of the application subscriptions and the implementation was such that the callbacks would get called in their registration order. However, it did not work in Jazzy because of the change I've documented above where the order of callback is determined by the iteration order of std::unordered_map.

As @alsora suggested, the EventsExecutor is potentially a better choice for this since it is guaranteed to execute callbacks in the order in which the corresponding actions occurred except that there are issues with simulation time (#2480) and it suffers from the same two step timer issue that I outlined above, meaning that timer callbacks will still be called after subscriptions that occurred after the /clock message was received that would have caused a timer to become ready. However, the experiment of calling the timer callbacks from the jump handler in conjunction with using the EventsExecutor does yield callbacks being called in the expected order.

@jmachowinski
Copy link
Contributor

The idea was that we were trying to see if we could get the callbacks to occur in the same order in which the associated were transmitted.

What is the time between transmissions ?

@thomasmoore-torc
Copy link
Author

What is the time between transmissions ?

Negligible. Our simulation environment runs in frames where the simulation processing step occurs and then the topics are transmitted starting with /clock. As such, we sometimes receive messages which have a newer timestamp than the clock (due to the race condition caused by the clock thread) or we have subscription callbacks which occur before the timer callback (as illustrated above). Neither of these conditions are representative of the ordering that would be observed when not using simulated time.

We've deviated from the specific issue that this issue was intended to present, so would it make sense to move the sim time clock/timer discussion into a new issue?

@jmachowinski
Copy link
Contributor

I think there is a misunderstanding going on, I am still confused by the execution order, you expect vs the actual order.

What is strange to me, is the big times between the expected messages you got in the plot in your first post vs the execution order. It looks like the messages are send with 10th to 100th of ms apart. In this case you should get the expected order. Or is the system busy all the time processing the messages ?

My main question would be, do the messages (from the perspective of the executor) become ready all at the same time, or is there a delay between them ?

@jmachowinski
Copy link
Contributor

I think what would help me understand this better, would be :

  • System time stamps when the messages were send
  • System time stamps, when the processing of the message started
  • System time stamps, when the processing of the message ended
  • System time stamps, when the processing of the timer started
  • System time stamps, when the processing of the timer ended

If its not to much work, could you provide these ?

@tcm0116
Copy link

tcm0116 commented May 22, 2024

I think I've created some confusion. In the original post above, I used the source code from https://github.com/boschresearch/ros2_response_time_analysis to illustrate the effect that the change to the executor has. This repository is the code that was used by the Casini et al. at ECRTS 2019 paper referenced by the ROS2 Executor Scheduling semantics documentation. This code uses 500ms execution times for each callback to help illustrate the polling points and execution order used by the SingleThreadedExecutor.

The observations and experiment I outlined in #2532 (comment) were what I was working on when I noticed the difference in the executor behavior, which prompted me to revisit the https://github.com/boschresearch/ros2_response_time_analysis to better understand the change. Since that analysis is referenced by the ROS2 documentation, I thought it that using it for this issue would help illustrate the effect of the change to an analysis the ROS2 community was already aware of due to it being referenced in the documentation.

@fujitatomoya
Copy link
Collaborator

@thomasmoore-torc (or @tcm0116 ?)

just checking if i understand correctly.

The thought was that since timers have higher priority than subscriptions, the timer callback should get called first, even when using simulation time.
However, timer processing when using simulation time is a two step process (e.g. requires two waits of the wait set), so it's possible that subscription callbacks can get called before the timer callback when using simulation time.

true, simulation time requires the timer update based on /clock subscription on time source. then it can call the timer in next loop.
this is actually inverted order compared to system or steady time source once it comes to the simulation time.

The change we were experimenting with was to have the clock jump handler registered by the timer call the timer callback directly instead of triggering the guard condition.

because of above order issue for simulation time, it creates the simulation clock with jump handlers and then attach it to the time source of the node?
so that you do not use timer, but jump handlers can be issued when the clock subscription is called. (again, this is based on assumption that internal /clock subscription is 1st to be called in the queue if that is ready.)

the internal /clock subscription was registered before all of the application subscriptions and the implementation was such that the callbacks would get called in their registration order.

and then, that is broken with jazzy...

it suffers from the same two step timer issue that I outlined above, meaning that timer callbacks will still be called after subscriptions that occurred after the /clock message was received that would have caused a timer to become ready.

as you mentioned, it seems this is another issue to discuss. i would suggest that we can create another issue for this to track.

However, the experiment of calling the timer callbacks from the jump handler in conjunction with using the EventsExecutor does yield callbacks being called in the expected order.

thanks for sharing this, not a perfect solution but at least work-around.

@thomasmoore-torc
Copy link
Author

@thomasmoore-torc (or @tcm0116 ?)

Oops. I accidentally replied using my personal account.

I've created #2535 for the timer-specific issue that has been discussed.

@wjwwood
Copy link
Member

wjwwood commented May 23, 2024

I added a regression test specifically for the order of execution within a single entity type, which strikes at the originally reported issue and the use of the unordered_map which changed the tie-breaking, see: #2536

Can you guys have a look at that and see if it is reasonable and/or covers what was originally reported here?

Note that the EventsExecutor, while deterministic, is the call order, not the registration order, but that's to be expected as events can never occur at the exact same time from its perspective, as far as I can see. For the other executors, any events that occur while not spinning, or while executing between waits in the case of single threaded executors, appear (to the executor) to occur at the same time, so when executing them later the "tie-breaking" order used is the registration order, or it was until we started iterating over an unordered map, then it became the hash order.


I have an ABI-breaking change for rolling that can fix this (may need iteration or discussion), see: #2537

However, the above only works for rolling and the fix for jazzy would require a different mechanism that doesn't break ABI and might be prohibitively expensive to fix, but I'll look into that if we agree we should be fixing this in this way.


I'll catch up on the new discussion here asap.

@wjwwood
Copy link
Member

wjwwood commented May 23, 2024

Replying in order of the new comments...


I am with @alsora on this, and think that the waitset approach is a dead end.

I'm still not 100% convinced, but I do think the callback base approach is faster and generally preferable with the current rmw implementations, but I wouldn't extend that to deprecating the existing executors that happen to use the wait set as an implementation detail.

FYI I have a POC on replacing this with an events approach, that results in a sort of a "ready set". I haven't had time yet to continue this work yet.

Yeah, that's what I'd like to see and if we had that I could see that as a reason to deprecate the existing executors or replace their guts with that approach. But it's just fundamentally different from the way the EventsExecutor works.


The experiment above worked with the SingleThreadedExecutor in Iron (with the clock thread disabled) because of the fact that the internal /clock subscription was registered before all of the application subscriptions and the implementation was such that the callbacks would get called in their registration order. However, it did not work in Jazzy because of the change I've documented above where the order of callback is determined by the iteration order of std::unordered_map.

I think my proposed change to rolling will fix this, but ... oof that's fragile.

@wjwwood
Copy link
Member

wjwwood commented May 23, 2024

Just to help connect the dots, I made a suggestion about how to achieve this in a less fragile way over on the new issue that @thomasmoore-torc: #2535

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Progress
Development

No branches or pull requests

8 participants