-
Notifications
You must be signed in to change notification settings - Fork 430
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Executor callbacks are no longer in a predictable order #2532
Comments
Having a fixed order of callbacks is a design flaw of the Callbacks should be part of an event-driven pipeline and so they should be invoked in the order in which they are triggered, this is what happens with other executors such as the |
@thomasmoore-torc thanks for pointing out the behavior difference. although i am not sure if that is design flaw of |
For me, the concern is that the executor has behaved in the same manner likely since its inception and certainly across many releases of ROS2, and there is published work that describes this behavior. It's very likely that some users of the default executor have made use of this behavior and will break when migrating to Jazzy. It's also concerning to me that, while the previous behavior was not necessarily defined, it was at least somewhat deterministic in that whatever callbacks were ready in the wait set were executed in the order that the callbacks were registered. The new approach is completely non-deterministic because the order is based on the hash of the memory addresses of the This change certainly caught me off guard and caused me to spend a few hours today understanding what changed, which is why I thought it was important to raise it as a potential issue. |
I fall somewhere in the middle on this one. While I wouldn't go so far as to call it a design flaw, I think relying on the order that the callbacks are invoked based on how they are registered is probably a bit of a code smell. At the same time, ROS 2 has been around long enough for Hyrum's law to apply at this point.
I think that the new implementation is still deterministic in that things that are ready from a particular instance of To understand this a bit more, can you expand on what the impact you are seeing is? It was clearly enough to get your attention, so I'm curious what caused you to notice. |
Thanks for bringing this up @thomasmoore-torc, and for making such a thorough issue about it. We certainly didn't mean to disrupt folks with the change, but that was always a possibility. If I understand it correctly, then this change in behavior is rooted in our use of the unordered map, now ordering based on hashing rather than registration order (as a "final" tie-breaker)? If that's the case, then I think we could try to get back to that behavior, as long as it doesn't significantly impact performance (which it isn't clear to me if that's the case or not) or break any of our new tests (it shouldn't as those tests shouldn't care about execution order).
Hyrum's rule aside, we have major releases and versions specifically so we can change things over time, as well as API specs so that we can control what is depended on and what isn't. And users should not have been depending on this particular behavior in my opinion, even if that behavior made sense and/or was least surprising. I have run into similar assumptions on my part about how APIs should behave in other libraries, e.g. the C++ std library, but after understanding the reason for the behavior, I never felt like it was the APIs fault, merely a lack of understand of the fundamentals on my part. We can't operate under the assumption that any part of the API or implementation is "done" and will never change, published works or not. We have to also consider how this impacts other users who want more performance and throughput but do not depend on the ordering as described above. Should we restore the ordering at a performance loss? For me that's problematic because it violates the "don't pay for what you don't use" principal. That all being said, hopefully we can do both (keep the improvements of these changes and make the behavior more like the old behavior). And/or we can consider making this behavior explicit in the API contract, so we can test it and prevent regression against it, but I would say we could only do that for specific executors like the For now, we can add some documentation to the release notes and to the executor API docs to make clear that this ordering isn't guaranteed and that's changed in Jazzy. Anyone willing to take a stab at that? In the meantime, I'll look at fixing the order of the events without breaking any other improvements we've made. I'll have a look at https://github.com/boschresearch/ros2_response_time_analysis, but if there were a simple regression test that passed on Iron and failed on Jazzy, that would be easiest to work with. Do you have anything like that @thomasmoore-torc? |
@wjwwood i can take care of this. CC: @thomasmoore-torc |
@mjcarroll - I was experimenting with an idea to use the clock jump callback to allow for execution of timer callbacks immediately on receipt of the
@wjwwood - this is a fair statement. However, I believe that determinism in robotics is just as important as performance, and maybe even more so in some aspects of robotics where safety is a consideration. That said, it's really hard to make a general purpose framework that meets the needs of everyone, so I understand where you are coming from. I'm still wrapping my head around how the |
The We currently only have a FIFO queue, based on the order in which entities are triggered. We should consider if/when we will deprecate the waitset-based executors like |
@alsora - to be fair, the |
@fujitatomoya - it might be worth adding something to the docs page for the Executors in the Scheduling semantics section: https://docs.ros.org/en/jazzy/Concepts/Intermediate/About-Executors.html#scheduling-semantics |
Yes, the reason why I brought up the IMO, it's worth investigating if it's possible to keep the old behavior without affecting performance, but I wouldn't go as far as adding to the API description (nor I would restore the old behavior if it has performance implications). |
I wouldn't be in favor of deprecating them. I think they provide an important approach in that they have a queue-less implementation (at least not re-implementing the queues of the middleware), which avoids the need for queue pruning strategies and/or unbounded growth, and doesn't require re-implementing queue management QoS settings that the middleware already provides. It's possible that we could achieve similar without the rcl wait set interface (it would instead use the callback based interface in rcl), but it would still result in a flat "ready set" (for lack of a better term) as a result for the user, rather than a queue.
I agree the new implementation doesn't expose as much for customization, which is a fair point. There is a task (for post Jazzy) to implement a better "Executor Interface" abstract base class, which would all also be a good time to consider a modular and extensible approach for scheduling and other parts of the executor behavior. However, the interactions between these components is extremely fragile, just because this is a technically difficult issue that deals with state and thread-safety and reentrancy etc..., so any customization is likely to be similarly fragile, but it should still be possible by the interface design. I haven't finished investigating it, but my hope is still to make a change to restore the old behavior without a significant performance impact. To set expectations, however, this will likely be a bugfix to Jazzy after the release as to not invalidate the other testing we've already done. Though that's also not set in stone. I'll discuss it further with the release team. |
This issue has been mentioned on ROS Discourse. There might be relevant details there: https://discourse.ros.org/t/ros-2-client-library-working-group-update/37307/10 |
I am with @alsora on this, and think that the waitset approach is a dead end.
I had a short look at the code, even if it would be possible, I don't see how we could patch anything in this code, without breaking ABI as it is all templates now. @thomasmoore-torc I am a bit confused by the test, do the subscriptions receive all data at the same time ? |
@jmachowinski - the diagram below illustrates the issue I was looking into. The idea was that we were trying to see if we could get the callbacks to occur in the same order in which the associated were transmitted. More specifically, we wanted the timer callback to occur before the other callbacks. The thought was that since timers have higher priority than subscriptions, the timer callback should get called first, even when using simulation time. However, timer processing when using simulation time is a two step process (e.g. requires two waits of the wait set), so it's possible that subscription callbacks can get called before the timer callback when using simulation time. The change we were experimenting with was to have the clock jump handler registered by the timer call the timer callback directly instead of triggering the guard condition. The experiment above worked with the As @alsora suggested, the |
What is the time between transmissions ? |
Negligible. Our simulation environment runs in frames where the simulation processing step occurs and then the topics are transmitted starting with We've deviated from the specific issue that this issue was intended to present, so would it make sense to move the sim time clock/timer discussion into a new issue? |
I think there is a misunderstanding going on, I am still confused by the execution order, you expect vs the actual order. What is strange to me, is the big times between the expected messages you got in the plot in your first post vs the execution order. It looks like the messages are send with 10th to 100th of ms apart. In this case you should get the expected order. Or is the system busy all the time processing the messages ? My main question would be, do the messages (from the perspective of the executor) become ready all at the same time, or is there a delay between them ? |
I think what would help me understand this better, would be :
If its not to much work, could you provide these ? |
I think I've created some confusion. In the original post above, I used the source code from https://github.com/boschresearch/ros2_response_time_analysis to illustrate the effect that the change to the executor has. This repository is the code that was used by the Casini et al. at ECRTS 2019 paper referenced by the ROS2 Executor Scheduling semantics documentation. This code uses 500ms execution times for each callback to help illustrate the polling points and execution order used by the The observations and experiment I outlined in #2532 (comment) were what I was working on when I noticed the difference in the executor behavior, which prompted me to revisit the https://github.com/boschresearch/ros2_response_time_analysis to better understand the change. Since that analysis is referenced by the ROS2 documentation, I thought it that using it for this issue would help illustrate the effect of the change to an analysis the ROS2 community was already aware of due to it being referenced in the documentation. |
@thomasmoore-torc (or @tcm0116 ?) just checking if i understand correctly.
true, simulation time requires the timer update based on
because of above order issue for simulation time, it creates the simulation clock with jump handlers and then attach it to the time source of the node?
and then, that is broken with jazzy...
as you mentioned, it seems this is another issue to discuss. i would suggest that we can create another issue for this to track.
thanks for sharing this, not a perfect solution but at least work-around. |
Oops. I accidentally replied using my personal account. I've created #2535 for the timer-specific issue that has been discussed. |
I added a regression test specifically for the order of execution within a single entity type, which strikes at the originally reported issue and the use of the Can you guys have a look at that and see if it is reasonable and/or covers what was originally reported here? Note that the EventsExecutor, while deterministic, is the call order, not the registration order, but that's to be expected as events can never occur at the exact same time from its perspective, as far as I can see. For the other executors, any events that occur while not spinning, or while executing between waits in the case of single threaded executors, appear (to the executor) to occur at the same time, so when executing them later the "tie-breaking" order used is the registration order, or it was until we started iterating over an unordered map, then it became the hash order. I have an ABI-breaking change for rolling that can fix this (may need iteration or discussion), see: #2537 However, the above only works for rolling and the fix for jazzy would require a different mechanism that doesn't break ABI and might be prohibitively expensive to fix, but I'll look into that if we agree we should be fixing this in this way. I'll catch up on the new discussion here asap. |
Replying in order of the new comments...
I'm still not 100% convinced, but I do think the callback base approach is faster and generally preferable with the current rmw implementations, but I wouldn't extend that to deprecating the existing executors that happen to use the wait set as an implementation detail.
Yeah, that's what I'd like to see and if we had that I could see that as a reason to deprecate the existing executors or replace their guts with that approach. But it's just fundamentally different from the way the EventsExecutor works.
I think my proposed change to rolling will fix this, but ... oof that's fragile. |
Just to help connect the dots, I made a suggestion about how to achieve this in a less fragile way over on the new issue that @thomasmoore-torc: #2535 |
Bug report
Required Info:
Steps to reproduce issue
Expected behavior
With ROS2 Eloquent through Iron, the callbacks all executed in the following order, where the ordering was defined by the order in which the subscriptions/services were added to the node.
Actual behavior
With ROS2 Jazzy, the ordering is now arbitrary and not related to the order in which messages were received or the order in which subscriptions were added to the node. The execution order of timers also appears to be affected.
NOTE: While the
low
,medium
, andhigh
subscriptions are processed in the same order that they are transmitted, I believe this to be a fluke as the service and timer callbacks are not processed in the order they were registered or sent. For reference, the callbacks are registered in the following order:And the topics/services are sent in the following order:
With timers set to expire at 200, 200, 2300, and 2300 milliseconds into the test.
Additional information
I believe this change is a result of #2142 due to the EntityCollection class being based on
std::unordered_map
where the key is the handle returned byrcl
for the timer/subscription/etc. The entities are then added to the wait set in whatever order they are iterated through by thestd::unordered_map
and thus processed in that order. I haven't had the opportunity to test it, but I believe the ordering could be different from system to system or even run to run as thercl
handle values change.The default AllocatorMemoryStrategy used in previous versions of
rclcpp
utilizedstd:vector
for storingrcl
handles, which maintained the order in which they were added to the node.The text was updated successfully, but these errors were encountered: