Events Executor design #305
Conversation
This pull request has been mentioned on ROS Discourse. There might be relevant details there: https://discourse.ros.org/t/ros2-middleware-change-proposal/15863/20
Thanks for working on this!
I think that executor performance is something ROS 2 should really improve.
I have left some comments about the event based approach, but my overall feedback is that we should figure out what's a good default executor and only maintain one in rclcpp.
This doesn't mean that the one in rclcpp should be wait set based; I only mean that I would rather have one single threaded executor in rclcpp (which could be event queue based) instead of three.
If we're going to switch to an event based executor, it would be great to have more information to understand what the real limit of the current wait set based approach is (see my other comments).
I also think that the rmw/rcl API should be flexible enough that you can implement an event queue/work queue/wait set based executor out of tree, and I think that would be possible.
Based on your experience, do you think it's possible to implement an event based executor out of tree? What additions would be needed in rmw/rcl to make that possible?
Off the top of my head, I think that if the rmw API supported attaching listeners to entities, that would make this possible.
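To make the listener idea concrete, here is a minimal, self-contained sketch; the types and function names are hypothetical and not part of the existing rmw API:

```cpp
// Hypothetical sketch only (not an existing rmw API): a listener-style hook on an
// entity. An out-of-tree executor registers a callback and the middleware invokes
// it whenever the entity has new work, instead of the executor polling a wait set.
#include <cstddef>
#include <functional>
#include <iostream>

// Stand-in for an rmw entity that supports attaching a listener.
class FakeSubscription {
public:
  using Listener = std::function<void(std::size_t new_messages)>;

  void attach_listener(Listener listener) { listener_ = std::move(listener); }

  // Called from the middleware side when data arrives.
  void on_data_available(std::size_t count) {
    if (listener_) { listener_(count); }
  }

private:
  Listener listener_;
};

int main() {
  FakeSubscription sub;
  // The executor-side callback: here it just prints, but it could push an event
  // into a queue, wake up a wait set, or schedule work on a thread pool.
  sub.attach_listener([](std::size_t n) { std::cout << n << " new message(s)\n"; });
  sub.on_data_available(2);  // simulate the middleware delivering data
  return 0;
}
```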
Some extra notes about simplifying code in rclcpp:
- AFAIR the StaticSingleThreadedExecutor isn't actually static. It rebuilds the wait set when an entity is added, but it uses a synchronization mechanism instead of always rebuilding it (which is much more efficient). Based on that, it sounds like the StaticSingleThreadedExecutor is always a better choice than the SingleThreadedExecutor, so I think that the latter should be completely removed.
- We also have a rclcpp::WaitSet that we don't use in any of the wait set based executors.
articles/events_executor.md
Outdated
We can identify some major contributors to this overhead:
- The cost of modifying and updating waitsets.
Currently this operation happens multiple times for every iteration of the executor, even if the majority of robotics systems are mostly static, i.e. the number of ROS 2 entities is fixed and known at startup.
A detailed walk-through of the waitset management process can be found [here](https://discourse.ros.org/t/ros2-middleware-change-proposal/15863/6).
I think that one of the things that makes the performance of our current wait set based executors bad is this.
In each iteration, we're collecting entities again and adding them to the rcl wait set, which adds a lot of overhead.
If instead of "nulling" the entities that aren't ready in the input buffer we had a separate output buffer, I think that much of the overhead could be avoided. The steps described here as Clear wait set, Add each entity to the rcl wait set, Add each entity to the rmw wait set won't be needed anymore. The only needed step would be to "zero" the output buffer. The "output buffer" could be a fixed size buffer, to avoid dynamic allocations which would damage performance.
I think that that change would avoid most of the "dance" we currently do to manage waitsets.
Was this approach explored?
I think it's something interesting to explore, in order to understand what is the real limitation of the current approach.
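To make the output-buffer idea a bit more concrete, here is a rough sketch with made-up types (not actual rcl/rmw code), assuming the number of entities is known at startup:

```cpp
// Rough sketch of the "separate output buffer" idea (made-up types, not rcl/rmw code).
// Entities stay attached to the wait set across iterations; each wait only resets and
// refills a fixed-capacity buffer of ready entities, with no per-iteration rebuild
// and no dynamic allocation.
#include <array>
#include <cstddef>

constexpr std::size_t kMaxEntities = 64;  // assumed upper bound, fixed at startup

struct ReadyBuffer {
  std::array<std::size_t, kMaxEntities> ready_indexes{};  // indexes into the entity list
  std::size_t count = 0;

  void clear() { count = 0; }  // the only "zeroing" needed at each iteration

  void push(std::size_t entity_index) {
    if (count < kMaxEntities) {
      ready_indexes[count++] = entity_index;
    }
  }
};
```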
PS: I think that an event based executor is still a valid approach, but I would like to understand the limitations of the current executor better.
By the way, if you have done any profiling of the current executors in your testing it would be really interesting if you can share the resulting flamegraphs (or any other output).
@mauropasse did a proof of concept where he took the StaticSingleThreadedExecutor and directly accessed the DDS wait set from the rclcpp layer, thus removing all the maintenance.
He can give you more details about that.
He can give you more details about that.
The performance definitely improved, but not as much as we expected.
To give you an idea, we were running a 10 node system on a single core platform (Raspberry Pi 1); the CPU usage with the StaticSingleThreadedExecutor was approximately 30%.
We first integrated the TimersManager and this improved to 25%.
Then, on top of this, we added direct access to the DDS wait set and the performance improved to 20%.
On the other hand, the same test with the EventsExecutor results in 11% CPU usage.
I want to add that the numbers mentioned above were from an experiment where we were definitely missing some mutexes, additional safety checks and so on, which would all have had an impact on performance, while the EventsExecutor numbers are from our current implementation, which already includes these.
Hi Ivan. We have explored the "output buffer" approach, using a fixed size buffer which allowed us not to rebuild the waitsets at every iteration. The output buffer was a simple array of ints with the indexes of the ready-to-work entities matching the ExecutableList ordering (the rmw_wait of rmw_cyclonedds already works with indexes).
So, if the ExecutableList was made of:
rclcpp::subscription 1 // This subscription has an event (msg received)
rclcpp::subscription 2
rclcpp::subscription 3 // This subscription has an event (msg received)
then the output buffer array was [1,3], so we had to process subscriptions 1 and 3.
The waitset still has to be walked through in rmw_wait though, to identify the entities ready to work.
This approach led to an improvement in CPU usage, but the amount of code changes across multiple layers didn't justify the not-so-great improvement, so we discarded the idea. That's the point at which we decided that, if multiple-layer code modifications were to be done, not only should the CPU improvement be noticeable but we should also fix some of the issues of the current waitset based executors. The "Requirements" section summarizes what we expected from a new design.
Regarding your comment about profiling of the current executors, you can find a PDF (executors benchmark.pdf) here:
https://discourse.ros.org/t/ros2-middleware-change-proposal/15863/20
Soon we'll find a way to share the benchmark files, as GitHub doesn't allow me to attach them here.
Thanks for your time reviewing this!
The waitset still has to be walked-through on rmw_wait though, to identify entities ready to work.
Can you clarify this? I'm not completely sure what you mean.
rmw_wait is currently doing two "walks" through all rmw entities:
- It's walking through all entities that were added to rmw wait set to get their corresponding DDS conditions, and attach them to the DDS wait set.
- After waiting, it walks again through all rmw entities to generate the output, comparing the conditions of each rmw object with all conditions that the DDS wait set marked as ready.
If we're not zeroing the rmw wait set in each iteration and re-adding entities, the first "walk-through" won't be needed anymore (you just store a wait set with all the conditions already attached in the rmw object).
The second "walk-through" is still needed, but how it's currently implemented is extremely sub-optimal.
We're walking through all rmw entities again, when we only need to walk through the conditions that the DDS wait set marked as ready. The number of ready conditions will typically be far fewer than the total number of rmw entities in the wait set, so iterating over the ready conditions is much better.
e.g.: see Connext example.
A map from DDS conditions to rmw entities will do the rest of the work (in some DDS implementations supporting more modern cpp, you can even attach a handler to a condition).
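As a minimal sketch of that last point (placeholder types, not actual Connext or rmw code): after the DDS wait returns, only the ready conditions are visited and mapped back to their rmw entities, instead of walking every attached entity again.

```cpp
// Placeholder types; the map is built once, when entities are added to the wait set.
#include <unordered_map>
#include <vector>

struct DdsCondition {};  // stands in for a DDS condition handle
struct RmwEntity {};     // stands in for an rmw subscription/service/client/...

std::vector<RmwEntity *> collect_ready_entities(
  const std::vector<DdsCondition *> & ready_conditions,                   // returned by the DDS wait
  const std::unordered_map<DdsCondition *, RmwEntity *> & condition_map)  // condition -> entity
{
  std::vector<RmwEntity *> ready;
  ready.reserve(ready_conditions.size());
  for (DdsCondition * condition : ready_conditions) {
    auto it = condition_map.find(condition);
    if (it != condition_map.end()) {
      ready.push_back(it->second);
    }
  }
  return ready;
}
```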
Disclaimer: Talking about how to improve performance is much easier than actually improving it, so my analysis can be quite wrong.
Current implementations:
Regarding your comment about profiling of the current executors, you can find a PDF executors benchmark.pdf here:
https://discourse.ros.org/t/ros2-middleware-change-proposal/15863/20
Soon we'll find a way to share the benchmark files, as github doesn't allow me to attach them here.
Ah I didn't see the pdf, thanks! I will take a closer look at it.
Can you clarify this? I'm not completely sure what do you mean.
rmw_wait is currently doing two "walks" through all rmw entities:
In the current FastRTPS rmw_wait, there could be potentially four (4) whole "walks" through all waitset entities:
1. To attach the rmw_wait condition variable to all entities (whole waitset). There is a single condition variable shared by all entities.
2. To go again through all of them and see if any has work to do (potentially the whole waitset, if no entity is ready).
3. When the condition variable has been triggered, to check the predicate, i.e. repeat step 2 to check all of them until an entity with work to do is found (potentially the whole waitset, if the last guard condition of the waitset is the only one with work to do).
4. To go through the whole waitset to detach the condition variable and set to 0 the pointer of each entity with no work to do.
As you said, the 1st one won't be needed with an output buffer approach.
The 2nd would still need to run, to potentially avoid the wait on the condition variable, as it's currently done.
The 3rd one would repeat step 2, as it works as the predicate for the triggered condition variable.
The 4th should set the output buffer (the detach part won't be needed either).
The 2nd, 3rd and 4th could be combined in a single function which takes care of checking for work and at the same time sets the output buffer. In the worst case scenario you need to call this function twice, each time going through the whole waitset.
A map from DDS conditions to rmw entities will do the rest of the work (in some DDS implementations supporting more modern cpp, you can even attach a handler to a condition).
Currently there is a single DDS condition variable shared by all rmw entities, so this map won't work unless you create different condition variables for each of them. Attaching a handler to a condition could allow you to set the output buffer directly where the listener is notified about new data. So you end up with a "queue" of entities with work to do, which is quite similar to what we do in the EventsExecutor, where the handler would be a "push an event into the queue" callback function.
In the current FastRTPS rmw_wait, there could be potentially four (4) whole "walks" through all waitset entities:
Thanks for the detailed explanation.
I focused my previous comment more on the rmw_connext implementation, which is a bit different (though it hides some of the complexity in the DDS waitset implementation).
Then, on top of this, we added the direct access to DDS wait set and the performance improved to 20%.
On the other hand, the same test with EventsExecutor results in 11% CPU usage.
I didn't see @alsora comment before.
I think that's a good comparison to do, bare DDS waitset based performance vs the event based executor.
If CPU usage ends up being the main motivation of the change, I wouldn't like to be comparing a well optimized implementation with a badly optimized one.
Let me try to align with your discussion.
There are some differences in the rmw implementations. Depending on whether a DDS wait set or just a condition variable is used, the number of "walk-throughs" varies, or they happen at different places. I feel that with the waitset approach there are at least these steps:
1. Check all the attached conditions (rmw entities) for work to do. If there is already work, go to 4.
2. There is no work to do, so wait for work.
3. After waiting, check all the attached conditions (rmw entities) for work to do.
4. Fill the output buffer.
As you said, all the clearing, attaching and detaching with every wait can be optimized. If the attaching and detaching to an underlying DDS wait set were done in the context of calls like rcl_wait_set_add_subscription(), I'm wondering if rmw_wait() would no longer need the rmw entities as parameters but only the rmw_wait_set.
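A hypothetical signature for that (this is not part of the current rmw API, just an illustration of the idea) could look like:

```cpp
#include <rmw/rmw.h>  // rmw_ret_t, rmw_wait_set_t, rmw_time_t

// Hypothetical, simplified wait call: if entities were attached/detached once,
// inside rcl_wait_set_add_subscription() and friends, waiting would only need the
// wait set itself plus a timeout; ready entities would be reported through an
// output buffer stored in the wait set.
rmw_ret_t
rmw_wait_on_attached_entities(
  rmw_wait_set_t * wait_set,         // entities already attached, persists across waits
  const rmw_time_t * wait_timeout);
```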
I'm wondering if rmw_wait() would no longer need the rmw entities as parameters but only the rmw_wait_set
Yeah, that sounds like it would be the case.
Depending on whether a DDS wait set or just a condition variable is used, the number of "walk-throughs" vary or are at different places
A waitset is likely implemented with a condition variable + mutex; see e.g. here and here for code partially showing what cyclonedds does.
@alsora Hi, I am trying to reuse this great work on the event based executor.
But could you show me some code to demonstrate the meaning of "direct access to DDS wait set"?
From all the rmw code I have read, the user code actually gets direct access to the DDS wait set; it does a static_cast type conversion. For example, see rmw_iceoryx rmw_wait():
https://github.com/ros2/rmw_iceoryx/blob/foxy/rmw_iceoryx_cpp/src/rmw_wait.cpp
```cpp
rmw_wait(
  rmw_subscriptions_t * subscriptions,
  rmw_guard_conditions_t * guard_conditions,
  rmw_services_t * services,
  rmw_clients_t * clients,
  rmw_events_t * events,
  rmw_wait_set_t * wait_set,
  const rmw_time_t * wait_timeout)
{
  RCUTILS_CHECK_ARGUMENT_FOR_NULL(subscriptions, RMW_RET_ERROR);
  RCUTILS_CHECK_ARGUMENT_FOR_NULL(guard_conditions, RMW_RET_ERROR);
  RCUTILS_CHECK_ARGUMENT_FOR_NULL(services, RMW_RET_ERROR);
  RCUTILS_CHECK_ARGUMENT_FOR_NULL(clients, RMW_RET_ERROR);
  RCUTILS_CHECK_ARGUMENT_FOR_NULL(events, RMW_RET_ERROR);
  RCUTILS_CHECK_ARGUMENT_FOR_NULL(wait_set, RMW_RET_ERROR);
  RMW_CHECK_TYPE_IDENTIFIERS_MATCH(
    rmw_wait
    : waitset, wait_set->implementation_identifier,
    rmw_get_implementation_identifier(), return RMW_RET_ERROR);

  iox::popo::WaitSet<iox::MAX_NUMBER_OF_ATTACHMENTS_PER_WAITSET> * waitset =
    static_cast<iox::popo::WaitSet<iox::MAX_NUMBER_OF_ATTACHMENTS_PER_WAITSET> *>(wait_set->data);
  // ...
}
```
So I am confused about the difference between direct access and non-direct access to the DDS wait_set.
Thanks for your time.
articles/events_executor.md
Outdated
- The cost of modifying and updating waitsets.
Currently this operation happens multiple times for every iteration of the executor, even if the majority of robotics systems are mostly static, i.e. the number of ROS 2 entities is fixed and known at startup.
A detailed walk-through of the waitset management process can be found [here](https://discourse.ros.org/t/ros2-middleware-change-proposal/15863/6).
- Inefficiency of the timers management.
👍
How timers are currently handled adds a lot of overhead, and it also doesn't work well for simulated time (particularly, RTF > 1 isn't working at all AFAIR).
articles/events_executor.md
Outdated
Then the executor can process these events in a FIFO manner, without need for expensive entities look-ups.
Processing an event results in different operations depending on the entity that generated it.

Timers are executed in a separate task by a timers manager, where they are kept in a priority queue sorted according to their expiration time. The task then has to monitor only the first element of the queue and can execute its callback as soon as it expires.
I think this proposal is valid also in the case of the current wait set based executors.
Yes, the TimersManager is on purpose a separate entity from the executor, so that multiple parts of the system (or even applications themselves) can take advantage of it.
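For illustration, a much-simplified, one-shot-only sketch of the timers manager idea (not the actual TimersManager code; periodic timers would be re-enqueued after execution, and this ignores shutdown and simulated time):

```cpp
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <vector>

using Clock = std::chrono::steady_clock;

struct Timer {
  Clock::time_point expiration;
  std::function<void()> callback;
};

struct LaterExpiresFirst {
  bool operator()(const Timer & a, const Timer & b) const {
    return a.expiration > b.expiration;  // turns the priority_queue into a min-heap on expiration
  }
};

class SimpleTimersManager {
public:
  void add_timer(Timer timer) {
    std::lock_guard<std::mutex> lock(mutex_);
    timers_.push(std::move(timer));
    wakeup_.notify_one();  // a new timer may expire earlier than the current head
  }

  // Meant to run in a dedicated thread: only the head of the queue is monitored.
  void run() {
    std::unique_lock<std::mutex> lock(mutex_);
    while (true) {
      if (timers_.empty()) {
        wakeup_.wait(lock);
        continue;
      }
      auto next_expiration = timers_.top().expiration;
      if (wakeup_.wait_until(lock, next_expiration) == std::cv_status::timeout) {
        Timer expired = timers_.top();
        timers_.pop();
        lock.unlock();
        expired.callback();  // run the user callback outside the lock
        lock.lock();
      }
      // Otherwise we were woken up early (e.g. a new timer was added): re-evaluate the head.
    }
  }

private:
  std::priority_queue<Timer, std::vector<Timer>, LaterExpiresFirst> timers_;
  std::mutex mutex_;
  std::condition_variable wakeup_;
};
```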
Though it reaches beyond the scope of this PR, bear in mind that eventually the TimersManager class will have to deal with different timers associated with different clocks. Perhaps a queue per clock type? The synchronous API for the spin_once() implementation will probably suffer though.
Could you clarify what's the use-case for having timers with different clock coexist in the same executor?
Could you clarify what's the use-case for having timers with different clock coexist in the same executor?
The simplest one I can think of is a node that unconditionally requires a timer to be associated with the system clock (e.g. to implement some I/O timeout) but otherwise associates all other timers to the ROS clock for proper operation in simulation whenever RTF != 1.
How do the existing executors handle this situation?
In general I think that mixing timers with different clocks in the same execution context can easily lead to errors.
They don't :). Thus the disclaimer above:
Though it reaches beyond the scope of this PR, bear in mind that eventually the TimersManager class will have to deal with different timers...
It's a known issue that hopefully we can overcome at some point in the future.
articles/events_executor.md
Outdated
- Scalability: The executor must perform well independently of the system size, e.g. the latency for processing a fixed amount of messages must be independent from the number of entities.
This can flow into the "You don't pay for what you don't use" paradigm.
- Extensibility: The executor should set a base from which it is straight-forward to add more functionalities without impacting performances or requiring heavy modifications of the ROS2 core libraries.
- Ordering of events: The executor should process events in the same order they have occurred.
This is an interesting point, and in the wait set based approach keeping events ordered is maybe more difficult (I'm not sure if it's even possible or not).
There are still some corner cases in an event based executor, as described at the end of the article and in this comment.
I don't think that an approach with only a wait set can provide correct ordering of events at all.
In a wait set, each entity can only signal if it has work to do or not, but not how many "events" have to be processed.
If you have 2 ready subscriptions, one with 1 message and the other with 2 messages, it will be quite complex to figure out the correct order for processing those messages.
You will definitely need another structure on top of the wait set to order the messages correctly.
articles/events_executor.md
Outdated
The `EventsQueue` does not have a knowledge of QoS settings, this may cause some issues in some particular situations.

##### Unbound growth of event queue

As long as events are pushed into the queue, this will keep growing, regardless of the actual history size of the underlying middleware entity.
This may cause several events to accumulate, for example while the `EventsExecutor` is not spinning.
I think that the big advantage that the wait set based executor has is that it doesn't suffer from these two issues (that's already handled by the queue of each entity, based on its QoS settings).
Yes, the wait set approach does not suffer from the unbound growth issue.
I'm sure it is possible to address that also in an event based executor, but it is definitely not simple.
On the other hand, the second open issue mentioned here (i.e. that in some situations the order may be invalidated) is simply completely ignored by the wait set approach, where there are no guarantees at all about the order in which entities are processed.
I.e. it could happen that for a specific entity, there are more events in the executor queue than actually in the queue of the entity. Would the executor then try to take a message from a subscription and not execute the callback as it realized that the take did not return a message?
@budrus it is possible to do that, but it requires additional APIs.
At the moment an executor in the rclcpp layer can only call execute_subscription(Subscription), which will take the first element from the underlying DDS queue of the subscription.
The events executor design would allow attaching IDs to the messages/events, but then the rclcpp layer should also provide an API execute_subscription(Subscription, id) that will execute only if it's the correct element.
You can find more details on the problem at the bottom of the design page (i.e. Invalid ordering corner case).
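A rough sketch of that hypothetical id-checked execution (made-up types and API, not existing rclcpp code):

```cpp
#include <cstdint>

// Stand-in for an rclcpp subscription whose middleware queue tracks message ids.
class FakeSubscription {
public:
  std::uint64_t front_message_id() const { return front_id_; }  // id of the oldest stored message
  void take_and_execute() { /* take the message and run the user callback */ }
  void set_front_id(std::uint64_t id) { front_id_ = id; }

private:
  std::uint64_t front_id_ = 0;
};

// Execute only if the event still matches the message at the front of the queue.
bool execute_subscription_if_current(FakeSubscription & sub, std::uint64_t event_id) {
  if (sub.front_message_id() != event_id) {
    return false;  // the message was overwritten (e.g. KEEP_LAST depth exceeded): drop the event
  }
  sub.take_and_execute();
  return true;
}
```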
I'm surprised statically allocated (bounded) buffers were not a requirement for resource constrained platforms -- often running time-sensitive applications. Large enough circular buffers would probably be OK for most such cases (whatever large enough means for any given application).
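As a purely illustrative sketch of the bounded-buffer suggestion (overwrite-oldest is an assumed policy here; the design would still have to decide how such a bound interacts with QoS):

```cpp
#include <array>
#include <cstddef>
#include <optional>

// Fixed-capacity ring buffer that could back a bounded events queue.
template<typename EventT, std::size_t Capacity>
class BoundedEventQueue {
public:
  void push(const EventT & event) {
    buffer_[head_] = event;
    head_ = (head_ + 1) % Capacity;
    if (size_ == Capacity) {
      tail_ = (tail_ + 1) % Capacity;  // full: the oldest event is dropped
    } else {
      ++size_;
    }
  }

  std::optional<EventT> pop() {
    if (size_ == 0) {
      return std::nullopt;
    }
    EventT event = buffer_[tail_];
    tail_ = (tail_ + 1) % Capacity;
    --size_;
    return event;
  }

private:
  std::array<EventT, Capacity> buffer_{};
  std::size_t head_ = 0;
  std::size_t tail_ = 0;
  std::size_t size_ = 0;
};
```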
@ivanpauno thank you for taking the time to evaluate our proposal. Here are some replies to your comment.
Yes, the ROS client library should really provide only 1 executor that works well.
In order to implement the EventsExecutor, we had to only add some boilerplate code in rmw/rcl layers. It's just the same API copy-pasted for all the entities.
Yes, the StaticSingleThreadedExecutor is not really static and from our experience it always performs better than the SingleThreadedExecutor. From a discussion I had with @wjwwood I remember that the […] However, from our analysis, both these solutions still cause a considerable CPU overhead with respect to running your pub/sub application without ROS and directly using your middleware of choice.
I strongly disagree with this. There is no reason that a client library should be limited to one executor. We should figure out what our use cases are and ensure that each is covered completely by one executor. If the collection of use cases requires more than one executor to completely cover, then so be it. There is also the possibility that a particular language may provide multiple ways to handle event queues and multi-threading, so one executor for each might make sense in that language's client library. There should be one default executor, of course. When you call […]
articles/events_executor.md
Outdated
The data structure so includes the type of the entity that generated the event and a handle to its corresponding rclcpp object.

Considering the entities currently available in ROS 2, the content of the event data structure can be any of the following:
- `ExecutorEventType::SUBSCRIPTION_EVENT` and an identifier for a `rclcpp::SubscriptionBase` object.
This identifier for a SubscriptionBase is the subscription_handle that is provided with the rmw_subscription_set_events_executor_callback call?
Yes, exactly.
In the implementation that we proposed (see the other PRs) that's a raw pointer to an rclcpp subscription itself.
However anything that will allow the executor to get what is needed to call execute_subscription() will do the work: for example, the executor may have a map of weak pointers to subscriptions and the event may include a key for that map.
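Putting the pieces of the design text together, the event data structure is roughly the following (names approximate; see the linked PRs for the exact definition):

```cpp
// Approximate sketch of the event structure described in the design article.
enum class ExecutorEventType {
  SUBSCRIPTION_EVENT,
  SERVICE_EVENT,
  CLIENT_EVENT,
  WAITABLE_EVENT
};

struct ExecutorEvent {
  const void * entity_id;  // e.g. a raw pointer to the rclcpp entity, or a key into a
                           // map of weak pointers owned by the executor
  ExecutorEventType type;  // tells the executor which execute_*() path to use
};
```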
articles/events_executor.md
Outdated
However, when calling the `execute_subscription()` API, the message that generated event 1 will not be available anymore as it has been overwritten from the message that generated event 3.
The `execute_subscription()` API will not fail, but rather it will process the message that generated event 3.
This violates the ordering as event 2 should have been processed before that.
The wait set and the event queue approach both indicate that there is work to do, and the actual work is grabbed later. The event queue "normally" already has a temporal order and less "walk-through" and sorting than the wait set, but it is also not deterministic, as you describe here.
Determinism would be reached with a work queue, or if the collected work is sorted before being executed in the case of the event queue and the wait set. So determinism would be possible if you pay with some runtime.
I'm wondering if it would be possible to have an intermediate API that provides a container with work to execute. Whether this is then optimized for performance, latency, determinism, or sorted in some other way would be the job of the flexible part of the executor. But I'm not familiar enough with the upper layers of rcl and rclcpp to judge if this makes sense. This maybe goes in the direction of the out of tree proposal @ivanpauno had.
Determinism would be reached with a work queue or if the collected work is sorted
before executed in case of event queue and wait set. So determinism would be possible
if you pay with some runtime.
Determinism could be reached implementing what @alsora proposed above:
The events executor design would allow to attach IDs to the messages/events,
but then the rclcpp layer should also provide an API execute_subscription(Subscription, id)
that will execute only if it's the correct element.
Regarding
I'm wondering if it would be possible to have an intermediate API that provides a container with work to execute.
The issue with a work queue is that a message taken to be processed later could be invalid by the time of processing. If the QoS depth of the queue is 1 (one) and you take the message, but another message arrives before processing it, the stored message in the work queue should not be processed.
Determinism could be reached implementing what @alsora proposed above:
The events executor design would allow to attach IDs to the messages/events, but then the rclcpp layer should also provide an API execute_subscription(Subscription, id) that will execute only if it's the correct element.
That's a pragmatic and efficient solution for a guaranteed temporal order. The middleware needs to provide a sneak preview of the id without taking the message, but that's a solvable problem too.
The issue with a work queue is that a message taken to be processed later could be invalid by the time of processing. If the QoS depth of the queue is 1 (one) and you take the message, but another message arrives before processing it, the stored message in the work queue should not be processed.
I see. Those are the consequences of having two queues that are processed at different times: one in the middleware and one in the executor. The work queue loses the QoS depth; the event queue can fulfil it, with the consequence of outdated NOP events as discussed above.
What I meant with
I'm wondering if it would be possible to have an intermediate API that provides a container with work to execute.
is the question of whether it would make sense to have an executor top layer API that provides the work to do without the need to go back to the middleware again (and the races this entails). This would allow the most determinism, as the executor would then provide a plan that is guaranteed to be executed. But I fear this is too far away from where we are today.
What I meant with ... is the question if it would make sense to have an executor top layer API that provides the work to do without the need to go back to the middleware again (and the races this entails).
We have considered that approach, as it is one step closer to better performance too. But for the sake of simplicity of a first proposal, and taking into account that we are already getting great performance improvements, we decided to leave it as a follow-up improvement for the future. Looking at the CPU profiling flamegraphs we did (using rclcpp IPC), there doesn't seem to be much overhead from "going back to the middleware again", as the message passing occurs only in the rclcpp layer. The approach could have a bigger impact when not using rclcpp IPC.
I also think that rmw/rcl API should be flexible enough so you can implement an event queue/work queue/wait set based executor out of tree, and I think that would be possible.
I'm with @ivanpauno here. Not against the EventsExecutor being part of the core, but in favor of putting the machinery in place for out-of-tree implementations to be the norm.
articles/events_executor.md
Outdated
This may happen before the `EventsExecutor` starts to process those events or while it is processing them.

The current implementation addresses this problem by providing an additional function pointer to each associated entity.
This function pointer will be called whenever the entity's destructor is invoked and it will result in clearing the `EventsQueue` from any event that was generated by this entity. Note that this may block until the `EventsQueue` is done processing events.
If the EventsExecutor does not own entities, what would prevent entity destruction during event processing (and the resulting deadlock)?
The EventsExecutor does not own entities mainly to align with the current ROS 2 design paradigm where there are no APIs for removing/destroying entities.
There should not be a deadlock: the entity destructors and the event queue processing thread all try to lock the same single mutex.
Assume the following: entity A and entity B are being destroyed and they both have events on the queue.
If A acquires the lock first, while holding this lock it can remove its own events from the queue.
Upon release the events executor may acquire the lock and it will execute B events.
Then B may acquire the lock and it will not have anything to remove.
There are definitely other solutions to handle this situation.
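A simplified sketch of that interaction (made-up types, single mutex, not the actual rclcpp implementation):

```cpp
#include <algorithm>
#include <deque>
#include <mutex>

struct Event { const void * entity_id; };

// Both the entity destructors and the executor's processing loop serialize on the
// same mutex, so an entity can prune its own events before it goes away.
class EventsQueue {
public:
  void push(Event e) {
    std::lock_guard<std::mutex> lock(mutex_);
    events_.push_back(e);
  }

  // Called from an entity's destructor: remove every event it generated.
  void remove_events_for(const void * entity_id) {
    std::lock_guard<std::mutex> lock(mutex_);
    events_.erase(
      std::remove_if(
        events_.begin(), events_.end(),
        [entity_id](const Event & e) { return e.entity_id == entity_id; }),
      events_.end());
  }

  // Called from the executor's spin thread.
  bool pop(Event & out) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (events_.empty()) {
      return false;
    }
    out = events_.front();
    events_.pop_front();
    return true;
  }

private:
  std::deque<Event> events_;
  std::mutex mutex_;
};
```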
In the current implementation, mutex prevents entity destruction during event processing:
https://github.com/irobot-ros/rclcpp/blob/irobot/add-events-executor/rclcpp/include/rclcpp/executors/events_executor.hpp#L216
https://github.com/irobot-ros/rclcpp/blob/irobot/add-events-executor/rclcpp/include/rclcpp/executors/events_executor.hpp#L225
In the current implementation, mutex prevents entity destruction during event processing
Hmm, that's true when destruction and execution happen in separate threads. But unless I'm missing something from your rclcpp based prototype (which I might), if any entity associated with the events executor is destroyed within one of the callbacks that consume events while execution_mutex_ is being held, on entity removal it will attempt to lock the execution_mutex_ again. And execution_mutex_ isn't recursive.
Yes, that situation can cause a deadlock and making that mutex recursive should be enough to handle it.
However, again I'm not sure what the use-case for that could be.
In the current design the event types are used only to distinguish between entities.
This means that you would have a regular subscription that, depending on the content of the message it receives, could decide to destroy itself or maybe to destroy another entity?
I think that ROS does not really support this workflow as entities are usually handled with shared ownership.
A different solution that we evaluated consisted in having additional event types to denote that we want to remove an entity from the executor.
However this is not really easy to implement, as ROS lacks several APIs for removing/destroying entities.
tl;dr I agree with @hidmic here:
Whether that's useful or desirable is of no consequence: the implementation must behave in a sane way.
To merge this, we should keep the existing ownership behavior, and therefore preserve the ability to "delete" (or start asynchronously deleting) entities that might be being used by the executor from within a callback.
Specifically, the executor should have weak ownership on the entities associated with it, and strong ownership when using them.
I'm fully conscious that the ownership of entities is a weak point of the current implementation, but IMO is something not related to the EventsExecutor design in particular, but rather to its implementation or to the executors in general.
If you want it to be as much of a "drop-in replacement" as possible (and I think that's a good thing for comparing them and making it easier to integrate) then the EventsExecutor should follow the existing behavior of shared ownership (i.e. user owns, executor has weak ownership). If it really is as decoupled as you say, then changing the executor implementation AND the sharing behavior is changing two things at once, and is generally not desired.
The weak ownership model is not performant, as the "fast path" requires you to lock all the entities every time. I haven't measured this overhead, so maybe it's not really comparable to the other performance issues in ROS, but it's definitely something that eventually needs to be addressed.
I think that should be evaluated and addressed separately from this proposed change in the executor design. It's not clear to me that "but it's definitely something that eventually needs to be addressed." is true.
I would prefer a shared ownership model where additional APIs allow you to easily handle entities (e.g. add/remove from node and add/remove from executor).
I don't believe this is substantially different from the current approach, because it still doesn't let you remove items from within a callback. You'd really only be able to have remove (which doesn't work in a callback) and async_remove/remove_later, which would work in a callback, but that's not any different than releasing ownership of the shared pointer within a callback, imo.
The StaticSingleThreadedExecutor has shared ownership of entities. This model does not allow a callback to destroy other entities, as the executor will always keep 1 reference to them. It actually does not really allow destroying entities at all without first removing the node from the executor.
That's a good point. I see shared ownership of entities in rclcpp::experimental::ExecutableList.
Considering your scenario (a callback that destroys an entity) it could first remove that entity from the executor and then destroy it.
However this would require many more changes to ROS.
I agree with you that we're digging a bit too much into implementation details here, but how so? Again, I could be wrong, but considering that callback groups have weak ownership and that removing an entity from the EventsExecutor effectively drops the push event callback and prunes related events, you just have to ensure you synchronize properly (as mentioned above). It won't trigger an event, sure, but that's a non-issue. Explicitly removing a node is safe too since you use guard conditions. Your design is pretty solid in that sense.
As @wjwwood says, we can discuss ownership later on. I'd much rather get your work reviewed and merged instead of growing the scope right now. With an emphasis on the rmw changes that enable this implementation and (potentially) future ones.
Ok, thank you all for the discussion.
We are updating the rclcpp implementation to have weak ownership of entities.
This will allow us also to measure the difference in performance between the two approaches.
Update: we've improved the weak pointer implementation for the new ownership model. The performance is now much better, using just 5% more CPU than the no-ownership approach (which used raw pointers, thus no lock() needed to get ownership) for the worst case scenario (our biggest topology at 100 Hz on a single core platform). The graph above has been updated to reflect the latest performance, as well as the rclcpp PR ros2/rclcpp#1416, which includes the improvements.
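For reference, the weak-ownership fast path being measured here looks conceptually like the following (made-up types, not the actual rclcpp code); the lock() per event is exactly the extra cost compared to storing raw pointers:

```cpp
#include <memory>
#include <unordered_map>

class FakeSubscription {
public:
  void execute() { /* take the message and run the user callback */ }
};

class WeakOwnershipExecutor {
public:
  void add_subscription(const std::shared_ptr<FakeSubscription> & sub) {
    subscriptions_[sub.get()] = sub;  // store only a weak reference: the user keeps ownership
  }

  // Called when an event carrying this entity id is popped from the queue.
  void execute_subscription(const void * entity_id) {
    auto it = subscriptions_.find(entity_id);
    if (it == subscriptions_.end()) {
      return;
    }
    if (auto sub = it->second.lock()) {  // take temporary strong ownership for the call
      sub->execute();
    }
    // If lock() fails the entity was already destroyed and the event is simply dropped.
  }

private:
  std::unordered_map<const void *, std::weak_ptr<FakeSubscription>> subscriptions_;
};
```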
articles/events_executor.md
Outdated
- `ExecutorEventType::SUBSCRIPTION_EVENT` and an identifier for a `rclcpp::SubscriptionBase` object.
- `ExecutorEventType::SERVICE_EVENT` and an identifier for a `rclcpp::ServiceBase` object.
- `ExecutorEventType::CLIENT_EVENT` and an identifier for a `rclcpp::ClientBase` object.
- `ExecutorEventType::WAITABLE_EVENT` and an identifier for a `rclcpp::Waitable` object.
Thinking out loud, I wonder why we combine polymorphism and type tagging, as opposed to just using polymorphism (i.e. making everything a Waitable).
I agree with that, but it would be quite a big change for ROS, as currently the Waitable and the SubscriptionBase, ServiceBase, etc. are quite distinct from each other.
We really tried to provide a design that allows implementing the new executor without disrupting the current ROS system.
Ok, I've been making progress on the rmw changes related to this. One big thing I did was remove the term […]. I also improved the documentation and implementation for […]. I did a major refactor to […]. I still think we need some tests for […].
This pull request has been mentioned on ROS Discourse. There might be relevant details there: […]
@wjwwood is this work still ongoing?
@adamdbrw Yes, but slowly. There are pull requests implementing needed features for some of what is included here, but they didn't make it into Galactic, and have been mostly idle since, which is my fault. I expect we'll start making progress again soon.
@adamdbrw FYI we have branches where we implemented the whole design and that we are using for our testing/development. The PRs mentioned in the first post implement the first part of the design (i.e. RMW listener APIs).
Hi, we updated the design with some of the latest changes originating from the middleware WG discussions.
Hi, we recently ran into the performance issues in ROS2 and tried the EventsExecutor from the iRobot forks and it worked phenomenally.
FYI, before someone else also does the same thing: We really needed the executor, so I rebased the work by @irobot-ros on the current master. This was mostly […]
In addition, I created packages for Ubuntu 22.04. They are available at the following release repositories: […]
A strange bug I experienced was that I had to revert ros2/rmw_fastrtps#583 because an invalid QoS type error was falsely triggered. I don't know why that happened but will investigate. I hope this helps someone who also needs the executor, or it helps you when it is merged.
We plan to soon create an "events executor" repository that only contains the executor and is compatible with the standard rclcpp library. This will allow ROS 2 Humble users to use the events executor in an easy way. In the meantime, we will also open new PRs to rclcpp to merge this executor into the mainline.
@alsora any updates on the executor repo? :)
@timonegk this PR ros2/rmw_fastrtps#625 fixes the issue you got about an invalid QoS type being falsely triggered. So you don't have to revert the FastRTPS PR ros2/rmw_fastrtps#583.
@mauropasse Thanks, works great!
What are the next steps for this? Is this something that needs to be brought up in the middleware WG?
Hi, thank you everyone for your interest in the events-executor. We plan to release this as a debian package as soon as the known limitations and bugs mentioned in the readme are addressed.
@alsora Thanks for sharing your and your team's amazing work! Is the plan to get this into rclcpp, or to keep it separate in this repo? Once the debians are out, I can imagine people getting confused about having to add a dependency on […], since #include "rclcpp/executors/events_executor/events_executor.hpp" makes it look like it's from […].
Yes, the plan is definitely to merge this into rclcpp; the separate repo and the debian package are just intermediate solutions to let people play with it in the meantime. It's now almost 2 years that we have had a working implementation of the events executor (we first opened a rclcpp PR with it in 2020), but it may take a while before all the pieces and requirements are merged. The reason for having the same paths as […]. We can probably just change the top-level directory from […]. P.S. if you have additional feedback and suggestions, feel free to open a ticket on the events executor repository or code PRs.
@wjwwood @clalancette @ivanpauno this PR has been open for almost 2 years.
As I understand it, this events executor is a single threaded executor. Is there an update to support a multithreaded EventsExecutor with callback groups? Please revert back.
The events executor by design can either be single threaded or multi-threaded. The implementation currently available at https://github.com/irobot-ros/events-executor supports two modes: […]
Contributions are welcome to support an arbitrary number of threads.
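For anyone wanting to try it, usage is expected to mirror the standard rclcpp executors (the include path below is the one mentioned earlier in this thread; check the events-executor repository README for the exact, current API):

```cpp
#include <memory>

#include "rclcpp/rclcpp.hpp"
#include "rclcpp/executors/events_executor/events_executor.hpp"  // from the standalone package

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("events_executor_demo");

  // Assumed to be a drop-in replacement for rclcpp::executors::SingleThreadedExecutor.
  rclcpp::executors::EventsExecutor executor;
  executor.add_node(node);
  executor.spin();

  rclcpp::shutdown();
  return 0;
}
```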
This pull request has been mentioned on ROS Discourse. There might be relevant details there: https://discourse.ros.org/t/questions-about-the-intrinsic-acquisition-of-osrc/28763/15
So... LGTM?
We've recently had a discussion about what to do with this PR, and it's possible we can merge it and immediately mark it as historical. There are a few issues at play: […]
Hi, something strange happened to this PR. |
This PR presents a design for a new type of executor, with the purpose of addressing some of the long-standing issues of the current implementation.
This follows the idea presented in the discourse post
https://discourse.ros.org/t/ros2-middleware-change-proposal/15863
The new executor uses an events queue and a timers manager as opposed to waitsets, to efficiently execute entities with work to do.
Developed by iRobot
Mauro Passerino
Lenny Story
Alberto Soragna
Connects to: