
Events Executor design #305
Merged Jan 8, 2024

Conversation


@irobot-ros irobot-ros commented Oct 21, 2020

This PR presents a design for a new type of executor, with the purpose of addressing some of the long-standing issues of the current implementation.
This follows the idea presented in the discourse post
https://discourse.ros.org/t/ros2-middleware-change-proposal/15863

The new executor uses an events queue and a timers manager as opposed to waitsets, to efficiently execute entities with work to do.
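As a rough illustration of the mechanism (a minimal sketch, not the actual rclcpp API; all names here are hypothetical), entities push a lightweight event when they have work to do, and the executor thread pops and executes events in FIFO order:

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

// An event is just "something to execute", e.g. bound to execute_subscription().
struct Event {
  std::function<void()> execute;
};

class EventsQueue {
public:
  // Called from middleware callbacks when an entity has work to do.
  void push(Event e) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push(std::move(e));
    }
    cv_.notify_one();
  }

  // Called by the executor thread: blocks until an event is available.
  Event pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return !queue_.empty(); });
    Event e = std::move(queue_.front());
    queue_.pop();
    return e;
  }

private:
  std::queue<Event> queue_;
  std::mutex mutex_;
  std::condition_variable cv_;
};

A spin loop then reduces to repeatedly calling pop() and executing the result, with timers handled separately by the timers manager.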

Developed by iRobot
Mauro Passerino
Lenny Story
Alberto Soragna


Connects to:

@ros-discourse

This pull request has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/ros2-middleware-change-proposal/15863/20

@ivanpauno left a comment

Thanks for working on this!
I think that executor performance is something ROS 2 should really improve.

I have left some comments about the event based approach, but my overall feedback is that we should figure out what's a good default executor and only maintain one in rclcpp.
This doesn't mean that the one in rclcpp should be wait set based, I only mean that I would rather have one single threaded executor in rclcpp (which could be event queue based) instead of three.

If we're going to switch to an event based executor, it would be great to have more information to understand what's the real limit of the current wait set based approach (see my other comments).

I also think that rmw/rcl API should be flexible enough so you can implement an event queue/work queue/wait set based executor out of tree, and I think that would be possible.
Based on your experience, do you think it's possible to implement an event based executor out of tree? What additions would be needed on rmw/rcl to make that possible?
Off the top of my head, I think that if rmw API supported attaching listeners to entities, that would make this possible.

Some extra notes about simplifying code in rclcpp:

  • AFAIR the StaticSingleThreadExecutor isn't actually static. It rebuilds the wait set when an entity is added, but it uses a synchronization mechanism instead of always rebuilding it (which is much more efficient).
    Based on that, it sounds like the StaticSingleThreadExecutor is always a better choice than the SingleThreadExecutor, so I think the latter should be completely removed.
  • We also have a rclcpp::WaitSet that we don't use in any of the wait set based executors.

We can identify some major contributors to this overhead:
- The cost of modifying and updating waitsets.
Currently this operation happens multiple times for every iteration of the executor, even if the majority of robotics systems are mostly static, i.e. the number of ROS 2 entities is fixed and known at startup.
A detailed walk-through of the waitset management process can be found [here](https://discourse.ros.org/t/ros2-middleware-change-proposal/15863/6).
I think that one of the things that makes the performance of our current wait set based executors bad is this.

In each iteration, we're collecting entities again and adding them to the rcl wait set, which adds a lot of overhead.
If, instead of "nulling" the entities that aren't ready in the input buffer, we had a separate output buffer, I think that much of the overhead could be avoided. The steps described here as Clear wait set, Add each entity to the rcl wait set, Add each entity to the rmw wait set wouldn't be needed anymore. The only needed step would be to "zero" the output buffer. The output buffer could be a fixed size buffer, to avoid the dynamic allocations that would damage performance.

I think that change would avoid most of the "dance" we currently do to manage waitsets.
Was this approach explored?
I think it's something interesting to explore, in order to understand the real limitation of the current approach.

PS: I think that an event based executor is still a valid approach, but I would like to understand the limitations of the current executor better.


By the way, if you have done any profiling of the current executors in your testing it would be really interesting if you can share the resulting flamegraphs (or any other output).


@mauropasse did a proof of concept where he took the StaticSingleThreadedExecutor and directly accessed the dds wait set from the rclcpp layer, thus removing all the maintenance.
He can give you more details about that.

The performance definitely improved, but not as much as we expected.
To give you an idea, running a 10 node system on a single core platform (Raspberry Pi 1), CPU usage with the StaticSingleThreadedExecutor was approximately 30%.
We first integrated the TimersManager and this improved to 25%.
Then, on top of this, we added direct access to the DDS wait set and the performance improved to 20%.
On the other hand, the same test with the EventsExecutor results in 11% CPU usage.

I want to add that the numbers mentioned above were from an experiment where we were definitely missing some mutexes, additional safety checks and so on, all of which would have had an impact on performance, while the EventsExecutor numbers are from our current implementation, which already includes these.


Hi Ivan. We have explored the "output buffer" approach, using a fixed size buffer which allowed us not to rebuild the waitsets at every iteration. The output buffer was a simple array of ints with the indexes of the ready-to-work entities, matching the ExecutableList ordering (the rmw_wait of rmw_cyclonedds already works with indexes).
So, if the ExecutableList was made of:

rclcpp::subscription 1 // This subscription has an event (msg received)
rclcpp::subscription 2
rclcpp::subscription 3 // This subscription has an event (msg received)

the output buffer array was [1,3], so we had to process subscriptions 1 and 3.
The waitset still has to be walked through on rmw_wait though, to identify entities ready to work.
This approach led to an improvement in CPU usage, but the amount of code changes across multiple layers didn't justify the not-so-great CPU improvement, so we discarded the idea. That's the point at which we decided that if multiple-layer code modifications were to be done, not only should the CPU improvement be noticeable, but we should also fix some of the issues of the current waitset based executors. The "Requirements" section summarizes what we expected from a new design.
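For illustration, such a fixed-size output buffer could look like this (hypothetical names; the real experiment lived inside the rmw layer):

#include <cstddef>

constexpr std::size_t kMaxEntities = 64;  // assumption: system size known at startup

// Filled by the wait call instead of nulling not-ready entities: it records
// the indexes of ready entities, matching the ExecutableList ordering.
struct OutputBuffer {
  std::size_t ready_indexes[kMaxEntities];
  std::size_t count = 0;  // "zeroing" the buffer is just resetting the count
};

void mark_ready(OutputBuffer & out, std::size_t entity_index) {
  if (out.count < kMaxEntities) {
    out.ready_indexes[out.count++] = entity_index;
  }
}

// e.g. subscriptions 1 and 3 ready -> ready_indexes = {1, 3}, count = 2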

Regarding your comment about profiling of the current executors, you can find a PDF executors benchmark.pdf here:
https://discourse.ros.org/t/ros2-middleware-change-proposal/15863/20

Soon we'll find a way to share the benchmark files, as github doesn't allow me to attach them here.

Thanks for your time reviewing this!

@ivanpauno Oct 22, 2020

The waitset still has to be walked-through on rmw_wait though, to identify entities ready to work.

Can you clarify this? I'm not completely sure what you mean.

rmw_wait is currently doing two "walks" through all rmw entities:

  • It's walking through all entities that were added to rmw wait set to get their corresponding DDS conditions, and attach them to the DDS wait set.
  • After waiting, it walks again through all rmw entities to generate the output, comparing the conditions of each rmw object with all conditions that the DDS wait set marked as ready.

If we're not zeroing the rmw wait set in each iteration and re-adding entities, the first "walk-through" won't be needed anymore (you just store a wait set with all the conditions already attached in the rmw object).

The second "walk-through" is still needed, but how it's currently implemented is extremely sub-optimal.
We're walking through all rmw entities again, when we only need to walk through the conditions that the DDS wait set marked as ready. The number of ready conditions will typically be far fewer than the number of rmw entities in the wait set, so iterating over the ready conditions is much better.
e.g.: see Connext example.
A map from DDS conditions to rmw entities will do the rest of the work (in some DDS implementations supporting more modern cpp, you can even attach a handler to a condition).

Disclaimer: Talking about how to improve performance is much easier than actually improving it, so my analysis can be quite wrong.
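For illustration, the ready-conditions map could look like this (hypothetical types standing in for DDS conditions and rmw entities):

#include <unordered_map>
#include <vector>

struct DDSCondition;  // stand-in for a DDS condition
struct RmwEntity;     // stand-in for an rmw subscription/service/client/...

// Built once, when entities are attached to the wait set.
std::unordered_map<const DDSCondition *, RmwEntity *> condition_to_entity;

// After waiting, iterate only the conditions the DDS wait set marked as
// ready (typically far fewer than all the attached entities).
void fill_output(
  const std::vector<const DDSCondition *> & ready_conditions,
  std::vector<RmwEntity *> & output)
{
  for (const DDSCondition * condition : ready_conditions) {
    auto it = condition_to_entity.find(condition);
    if (it != condition_to_entity.end()) {
      output.push_back(it->second);
    }
  }
}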

Current implementations:


Regarding your comment about profiling of the current executors, you can find a PDF executors benchmark.pdf here:
https://discourse.ros.org/t/ros2-middleware-change-proposal/15863/20
Soon we'll find a way to share the benchmark files, as github doesn't allow me to attach them here.

Ah I didn't see the pdf, thanks! I will take a closer look at it.


Can you clarify this? I'm not completely sure what you mean.
rmw_wait is currently doing two "walks" through all rmw entities:

In the current FastRTPS rmw_wait, there could be potentially four (4) whole "walks" through all waitset entities:

  1. To attach the rmw_wait condition variable to all entities (whole waitset). There is a single condition variable shared by all entities.
  2. Go again through all of them and see if any has work to do (potentially the whole waitset, if no entity is ready).
  3. When the condition variable has been triggered, check the predicate, i.e. repeat step 2 until an entity with work to do is found (potentially the whole waitset, if the last guard condition of the waitset is the only one with work to do).
  4. Go through the whole waitset to detach the condition variable and set to 0 the pointers of the entities with no work to do.

As you said, the 1st one won't be needed with an output buffer approach.
The 2nd would still be needed, to run and potentially avoid the wait on the condition variable, as is currently done.
The 3rd one would repeat step 2, as it works as the predicate for the triggered condition variable.
The 4th should set the output buffer (the detach part won't be needed either).

The 2nd, 3rd and 4th could be combined in a single function which takes care of checking for work and at the same time sets the output buffer. In the worst case scenario you need to run this function twice, each time going through the whole waitset.
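A sketch of what that combined function could look like (hypothetical types; the real code would live inside the rmw implementation):

#include <cstddef>
#include <vector>

struct Entity {
  bool has_work = false;  // stand-in for checking the entity's middleware queue
};

// Steps 2-4 combined: check all entities for work and fill the output buffer
// in a single pass. The return value doubles as the condition variable
// predicate, so in the worst case this runs twice per wait.
bool check_and_fill(
  const std::vector<Entity *> & entities,
  std::vector<std::size_t> & output)
{
  output.clear();  // "zero" the output buffer
  for (std::size_t i = 0; i < entities.size(); ++i) {
    if (entities[i]->has_work) {
      output.push_back(i);
    }
  }
  return !output.empty();
}

// Usage against the shared condition variable:
//   cv.wait_for(lock, timeout, [&] { return check_and_fill(entities, output); });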

A map from DDS conditions to rmw entities will do the rest of the work (in some DDS implementations
supporting more modern cpp, you can even attach a handler to a condition).

Currently there is a single DDS condition variable shared by all rmw entities, so this map won't work unless you create a different condition variable for each. Attaching a handler to a condition could allow you to set the output buffer directly where the listener is notified about new data. You then end up with a "queue" of entities with work to do, which is quite similar to what we do in the EventsExecutor, where the handler would be a "push an event into the queue" callback function.


In the current FastRTPS rmw_wait, there could be potentially four (4) whole "walks" through all waitset entities:

Thanks for the detailed explanation.
I focused my previous comment more on the rmw_connext implementation, which is a bit different (though it hides some of the complexity in the DDS waitset implementation).

Then, on top of this, we added the direct access to DDS wait set and the performance improved to 20%.
On the other hand, the same test with EventsExecutor results in 11% CPU usage.

I didn't see @alsora comment before.
I think that's a good comparison to do, bare DDS waitset based performance vs the event based executor.

If CPU usage ends up being the main motivation for the change, I wouldn't like to be comparing a well optimized implementation with a badly optimized one.


Let me try to align with your discussion.

There are some differences in the rmw implementations. Depending on whether a DDS wait set or just a condition variable is used, the number of "walk-throughs" varies, or they happen at different places. I feel that with the waitset approach there are at least these steps:

  1. check all the attached conditions (rmw entities) for work to do. If there is already work, go to 4.
  2. there is no work to do, so wait for work
  3. after waiting, check all the attached conditions (rmw entities) for work to do
  4. fill the output buffer

As you said, all the clearing, attaching and detaching with every wait can be optimized. If the attaching and detaching to an underlying DDS wait set were done in the context of calls like rcl_wait_set_add_subscription(), I'm wondering if rmw_wait() would no longer need the rmw entities as parameters but only the rmw_wait_set (see the sketch below).
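That would reduce the wait call to something like the following hypothetical signature (contrast with the current rmw_wait(), which takes the per-entity arrays and is quoted further below):

#include "rmw/rmw.h"  // for rmw_ret_t, rmw_wait_set_t, rmw_time_t

// Hypothetical: conditions were attached once, when each entity was added to
// the wait set, so waiting needs only the wait set and a timeout.
rmw_ret_t rmw_wait_v2(rmw_wait_set_t * wait_set, const rmw_time_t * wait_timeout);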


I'm wondering if rmw_wait() would no longer need the rmw entities as parameters but only the rmw_wait_set

Yeah, that sounds like it would be the case.

Depending on whether a DDS wait set or just a condition variable is used, the number of "walk-throughs" vary or are at different places

A waitset is likely implemented with a condition variable + mutex, e.g. here and here code partially showing what cyclonedds does.

@ZhenshengLee Jul 20, 2021

@alsora Hi, I am trying to reuse this great work on the event based executor.
But could you show me some code to demonstrate the meaning of "direct access to DDS wait set"?

Because in all the rmw code I have read, the user code actually gets direct access to the dds wait set; it does a static_cast type conversion. For example, see rmw_iceoryx's rmw_wait():

https://github.com/ros2/rmw_iceoryx/blob/foxy/rmw_iceoryx_cpp/src/rmw_wait.cpp

rmw_wait(
  rmw_subscriptions_t * subscriptions,
  rmw_guard_conditions_t * guard_conditions,
  rmw_services_t * services,
  rmw_clients_t * clients,
  rmw_events_t * events,
  rmw_wait_set_t * wait_set,
  const rmw_time_t * wait_timeout)
{
  RCUTILS_CHECK_ARGUMENT_FOR_NULL(subscriptions, RMW_RET_ERROR);
  RCUTILS_CHECK_ARGUMENT_FOR_NULL(guard_conditions, RMW_RET_ERROR);
  RCUTILS_CHECK_ARGUMENT_FOR_NULL(services, RMW_RET_ERROR);
  RCUTILS_CHECK_ARGUMENT_FOR_NULL(clients, RMW_RET_ERROR);
  RCUTILS_CHECK_ARGUMENT_FOR_NULL(events, RMW_RET_ERROR);
  RCUTILS_CHECK_ARGUMENT_FOR_NULL(wait_set, RMW_RET_ERROR);

  RMW_CHECK_TYPE_IDENTIFIERS_MATCH(
    rmw_wait
    : waitset, wait_set->implementation_identifier,
    rmw_get_implementation_identifier(), return RMW_RET_ERROR);

  iox::popo::WaitSet<iox::MAX_NUMBER_OF_ATTACHMENTS_PER_WAITSET> * waitset =
    static_cast<iox::popo::WaitSet<iox::MAX_NUMBER_OF_ATTACHMENTS_PER_WAITSET> *>(wait_set->data);
//...
}

So I am confused: what is the difference between direct access and non-direct access to the dds wait_set?

Thanks for your time.

- Inefficiency of the timers management.

👍
How timers are currently handled adds a lot of overhead, and it also doesn't work well with simulated time (in particular, RTF > 1 isn't working at all AFAIR)

Then the executor can process these events in a FIFO manner, without need for expensive entities look-ups.
Processing an event results in different operations depending on the entity that generated it.

Timers are executed in a separate task by a timers manager, where they are kept in a priority queue sorted according to their expiration time. The task then has to monitor only the first element of the queue and can execute its callback as soon as it expires.
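A minimal sketch of this pattern (hypothetical types, not the proposed TimersManager API): a min-heap ordered by expiration time, and a dedicated thread that sleeps until the head's deadline or until a new timer is added.

#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <vector>

using Clock = std::chrono::steady_clock;

// A periodic timer: fires `callback` every `period`, next due at `deadline`.
struct Timer {
  Clock::time_point deadline;
  std::chrono::nanoseconds period;
  std::function<void()> callback;
};

// Orders the heap so the timer with the earliest deadline is on top.
struct LaterDeadline {
  bool operator()(const Timer & a, const Timer & b) const {
    return a.deadline > b.deadline;
  }
};

class TimersManager {
public:
  void add_timer(Timer t) {
    std::lock_guard<std::mutex> lock(mutex_);
    heap_.push(std::move(t));
    cv_.notify_one();  // wake the thread: the new timer may expire sooner
  }

  // Body of the dedicated timers thread: only ever watches the heap's head.
  void run() {
    std::unique_lock<std::mutex> lock(mutex_);
    while (running_) {
      if (heap_.empty()) {
        cv_.wait(lock);
        continue;
      }
      const Clock::time_point next = heap_.top().deadline;
      if (cv_.wait_until(lock, next) == std::cv_status::timeout) {
        Timer t = heap_.top();
        heap_.pop();
        t.callback();            // a real implementation would drop the lock here
        t.deadline += t.period;  // re-arm the periodic timer
        heap_.push(std::move(t));
      }
      // otherwise a timer was added (or spurious wakeup): re-check the head
    }
  }

  void stop() {
    std::lock_guard<std::mutex> lock(mutex_);
    running_ = false;
    cv_.notify_one();
  }

private:
  std::priority_queue<Timer, std::vector<Timer>, LaterDeadline> heap_;
  std::mutex mutex_;
  std::condition_variable cv_;
  bool running_ = true;
};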

I think this proposal is valid also in the case of the current wait set based executors.


Yes, the TimersManager is on purpose a separate entity from the executor, to allow multiple parts of the system (or even applications themselves) to take advantage of it.


Though it reaches beyond the scope of this PR, bear in mind that eventually the TimersManager class will have to deal with different timers associated with different clocks. Perhaps a queue per clock type? Synchronous API for spin_once() implementation will probably suffer though.


Could you clarify what's the use-case for having timers with different clocks coexist in the same executor?


Could you clarify what's the use-case for having timers with different clocks coexist in the same executor?

The simplest one I can think of is a node that unconditionally requires a timer to be associated with the system clock (e.g. to implement some I/O timeout) but otherwise associates all other timers to the ROS clock for proper operation in simulation whenever RTF != 1.


How do the existing executors handle this situation?
In general I think that mixing timers with different clocks in the same execution context can easily lead to errors.

@hidmic Dec 11, 2020

They don't :). Thus the disclaimer above:

Though it reaches beyond the scope of this PR, bear in mind that eventually the TimersManager class will have to deal with different timers...

It's a known issue that hopefully we can overcome at some point in the future.

- Scalability: The executor must perform well independently of the system size, e.g. the latency for processing a fixed amount of messages must be independent of the number of entities.
This can flow into the "You don't pay for what you don't use" paradigm.
- Extensibility: The executor should set a base from which it is straightforward to add more functionality without impacting performance or requiring heavy modifications of the ROS 2 core libraries.
- Ordering of events: The executor should process events in the same order they have occurred.

This is an interesting point, and in the wait set based approach keeping events ordered is maybe more difficult (I'm not sure if it's even possible or not).

There are still some corner cases in an event based executor, as described at the end of the article and in this comment.


I don't think that an approach with only a wait set can provide correct ordering of events at all.
In a wait set, each entity can only signal if it has work to do or not, but not how many "events" have to be processed.

If you have 2 ready subscriptions, one with 1 message and the other with 2 messages, it will be quite complex to figure out the correct order for processing those messages.
You will definitely need another structure on top of the wait set to order the messages correctly.

Comment on lines 207 to 212
The `EventsQueue` does not have knowledge of QoS settings; this may cause issues in some particular situations.

##### Unbound growth of event queue

As long as events are pushed into the queue, this will keep growing, regardless of the actual history size of the underlying middleware entity.
This may cause several events to accumulate, for example while the `EventsExecutor` is not spinning.

I think that the big advantage of the wait set based executor is that it doesn't suffer from these two issues (that's already handled by the queue of each entity based on its QoS settings).


Yes, the wait set approach does not suffer from the unbound growth issue.
I'm sure it is possible to address that also in an event based executor, but it is definitely not simple.

On the other hand, the second open issue mentioned here (i.e. that in some situations the order may be invalidated) is simply ignored by the wait set approach, where there are no guarantees at all about the order in which entities are processed.


I.e. it could happen that, for a specific entity, there are more events in the executor queue than in the queue of the entity. Would the executor then try to take a message from a subscription and not execute the callback, as it realizes that the take did not return a message?


@budrus it is possible to do that, but it requires additional APIs.
At the moment an executor in the rclcpp layer can only call execute_subscription(Subscription), which will take the first element from the underlying DDS queue of the subscription.
The events executor design would allow attaching IDs to the messages/events, but then the rclcpp layer should also provide an API execute_subscription(Subscription, id) that executes only if it's the correct element.
You can find more details on the problem at the bottom of the design page (i.e. the invalid ordering corner case).
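For illustration, the id-checked API could look roughly like this (entirely hypothetical names; the middleware would need to expose the id of the next message, the "sneak preview" mentioned later in this thread):

#include <cstdint>
#include <memory>

// Hypothetical stand-in for rclcpp::SubscriptionBase, exposing the id of the
// next message in its middleware queue.
struct SubscriptionBase {
  uint64_t next_message_id() const { return next_id; }
  void take_and_execute() { /* take the message, run the user callback */ }
  uint64_t next_id = 0;
};

// Only execute if the event still refers to the message at the head of the
// subscription queue; otherwise the message was overwritten and the event
// is dropped.
void execute_subscription(
  const std::shared_ptr<SubscriptionBase> & sub, uint64_t event_id)
{
  if (sub->next_message_id() != event_id) {
    return;  // stale event (e.g. QoS depth exceeded): skip it
  }
  sub->take_and_execute();
}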


Thanks @alsora. So it could happen that there is an execute_subscription(Subscription) without any action.
That is not an error when we end up here.


I'm surprised statically allocated (bounded) buffers were not a requirement for resource constrained platforms -- often running time-sensitive applications. Large enough circular buffers would probably be OK for most such cases (whatever large enough means for any given application).


alsora commented Oct 22, 2020

@ivanpauno thank you for taking the time to evaluate our proposal.

Here some replies to your comment.

I have left some comments about the event based approach, but my overall feedback is that we should figure out what's a good default executor and only maintain one in rclcpp.
This doesn't mean that the one in rclcpp should be wait set based, I only mean that I would rather have one single threaded executor in rclcpp (which could be event queue based) instead of three.

If we're going to switch to a event based executor, it would be great to have more information to understand what's the real limit of the current wait set based approach (see my other comments).

Yes, the ROS client library should really provide only one executor that works well.
The points that we mentioned in the Requirements section should be the guideline for what a ROS 2 executor should be.
Our idea is that the EventsExecutor is a step forward in that direction.
The PR that we opened implementing this new executor (i.e. ros2/rclcpp#1416) on purpose does not modify any of the existing code, but just adds new APIs.
We think that this will make it easier to merge the changes, and at the same time start the process of carefully evaluating what we have, with the goal of having a single, good executor that works in all use cases for the Galactic release.

I also think that rmw/rcl API should be flexible enough so you can implement an event queue/work queue/wait set based executor out of tree, and I think that would be possible.
Based on your experience, do you think it's possible to implement an event based executor out of tree? What additions would be needed on rmw/rcl to make that possible?
Off the top of my head, I think that if rmw API supported attaching listeners to entities, that would make this possible.

In order to implement the EventsExecutor, we only had to add some boilerplate code in the rmw/rcl layers.
You can have a look at the specific PRs
rcl ros2/rcl#839
rmw ros2/rmw#286
rmw_fastrtps ros2/rmw_fastrtps#468

It's just the same API copy-pasted for all the entities.
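For reference, the shape of these APIs is roughly the following (the exact names evolved during review; see the linked PRs and the later discussion in this thread):

#include <cstddef>

// The middleware invokes this callback when the entity has new work,
// passing back an opaque user argument and the number of new events.
typedef void (* rmw_event_callback_t)(const void * user_data, std::size_t number_of_events);

// One such setter exists per entity type (subscription, service, client,
// waitable, ...), e.g. for subscriptions:
//
//   rmw_ret_t rmw_subscription_set_on_new_message_callback(
//     rmw_subscription_t * subscription,
//     rmw_event_callback_t callback,
//     const void * user_data);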

  • AFAIR the StaticSingleThreadExecutor isn't actually static. It rebuilds the wait set when an entity is added, but it uses a synchronization mechanism instead of always rebuilding it (which is much more efficient).
    Based on that, it sounds like the StaticSingleThreadExecutor is always a better choice than the SingleThreadExecutor, so I think the latter should be completely removed.

  • We also have a rclcpp::WaitSet that we don't use in any of the wait set based executors.

Yes, the StaticSingleThreadedExecutor is not really static and from our experience it always performs better than the SingleThreadedExecutor.
It has some limitations though, as it does not fully support all the executor APIs and the concept of an ExecutableList which keeps ownership of the entities may cause some issues.

From a discussion I had with @wjwwood, I remember that the rclcpp::WaitSet can be configured such that it becomes completely static. That should in theory allow it to match the performance of the StaticSingleThreadedExecutor, but it will not allow adding new entities at all.

However, from our analysis, both these solutions still cause considerable CPU overhead with respect to running your pub/sub application without ROS, directly using your middleware of choice.
On the other hand, the EventsExecutor gets very close to the performance of the middleware.


gbiggs commented Oct 27, 2020

I have left some comments about the event based approach, but my overall feedback is that we should figure out what's a good default executor and only maintain one in rclcpp.
This doesn't mean that the one in rclcpp should be wait set based, I only mean that I would rather have one single threaded executor in rclcpp (which could be event queue based) instead of three.

I strongly disagree with this. There is no reason that a client library should be limited to one executor. We should figure out what our use cases are and ensure that each is covered completely by one executor. If the collection of use cases requires more than one executor to completely cover, then so be it. There is also the possibility that a particular language may provide multiple ways to handle event queues and multi-threading, so one executor for each might make sense in that language's client library.

There should be one default executor, of course. When you call spin() without directly creating an executor, what executor is used inside that is an important choice we must make if we have multiple available.


The data structure thus includes the type of the entity that generated the event and a handle to its corresponding rclcpp object.

Considering the entities currently available in ROS 2, the content of the event data structure can be any of the following:
- `ExecutorEventType::SUBSCRIPTION_EVENT` and an identifier for a `rclcpp::SubscriptionBase` object.

This identifier for a SubscriptionBase is the subscription_handle that is provided with the rmw_subscription_set_events_executor_callback call?

@alsora Nov 3, 2020

Yes, exactly.
In the implementation that we proposed (see the other PRs) that's a raw pointer to the rclcpp subscription itself.
However, anything that allows the executor to get what it needs to call execute_subscription() will do the job: for example, the executor may have a map of weak pointers to subscriptions, and the event may include a key into that map.
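For illustration, the event structure and the map-based identifier could look roughly like this (field and type names are a sketch, not the proposed API; the event type values come from the design document):

#include <map>
#include <memory>

// Event type tags as listed in the design.
enum class ExecutorEventType { SUBSCRIPTION_EVENT, SERVICE_EVENT, CLIENT_EVENT, WAITABLE_EVENT };

// The event only carries a type tag and an identifier for the entity.
struct ExecutorEvent {
  ExecutorEventType type;
  const void * entity_key;  // raw pointer in the proposed PRs, or a map key
};

struct SubscriptionBase { /* rclcpp subscription details elided */ };

// Executor-side lookup: weak pointers, so the executor doesn't own entities.
std::map<const void *, std::weak_ptr<SubscriptionBase>> subscriptions;

void handle_event(const ExecutorEvent & event) {
  if (event.type == ExecutorEventType::SUBSCRIPTION_EVENT) {
    auto it = subscriptions.find(event.entity_key);
    if (it == subscriptions.end()) {
      return;  // entity was removed: drop the stale event
    }
    if (auto sub = it->second.lock()) {
      (void)sub;  // execute_subscription(sub) would run here
    }
  }
}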

However, when calling the `execute_subscription()` API, the message that generated event 1 will not be available anymore as it has been overwritten from the message that generated event 3.
The `execute_subscription()` API will not fail, but rather it will process the message that generated event 3.
This violates the ordering as event 2 should have been processed before that.


The wait set and the event queue approach both indicate that there is work to do, and the actual work is grabbed later. The event queue "normally" already has a temporal order, and fewer "walk-throughs" and less sorting than the wait set, but it is also not deterministic, as you describe here.

Determinism would be reached with a work queue, or if the collected work were sorted before being executed, in the case of both the event queue and the wait set. So determinism would be possible if you pay with some runtime.

I'm wondering if it would be possible to have an intermediate API that provides a container with work to execute. Whether this is then optimized for performance, latency or determinism, or however it is sorted, is the job of the flexible part of the executor. But I'm not familiar enough with the upper layers of rcl and rclcpp to judge if this makes sense. This maybe goes in the direction of the out-of-tree proposal @ivanpauno had.


Determinism would be reached with a work queue or if the collected work is sorted
before executed in case of event queue and wait set. So determinism would be possible
if you pay with some runtime.

Determinism could be reached by implementing what @alsora proposed above:

The events executor design would allow to attach IDs to the messages/events,
but then the rclcpp layer should also provide an API execute_subscription(Subscription, id)
that will execute only if it's the correct element.

Regarding

I'm wondering if it would be possible to have an intermediate API that provides a container with work to execute.

The issue with a work queue is that a message taken to be processed later could be invalid by the time of processing. If the QoS depth of the queue is 1 (one) and you take the message, but another message arrives before processing it, the stored message in the work queue should not be processed.


Determinism could be reached implementing what @alsora proposed above:

The events executor design would allow to attach IDs to the messages/events,
but then the rclcpp layer should also provide an API execute_subscription(Subscription, id)
that will execute only if it's the correct element.

That's a pragmatic and efficient solution for a guaranteed temporal order. The middleware needs to provide a sneak preview of the id without taking the message, but that's a solvable problem too.

The issue with a work queue is that a message taken to be processed later could be invalid by the time of processing. If the QoS depth of the queue is 1 (one) and you take the message, but another message arrives before processing it, the stored message in the work queue should not be processed.

I see. Those are the consequences of having two queues that are processed at different times, one in the middleware and one in the executor. The work queue loses the QoS depth; the event queue can fulfil it, with the consequence of the outdated NOP events discussed above.

What I meant with

I'm wondering if it would be possible to have an intermediate API that provides a container with work to execute.

is the question of whether it would make sense to have an executor top layer API that provides the work to do without the need to go back to the middleware again (and the races this entails). This would allow the most determinism, as the executor would then provide a plan that is guaranteed to be executed. But I fear this is too far away from where we are today.


What I meant with ... is the question if it would make sense to have an executor top layer API that provides the work to do without the need to go back to the middleware again (and the races this entails).

We have considered that approach, as it is one step closer to getting better performance too. But for the sake of simplicity in a first proposal, and taking into account that we are already getting great performance improvements, we decided to leave it as a follow-up improvement for the future. Looking at the CPU profiling flamegraphs we did (using rclcpp IPC), there doesn't seem to be much overhead from "going back to the middleware again", as the message passing occurs only in the rclcpp layer. The approach could have a bigger impact when not using rclcpp IPC.

@hidmic left a comment

I also think that rmw/rcl API should be flexible enough so you can implement an event queue/work queue/wait set based executor out of tree, and I think that would be possible.

I'm with @ivanpauno here. Not against EventsExecutor being part of the core, but in favor of putting the machinery in place for out-of-tree implementations to be the norm.


This may happen before the `EventsExecutor` starts to process those events or while it is processing them.

The current implementation addresses this problem by providing an additional function pointer to each associated entity.
This function pointer will be called whenever the entity's destructor is invoked and it will result in clearing the `EventsQueue` from any event that was generated by this entity. Note that this may block until the `EventsQueue` is done processing events.
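For illustration, the pruning could look roughly like this (hypothetical names), with the entity's destructor invoking remove_events_for() through the stored function pointer:

#include <algorithm>
#include <deque>
#include <mutex>

struct ExecutorEvent {
  const void * entity_key;  // identifies the entity that generated the event
};

class EventsQueue {
public:
  // Called from an entity's destructor: erases all of that entity's pending
  // events. Blocks while the executor holds the mutex to process events.
  void remove_events_for(const void * entity_key) {
    std::lock_guard<std::mutex> lock(mutex_);
    events_.erase(
      std::remove_if(
        events_.begin(), events_.end(),
        [entity_key](const ExecutorEvent & e) { return e.entity_key == entity_key; }),
      events_.end());
  }

private:
  std::deque<ExecutorEvent> events_;
  std::mutex mutex_;
};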

If the EventsExecutor does not own entities, what would prevent entity destruction during event processing (and the resulting deadlock)?


The EventsExecutor does not own entities, mainly to align with the current ROS 2 design paradigm, where there are no APIs for removing/destroying entities.

There should not be a deadlock: the entity destructor and the event queue processing thread will all try to lock the same single mutex.

Assume the following: entity A and entity B are being destroyed and they both have events in the queue.
If A acquires the lock first, while holding the lock it can remove its own events from the queue.
Upon release, the events executor may acquire the lock and execute B's events.
Then B may acquire the lock and will have nothing to remove.

There are definitely other solutions to handle this situation.


In the current implementation, mutex prevents entity destruction during event processing

Hmm, that's true when destruction and execution happen in separate threads. But unless I'm missing something from your rclcpp based prototype (which I might), if any entity associated with the events executor is destroyed within one of the callbacks that consume events while execution_mutex_ is being held, on entity removal it will attempt to lock execution_mutex_ again. And execution_mutex_ isn't recursive.

@alsora Dec 10, 2020

Yes, that situation can cause a deadlock and making that mutex recursive should be enough to handle it.

However, again I'm not sure what the use-case for that could be.
In the current design the event types are used only to distinguish between entities.
This means that you would have a regular subscription that, depending on the content of the message it receives, could decide to destroy itself or maybe to destroy another entity?
I think that ROS does not really support this workflow as entities are usually handled with shared ownership.

A different solution that we evaluated consisted in having additional event types to denote that we want to remove an entity from the executor.
However, this is not easy to implement, as ROS lacks several APIs for removing/destroying entities.


tl;dr I agree with @hidmic here:

Whether that's useful or desirable is of no consequence: the implementation must behave in a sane way.

To merge this, we should keep the existing ownership behavior, and therefore preserve the ability to "delete" (or start asynchronously deleting) entities that might be being used by the executor from within a callback.

Specifically, the executor should have weak ownership on the entities associated with it, and strong ownership when using them.


I'm fully conscious that the ownership of entities is a weak point of the current implementation, but IMO is something not related to the EventsExecutor design in particular, but rather to its implementation or to the executors in general.

If you want it to be as much of a "drop-in replacement" as possible (and I think that's a good thing for comparing them and making it easier to integrate) then the EventsExecutor should follow the existing behavior of shared ownership (i.e. user owns, executor has weak ownership). If it really is as decoupled as you say, then changing the executor implementation AND the sharing behavior is changing two things at once, and is generally not desired.

The weak ownership model is not performant, as the "fast path" requires you to lock all the entities every time. I haven't measured this overhead, so maybe it's not really comparable to the other performance issues in ROS, but it's definitely something that eventually needs to be addressed.

I think that should be evaluated and addressed separately from this proposed change in the executor design. It's not clear to me that "but it's definitely something that eventually needs to be addressed." is true.

I would prefer a shared ownership model where additional APIs allow you to easily handle entities (e.g. add/remove from node and add/remove from executor).

I don't believe this is substantially different from the current approach, because it still doesn't let you remove items from within a callback. You'd really only be able to have remove (which doesn't work in a callback) and async_remove/remove_later, which would work in a callback, but that's not any different from releasing ownership of the shared pointer within a callback, imo.

@hidmic Dec 11, 2020

The StaticSingleThreadedExecutor has shared ownership of entities. This model does not allow a callback to destroy other entities, as the executor will always keep one reference to them. It actually does not allow destroying entities at all without first removing the node from the executor.

That's a good point. I see shared ownership of entities in rclcpp::experimental::ExecutableList.

Considering your scenario (a callback that destroys an entity) it could first remove that entity from the executor and then destroy it.
However this would require many more changes to ROS.

I agree with you that we're digging a bit too much into implementation details here, but how so? Again, I could be wrong, but considering that callback groups have weak ownership and that removing an entity from the EventsExecutor effectively drops the push event callback and prunes related events, you just have to ensure you synchronize properly (as mentioned above). It won't trigger an event, sure, but that's a non-issue. Explicitly removing a node is safe too since you use guard conditions. Your design is pretty solid in that sense.

As @wjwwood says, we can discuss ownership later on. I'd much rather get your work reviewed and merged instead of growing the scope right now. With an emphasis on the rmw changes that enable this implementation and (potentially) future ones.


Ok, thank you all for the discussion.
We are updating the rclcpp implementation to have weak ownership of entities.
This will also allow us to measure the difference in performance between the two approaches.

@mauropasse Dec 16, 2020

[image: benchmark results]

These are the results including the implementation of the EventsExecutor having weak ownership of entities, and strong ownership when using them. We'll soon push these changes to the rclcpp PR.

@mauropasse Jan 15, 2021

Update: We've improved the weak pointer implementation for the new ownership model. The performance is now much better, using just 5% more CPU than the no-ownership approach (which used raw pointers, thus no lock() needed to get ownership) in the worst case scenario (our biggest topology at 100 Hz on a single core platform). The graph above has been updated to reflect the latest performance, and the rclcpp PR ros2/rclcpp#1416 includes the improvements.


- `ExecutorEventType::SUBSCRIPTION_EVENT` and an identifier for a `rclcpp::SubscriptionBase` object.
- `ExecutorEventType::SERVICE_EVENT` and an identifier for a `rclcpp::ServiceBase` object.
- `ExecutorEventType::CLIENT_EVENT` and an identifier for a `rclcpp::ClientBase` object.
- `ExecutorEventType::WAITABLE_EVENT` and an identifier for a `rclcpp::Waitable` object.

Thinking out loud, I wonder why we combine polymorphism and type tagging, as opposed to just using polymorphism (i.e. making everything a Waitable).

@alsora Dec 3, 2020

I agree with that, but it would be quite a big change for ROS, as currently the Waitable and SubscriptionBase, ServiceBase, etc. are very distinct from each other.

We really tried to provide a design that allows implementing the new executor without disrupting the current ROS system.

wjwwood commented Apr 2, 2021

Ok, I've been making progress on the rmw changes related to this. One big thing I did was remove the term listener from these pull requests. My reasoning is that we don't have any concept like "listener" in ROS yet, and we might want it in the future, but for now, how the callbacks are called isn't really important. So I removed those in favor of names like rmw_subscription_set_on_new_message_callback rather than rmw_subscription_set_listener_callback. I'm open to feedback on this, but I think it was a good change.

I also improved the documentation and implementation for rcl.

I did a major refactor to rclcpp, making the callbacks type safe and exception safe. I did some other things too, like dropping the setter function for waitables, but I'll add more notes about that on the PR itself.

I still think we need some tests for rcl and/or rclcpp, and we need an implementation (even a dummy one) for rmw_connextdds. I'm not sure how much more I can get done tonight, but I'll at least run CI on it.

@ros-discourse

This pull request has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/ros2-speed/20162/9

@adamdbrw

@wjwwood is this work still ongoing?

wjwwood commented Jul 1, 2021

@adamdbrw Yes, but slowly. There are pull requests implementing needed features for some of what is included here, but they didn't make it into Galactic, and have been mostly idle since, which is my fault. I expect we'll start making progress again soon.


alsora commented Jul 1, 2021

@adamdbrw FYI we have branches where we implemented the whole design and that we are using for our testing/development.

The PRs mentioned in the first post implement the first part of the design (i.e. RMW listener APIs).
The executor itself can be found in this branch: https://github.com/irobot-ros/rclcpp/tree/galactic-events-executor
This branch already includes the rclcpp PR mentioned in the first post and adds the events executor on top of it.


alsora commented Nov 19, 2021

Hi, we updated the design with some of the latest changes originating from the middleware WG discussions.

@timonegk

Hi, we recently ran into the performance issues in ROS 2, tried the EventsExecutor from the iRobot forks, and it worked phenomenally.
@wjwwood @alsora What is currently blocking the progress of merging the EventsExecutor, especially into rclcpp? As far as I can see, part of the implementation in ros2/rclcpp#1416 was merged in ros2/rclcpp#1579, but the executor itself was not merged and there is not even a pull request for it.
As the current state of the ROS 2 executors is not really usable for high frequency systems, this would be a great improvement, and I/we would be glad to help move this forward wherever we can.

@timonegk

FYI, before someone else also does the same thing: we really needed the executor, so I rebased the work by @irobot-ros onto the current master. This was mostly rclcpp, because most of the other stuff was already merged. I also needed a small number of additional commits for rmw_fastrtps and rcl.
The repositories can be found at

In addition, I created packages for Ubuntu 22.04. They are available at the following release repositories:

A strange bug I experienced was that I had to revert ros2/rmw_fastrtps#583 because an invalid QoS type was falsely triggered. I don't know why that happened, but I will investigate.

I hope this helps someone who also needs the executor, or it helps you when it will be merged.


alsora commented May 25, 2022

We plan to soon create an "events executor" repository that only contains the executor and is compatible with the standard rclcpp library.

This will allow ROS 2 Humble users to use the events executor in an easy way.

In the meantime, we will also open new PRs to merge this executor into rclcpp mainline.

Flova commented Jun 27, 2022

@alsora any updates on the executor repo? :)

@mauropasse

@timonegk this PR ros2/rmw_fastrtps#625 fixes the issue you saw with an invalid QoS type being falsely triggered, so you don't have to revert the FastRTPS PR ros2/rmw_fastrtps#583.

@timonegk

@mauropasse Thanks, works great!

@Aposhian

What are the next steps for this? Is this something that needs to be brought up in the middleware WG?


alsora commented Aug 2, 2022

Hi, thank you everyone for your interest in the events-executor.
You can find here a Github repository containing an implementation of the executor for ROS 2 Humble: https://github.com/irobot-ros/events-executor

We plan to release this as a Debian package as soon as the known limitations and bugs mentioned in the README are addressed.
Contributions and feedback are highly appreciated!

ijnek commented Aug 2, 2022

@alsora Thanks for sharing your and your team's amazing work! Is the plan to get this into rclcpp, or to keep it separate in this repo?

Once the debians are out, I can imagine people getting confused about having to add a dependency on irobot_events_executor even though the way the events_executor is included:

#include "rclcpp/executors/events_executor/events_executor.hpp"

makes it look like it's from rclcpp.


alsora commented Aug 2, 2022

Yes the plan is definitely to merge this into rclcpp, the separate repo and the debian package are just intermediate solutions to let people play with it in the meantime.

We've had a working implementation of the events executor for almost 2 years now (we first opened an rclcpp PR with it in 2020), but it may take a while before all the pieces and requirements are merged.

The reason for having the same paths as rclcpp was to simplify the transition, i.e. we could copy-paste the files into a PR for rclcpp and people that were using the executor from the separate repo wouldn't have to make any change to their source code.
However I see the point for the potential confusion.

We can probably just change the top-level directory from rclcpp to irobot_rclcpp.

P.S. if you have additional feedback or suggestions, feel free to open a ticket on the events executor repository or submit PRs.


alsora commented Aug 23, 2022

@wjwwood @clalancette @ivanpauno this PR has been open for almost 2 years.
Anything to review/modify before merging it?

@sandeepsuresh

As I understand it, this EventsExecutor is a single-threaded executor. Is there an update to support a multithreaded EventsExecutor with callback groups? Please get back to me.


alsora commented Dec 22, 2022

The events executor by design can be either single-threaded or multi-threaded.

The implementation currently available at https://github.com/irobot-ros/events-executor supports two modes:

  • single thread
  • two threads: one thread processes timer callbacks and another processes everything else

Contributions are welcome to support an arbitrary number of threads.

@ros-discourse

This pull request has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/questions-about-the-intrinsic-acquisition-of-osrc/28763/15

russkel commented Mar 9, 2023

So... LGTM?


wjwwood commented Mar 9, 2023

We've recently had a discussion about what to do with this PR, and it's possible we can merge it and immediately mark it as historical. There are a few issues at play:

  • this document is now out-of-date (the events executor has changed)
  • we're trying to move away from design docs on this site and towards REPs and design docs in the code repositories (maybe a version of this should go into rclcpp as we bring the events executor into rclcpp)
  • there are lots of tasks competing for our limited resources, and improvements to the events executor and moving it into rclcpp are higher priority than resolving this long-open PR


alsora commented Feb 5, 2024

Hi, something strange happened to this PR.
It was merged last month, but apparently it doesn't include any changes.
