
Recording CPU load decrease through spin_some() #743

Closed
wants to merge 1 commit into from

Conversation

adamdbrw
Copy link
Collaborator

@adamdbrw adamdbrw commented Apr 20, 2021

The rationale behind this PR is explained here: #737 and here: ros2/rclcpp#1637.

In short, it cuts the CPU load of ros2 bag record from 110% to 55% in a certain use case on a certain platform, without apparent drawbacks. Notably, the use case is quite a representative automotive one, and the platform is also unexceptional. I believe this extrapolates well into reducing CPU use overall when the number of executables (e.g. subscriptions) to process each second is high. Some additional testing may be needed to build more confidence in that broader conclusion.

Update: I have no confidence in the generalized claim anymore, see comment below.

Signed-off-by: Adam Dabrowski <adam.dabrowski@robotec.ai>
@adamdbrw adamdbrw requested a review from a team as a code owner April 20, 2021 08:19
@adamdbrw adamdbrw added enhancement New feature or request performance labels Apr 20, 2021
@emersonknapp
Copy link
Collaborator

I think I would like to understand why this difference is so big before merging it. Is the standard Executor used in spin underperforming in some way that we could easily fix instead? (That being said, since core is frozen and we are not, we could make this change in rosbag2 for Galactic with a TODO note to remove it in favor of a core change.)

Would this difference show in a development machine using the rosbag2_performance benchmarks? The output of that suite would also help make this review easier.

@adamdbrw
Copy link
Collaborator Author

adamdbrw commented Apr 20, 2021

Of course, let me explain why the difference is so big. I tried to do so in the rclcpp issue linked in the description, notably this part:

With ~4k executables per second, when calling wait_for_work, 3.3k calls to rcl_wait are made each second, so mostly only one executable is returned per call, which seems quite inefficient. I am not sure if this is by design (since it perhaps minimizes latency); certainly the collections used to gather a bunch of executables for each rcl_wait call are underused.

When a 1 ms sleep is introduced after we miss the cache (before/after rcl_wait), only ~600 calls to rcl_wait per second are made while successfully executing the same number of callbacks per second.

Basically, this change causes the executables "cache" to actually be used, so that get_next_ready_executable_from_map is used more often in place of rcl_wait, and getting the next executable from the cache is a significantly cheaper operation. The 1 ms sleep allows rcl_wait to put more items into the cache per call, which means it has to be called less often.

Is the standard Executor used in spin underperforming in some way that we can easily fix instead?

The standard executor already offers several ways to process executables: spin_once, spin_some, spin_all, and spin. spin() consumes wait set items as promptly as possible (without any delay), and the other methods have different strengths. So what we are doing here is just using the standard executor in a different, but fully supported, manner.

CPU measurements before and after, exactly the same setup:

(before/after CPU usage screenshots attached in the PR)

The bag file played to get this comparison (playback into recording) contains some non-public messages, and I agree that it would be great to have something that is easy to replicate. I will find some time to do this.

Bag file rough profile: 3.7GB in 452k messages over 100 seconds with about 200 topics.

This should be fully reproducible with benchmarking publishers.

@adamdbrw
Copy link
Collaborator Author

Running benchmarks didn't confirm that this gain is more general, and in some cases (lower executables/sec counts) this patch apparently even increases CPU load. So I am not a fan of this change anymore. I am investigating further: what is so crucially different between these scenarios? These results don't make much sense to me yet.

@emersonknapp
Copy link
Collaborator

Thanks for taking a deep dive into this, it is reassuring to see this level of rigor applied to an important issue.

I would be interested in a profiling result, to see exactly where in the code the 50% CPU usage difference comes from in your sample case - a function profile will probably tell us a lot.

@MichaelOrlov
Copy link
Contributor

@emersonknapp As far as I understand from @christophebedard's analysis, the main overhead in the executor is in get_next_ready_executable: it checks its lists of timers/subscriptions/services/clients/waitables and returns once it has found one that is ready to execute.

Introducing a sleep between spin_some calls just yields CPU resources to other threads instead of having get_next_ready_executable uselessly crawl through its huge lists. However, this is not a "deterministic" approach, and as @adamdbrw mentioned, it can be beneficial for some scenarios and platforms and make performance worse for other configurations.

I don't see an easy fix for this problem now.

While executors have a good abstraction and an easy-to-use API, they carry a lot of internal overhead.
At the moment I anticipate that only using wait sets directly in the recorder, instead of executors, could bring substantial performance improvements.

@emersonknapp
Copy link
Collaborator

Perhaps using the "Events Executor" (ros2/design#305) instead of a waitset-based one would help; supposedly its performance is much better.

@adamdbrw
Copy link
Collaborator Author

I will dive deeper into this.

@adamdbrw adamdbrw marked this pull request as draft April 29, 2021 08:28
@audrow audrow changed the base branch from master to rolling June 28, 2022 14:50
@emersonknapp
Copy link
Collaborator

@adamdbrw is this PR still relevant, or should we close it as stale?

@adamdbrw
Copy link
Collaborator Author

adamdbrw commented Dec 4, 2022

Closing as stale

@adamdbrw adamdbrw closed this Dec 4, 2022
@MichaelOrlov MichaelOrlov deleted the rosbag2_cpu_load_decrease_spin_some branch July 13, 2024 07:08