
Recording CPU load decrease through spin_some() #743

Closed
wants to merge 1 commit into from

Conversation

adamdbrw
Copy link
Collaborator

@adamdbrw adamdbrw commented Apr 20, 2021

The rationale behind this PR is explained here: #737 and here: ros2/rclcpp#1637.

In short, it cuts the CPU load of ros2 bag record from 110% to 55% in a certain use case on a certain platform, without apparent drawbacks. Notably, the use case is quite a representative automotive one, and the platform is also unexceptional. I believe this extrapolates well into reducing CPU use overall when the number of executables (e.g. subscriptions) to process each second is high. Some additional testing may be needed to build more confidence in that broader conclusion.

Update: I have no confidence in the generalized claim anymore, see comment below.

Signed-off-by: Adam Dabrowski <adam.dabrowski@robotec.ai>
@adamdbrw adamdbrw requested a review from a team as a code owner April 20, 2021 08:19
@adamdbrw adamdbrw added enhancement New feature or request performance labels Apr 20, 2021
@emersonknapp
Copy link
Collaborator

I think I would like to understand why this difference is so big before merging it. Is the standard Executor used in spin underperforming in some way that we could easily fix instead? (That being said, since core is frozen and we are not, we could make this change in rosbag2 for Galactic with a TODO note to remove it in favor of a core change.)

Would this difference show in a development machine using the rosbag2_performance benchmarks? The output of that suite would also help make this review easier.

@adamdbrw
Copy link
Collaborator Author

adamdbrw commented Apr 20, 2021

Of course, let me explain why the difference is so big. I tried to do so in the rclcpp issue linked in the description, notably this part:

With ~4k executables per second, when calling wait_for_work, 3.3k calls to rcl_wait are made each second, so mostly only one executable is returned per call, which seems quite inefficient. I am not sure if this is by design (since it perhaps minimizes latency); certainly the collections used to gather a bunch of executables for each rcl_wait call are underused.

When a 1 ms sleep is introduced after we miss the cache (before/after rcl_wait), only ~600 calls to rcl_wait per second are made while successfully executing the same number of callbacks per second.

Basically, this change causes the executables "cache" to actually be used, so that get_next_ready_executable_from_map is used more often in place of rcl_wait, and getting the next executable from the cache is a significantly cheaper operation. The 1 ms sleep allows rcl_wait to put more items into the cache per call, which means it has to be called less often.

Is the standard Executor used in spin underperforming in some way that we can easily fix instead?

The standard executor already offers several ways to process executables: spin_once, spin_some, spin_all, and spin. spin() consumes wait set items as promptly as possible (without any delay), and the other methods have different strengths. So what we are doing here is just using the standard executor in a different, but fully supported, manner.

CPU measurements before and after, exactly the same setup:

(before/after CPU usage screenshots attached in the PR)

The bag file played to get this comparison (playback into recording) contains some non-public messages, and I agree that it would be great to have something that is easy to replicate. I will find some time to do this.

Bag file rough profile: 3.7GB in 452k messages over 100 seconds with about 200 topics.

This should be fully reproducible with benchmarking publishers.

@adamdbrw
Copy link
Collaborator Author

Running benchmarks didn't confirm that this gain is more general, and in some cases (lower executables/sec counts) this patch apparently even increases CPU load. So I am not a fan of this change anymore. I am investigating further: what is so crucially different between these scenarios? These results don't make much sense to me yet.

@emersonknapp
Copy link
Collaborator

Thanks for taking a deep dive into this, it is reassuring to see this level of rigor applied to an important issue.

I would be interested in a profiling result, to see exactly where in the code the 50% CPU usage difference comes from in your sample case - a function profile will probably tell us a lot.

@MichaelOrlov
Copy link
Contributor

@emersonknapp As far as I understand from @christophebedard's analysis, the main overhead in the executor is in get_next_ready_executable: it checks its lists of timers/subscriptions/services/clients/waitables and returns once it has found one that is ready to execute.

Introducing a sleep between spin_some calls just yields CPU resources to other threads instead of having get_next_ready_executable uselessly crawl through its huge lists. However, this is not a "deterministic" approach, and as @adamdbrw mentioned, it can be beneficial for some scenarios and platforms and make performance worse for other configurations.

I don't see an easy fix for this problem now.

While executors have a good abstraction and an easy-to-use API, they carry a lot of internal overhead.
At the moment I anticipate that only using wait sets directly in the recorder, instead of executors, could bring substantial performance improvements.

@emersonknapp
Copy link
Collaborator

Perhaps using the "Events Executor" (ros2/design#305) instead of a waitset-based one would help; supposedly its performance is much better.

@adamdbrw
Copy link
Collaborator Author

I will dive deeper into this.

@adamdbrw adamdbrw marked this pull request as draft April 29, 2021 08:28
@audrow audrow changed the base branch from master to rolling June 28, 2022 14:50
@emersonknapp
Copy link
Collaborator

@adamdbrw is this PR still relevant, or should we close it as stale?

@adamdbrw
Copy link
Collaborator Author

adamdbrw commented Dec 4, 2022

Closing as stale

@adamdbrw adamdbrw closed this Dec 4, 2022
@MichaelOrlov MichaelOrlov deleted the rosbag2_cpu_load_decrease_spin_some branch July 13, 2024 07:08