slow publishing and performance for custom messages with large arrays #346
Just tried swapping in iceoryx for rmw. The publishing rate is comparable to fastrtps, but the topics are not visible, so I cannot get rostopic hz working with it.
This issue has been mentioned on ROS Discourse. There might be relevant details there:
Hi @berndpfrommer, it is always a pleasure when someone who runs into a problem is so careful in explaining exactly what they did, including providing a works-at-first-try reproducer. (Of course it is not a pleasure that you did run into this, please don't misunderstand me!)

There are several aspects to this. I'm pretty sure my observations qualitatively match your observations, but definitely not quantitatively, as I did some experimentation on an Apple M1-based MBP, obviously running macOS instead of Linux. This is with the 5k elements from the copy-pasteable command line.

The first issue is that "ros2 topic hz" really cannot keep up, because it spends 97% of its time copying the data into Python. (This screenshot of my profiler is "focussed" on that point; the stack trace is so deep that it wouldn't otherwise fit on my screen.)

The second bit is that with the subscriber present, the publisher is not writing faster than just over 380 msg/s (similar to your observation):
but with the subscriber not present (and so the data doesn't even go over the network), it only goes up a little bit:
This is because nearly all the time is spent in the serializer:

As you can see, almost all the time is spent in a custom serialiser for ROS messages, part of Cyclone's RMW layer. In short, your message is one of those types where having such a simple serialiser really hurts. The Fast-RTPS RMW layer uses code generation, and I am willing to bet that makes all the difference and that doing the same in Cyclone's RMW layer would fix the problem. But that takes time ...

If it is the same to you, I expect performance will improve a lot if you replace the array-of-structs with a struct-of-arrays. This fits with a recent set of performance measurements we did with https://gitlab.com/ApexAI/performance_test/ on a machine comparable to yours, and with a quick run I did here.

Then some final remarks: raising the socket receive buffer like you did indeed usually helps if the messages become large, but with only 5k elements a single message is a mere 80 kB and that's not going to change anything (it would for the 100k elements). What would help a bit is raising the maximum UDP datagram size that Cyclone uses to almost 64 kB (like Fast-RTPS does by default). Using the Iceoryx integration to make the data available to the subscriber without involving a network is best of course, but at the moment that still has a few requirements that would require some work on your side, primarily that the message must have a fixed size (we'll lift that restriction, but the work on that hasn't been completed yet). It is also important that all processes agree on the representation of the data, so mixing C++ and Python is a problem, but if all your actual application code is in C++ then that requirement is trivially met.
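To illustrate the array-of-structs versus struct-of-arrays suggestion above, here is a minimal sketch of the two layouts as they would appear in generated C++ types. The type and field names are invented for illustration (loosely modelled on event-camera data) and are not the actual ros2_issues definitions; the point is only that the second layout serializes as a handful of contiguous primitive arrays instead of tens of thousands of tiny non-primitive elements.

```cpp
// Illustrative sketch only -- invented names, not the actual message definitions.
#include <cstdint>
#include <vector>

// Array-of-structs: every element is a non-primitive type, so a generic,
// type-description-driven serializer has to process each element individually.
struct ElementLike
{
  uint64_t ts;
  uint16_t x;
  uint16_t y;
  uint8_t  polarity;
};

struct ArrayOfStructsMsg
{
  std::vector<ElementLike> elements;   // 5k .. 100k small elements per message
};

// Struct-of-arrays: the same information as a few primitive-typed arrays,
// each of which can be (de)serialized as one contiguous block.
struct StructOfArraysMsg
{
  std::vector<uint64_t> ts;
  std::vector<uint16_t> x;
  std::vector<uint16_t> y;
  std::vector<uint8_t>  polarity;
};
```

In .msg terms this corresponds to replacing one unbounded array of a custom sub-message with several parallel arrays of primitive types.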
Hi @eboasson, thanks for the insightful reply! Before I forget it... what profiler are you using there? That tool would have helped me a lot. Would I need to recompile the ROS2 comms layers with instrumentation code for this to work? I used valgrind in the past, but often the runtime overhead is so large that the program can no longer be tested in a meaningful way.

I suspected it had to do with the complexity of the messages. You can see from my demo repo that I also tried simpler messages, and I noticed things were better then but still not at ROS1 level. I didn't mention this because I didn't understand it, since I was not aware of the element-by-element marshalling of the messages. I always thought that this would be taken care of by some precompilation step, but today I learned otherwise.

Thank you! Morgan fixed the problem for me (see the discourse link above), so in a sense you can close this issue. Will reply some more on discourse.
That was an interesting discussion on discourse, but I am still glad I put the effort into responding here, too.
That's Apple's profiler. On Linux, I'd use https://github.com/brendangregg/FlameGraph. Both work fine without recompiling anything; you can just profile a release build (well, …)
Given that there's a lot of info on discourse and that the immediate problem has been addressed, I'll close this issue. Don't hesitate to open new ones (or even revive this one if that's appropriate) if you have more questions or — unfortunately — run into problems again.
Other users on discourse have requested that I create an open issue on this, so that the performance issues are addressed or at least documented that way:
This issue has been mentioned on ROS Discourse. There might be relevant details there:
Instead of keeping an issue open for documentation (there might be other reasons to keep it open), we should find somewhere under the ROS 2 docs to put a summary of this: https://github.com/ros2/ros2_documentation. It looks like there's already a page for this: https://docs.ros.org/en/rolling/How-To-Guides/DDS-tuning.html
@christophebedard from what I understand this is not really DDS tuning. It would be a new section: how to work around slow ROS2 serialization? At this point I'm not even sure it's a ROS2 issue or just serialization slowness for all RMWs that I've tested (fastrtps and cyclonedds).
Never mind, I squeezed it under the DDS tuning section for now:
Sorry, my point was mostly about documenting some of the practical tuning tips/workarounds that were mentioned here and over on Discourse. However, I see that most of it is actually already documented for Cyclone DDS on the page I linked to.
I can't comment on whether it can be fixed or not. I assume it can at least be improved, but not right away, so documenting it alongside the other DDS tuning tips is a first step (which you have done, thank you). Perhaps this issue should be renamed to specifically mention "high (de)serialization cost of custom messages with large variable-sized arrays of non-primitive types" and reopened. Then the ROS 2 docs can link to it. I'll let the actual maintainers consider it.
Although it's not a Cyclone-specific issue, from what I understand.
It's also not specific to custom messages; it's true of any message with any non-primitive type over any DDS. There is something seriously, seriously wrong with either the implementation or the design of the serialization for it to perform as badly as it does.
Isn't it simply a case of an array of non-primitive types mapping onto a list of objects in Python, so that you're running into the overhead of creating and deleting tons upon tons of objects? I'm no Python expert and I really don't have the time or knowledge to try to improve the performance of the Python binding of ROS 2, but it is painful to see your disappointment. Thinking out loud: most data tends to be numerical; maybe it'd be possible to map the data to numpy arrays? Would that be difficult?
If we are talking about Python, serialization of large messages can still be slow, depending on the exact types of messages used. You can see some of the bug reports at ros2/rclpy#856, ros2/rclpy#836, and ros2/rclpy#630. The good news is that post-Foxy, we've improved the Python serialization for certain types. But there are still types that are slow, and need someone to look at them.
So we actually do that for certain types already. It's been a while since I looked at this, but I think that a) not all types are changed to numpy types, and b) even for the types that are, we are not always using the types in the most efficient way. Anyway, if we are talking about Python, then the problem probably isn't CycloneDDS specific. If your issue is similar to one of the ones I listed above, I suggest you watch those (or even better, contribute a pull request to fix it up). If your issue is different, please feel free to open a new issue over on https://github.com/ros2/rclpy.
My issues so far have been entirely in C++. From what I can tell, the issue basically boils down to the way serialization in (at least) rmw_cyclonedds_cpp works.

My problem here is that doing high-performance message serialization when you know the messages at compile time is basically a solved problem. Things like Google protobuf came out in 2001. Did no one look at how it (or any competing message serialization library) worked? How this should work is that the codegen step should generate a single function (or possibly several functions) that handles (de)serializing a complete message of a given type. ROS2 doesn't support defining new messages at runtime, so there should be no dynamic runtime message parsing at all.
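To make the codegen argument above concrete, here is a rough sketch of the contrast being described. This is not actual rosidl or rmw_cyclonedds_cpp code; the type, field names, and wire format are simplified assumptions, and a real CDR serializer also has to handle alignment, padding, and endianness, which this ignores.

```cpp
// Simplified sketch, not real rosidl / RMW code.
#include <cstdint>
#include <cstring>
#include <vector>

struct EventLike { uint64_t ts; uint16_t x; uint16_t y; uint8_t polarity; };

// Generic approach (roughly what hurts here): walk a runtime type description
// and dispatch per field, for every one of the 5k..100k elements.
//
// Generated approach: because the full type is known at build time, a code
// generator can emit one flat function per message type, for example:
inline void serialize_events(const std::vector<EventLike> & events,
                             std::vector<uint8_t> & out)
{
  const uint32_t n = static_cast<uint32_t>(events.size());
  out.resize(sizeof(n) + n * sizeof(EventLike));
  std::memcpy(out.data(), &n, sizeof(n));
  // Bulk copy is only valid if the in-memory layout matches the wire layout;
  // a real generated serializer has to verify that (or fall back to a simple
  // per-field loop that the compiler can still inline and unroll).
  std::memcpy(out.data() + sizeof(n), events.data(), n * sizeof(EventLike));
}
```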
This issue has been mentioned on ROS Discourse. There might be relevant details there: https://discourse.ros.org/t/high-cpu-load-for-simple-python-nodes/28324/18 |
I had a similar issue here and found a solution for my environment.
I checked some documents/comments about tuning.
The rmem settings alone did not work, but setting both rmem and wmem works like a charm!
This is a working solution. |
Sorry @daisukes and @jimmyshe, I believe you are taking this issue off-topic. The original issue I opened arises with custom messages and is due to serialization; it could not be fixed with network memory settings of the kind your comments describe. The issues you are mentioning likely have a different underlying root cause.
Bug report
Required Info:
This is what the apt package info says:
ros-galactic-cyclonedds/focal,now 0.8.0-5focal.20210608.002038 amd64 [installed,automatic]
AMD Ryzen 7 4800H with 64GB of memory
Steps to reproduce issue
I have made a very small repo with the demo code below and instructions on how to run it:
https://github.com/berndpfrommer/ros2_issues
Here is the source code for the publisher (it lives in the linked repo; a rough, hypothetical sketch of its structure follows below). It uses the custom message TestArrayComplex, and the TestElement stored in its array is defined alongside it in the same repo.
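For readers who do not want to click through, here is a minimal sketch of what a publisher along these lines typically looks like. The package name, include path, field name, and rate are assumptions for illustration; the actual code and message definitions are in the linked ros2_issues repo.

```cpp
// Hypothetical sketch only -- the real publisher and messages live in the
// linked ros2_issues repo; names and field layout here are assumed.
#include <chrono>
#include <rclcpp/rclcpp.hpp>
#include <ros2_issues/msg/test_array_complex.hpp>  // assumed generated header

using namespace std::chrono_literals;

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("test_publisher");
  auto pub = node->create_publisher<ros2_issues::msg::TestArrayComplex>(
    "test_array_complex", 10);

  ros2_issues::msg::TestArrayComplex msg;
  msg.items.resize(5000);  // 5k non-primitive TestElement entries (assumed field name)

  // Try to publish at 1000 msg/s, mirroring the ROS1 baseline described below.
  auto timer = node->create_wall_timer(1ms, [&]() { pub->publish(msg); });

  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}
```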
Expected behavior
Under ROS1 I can publish 1000 msgs/sec with 100,000 elements per message and receive at a rate of 1000 Hz with rostopic hz.
Actual behavior
Under ROS2 (Galactic), the publishing already fails to keep up at a message size of 5,000 elements. Running the publisher (with the command given in the linked repo) produces this output:

So not even the publishing is at full speed, without any subscriber to the topic. I see the publisher running at 100% CPU, so something is really heavyweight about publishing.
Worse, running rostopic hz shows a rate of about 30 msg/s. This is what I get from rostopic bw; the size of the message (about 80 kB) agrees with what I computed by hand:
I tried

sudo sysctl -w net.core.rmem_max=8388608 net.core.rmem_default=8388608

and also was able to restrict the interface to loopback (lo), but no improvement.

FastRTPS is a bit better; at least here I can send messages with up to 50,000 elements before it falls off at 110,000 messages:
But if I send messages of size 5,000, rostopic hz also shows about 30 Hz, similar to rmw_cyclonedds_cpp.

Additional information
This is a showstopper for porting e.g. a driver for an event-based camera from ROS1 to ROS2; see here: https://github.com/berndpfrommer/metavision_ros_driver.
The hardware is an 8-core AMD Ryzen laptop, less than 1 year old, so definitely not a slow machine, and this is all running on a single machine with no network traffic.
To run the above code it is fastest to clone the very small repo linked above.