slow publishing and performance for custom messages with large arrays #346

Closed
berndpfrommer opened this issue Oct 14, 2021 · 21 comments
@berndpfrommer

Bug report

Required Info:

  • Operating System:
    • Ubuntu 20.04
  • Installation type:
    • ROS2 galactic standard ubuntu package installed via apt
  • Version or commit hash:
    This is what the apt package info says: ros-galactic-cyclonedds/focal,now 0.8.0-5focal.20210608.002038 amd64 [installed,automatic]
  • DDS implementation:
    • cyclonedds
  • Client library (if applicable):
    • rclcpp
  • Hardware:
    AMD Ryzen 7 4800H with 64GB of memory

Steps to reproduce issue

I have made a very small repo with the demo code below and instructions on how to run it:
https://github.com/berndpfrommer/ros2_issues
Here is the source code for the publisher:

#include <unistd.h>

#include <rclcpp/rclcpp.hpp>
#include <ros2_issues/msg/test_array_complex.hpp>
#include <thread>

template <class MsgType>
struct TestPublisher : public rclcpp::Node
{
  explicit TestPublisher(const rclcpp::NodeOptions & options)
  : Node("test_publisher", options)
  {
    pub_ = create_publisher<MsgType>(
      "~/array", declare_parameter<int>("q_size", 1000));
    thread_ = std::thread([this]() {
      rclcpp::Rate rate(declare_parameter<int>("rate", 1000));
      const int numElements = declare_parameter<int>("num_elements", 100);
      rclcpp::Time t_start = now();
      size_t msg_cnt(0);
      const rclcpp::Duration logInterval = rclcpp::Duration::from_seconds(1.0);
      while (rclcpp::ok()) {
        MsgType msg;
        msg.header.stamp = now();
        msg.elements.resize(numElements);
        pub_->publish(msg);
        rate.sleep();
        msg_cnt++;
        rclcpp::Time t = now();
        const rclcpp::Duration dt = t - t_start;
        if (dt > logInterval) {
          RCLCPP_INFO(get_logger(), "pub rate: %8.2f", msg_cnt / dt.seconds());
          t_start = t;
          msg_cnt = 0;
        }
      }
    });
  }
  // -- variables
  typename rclcpp::Publisher<MsgType>::SharedPtr pub_;
  std::thread thread_;
};

int main(int argc, char * argv[])
{
  rclcpp::init(argc, argv);
  auto node =
    std::make_shared<TestPublisher<ros2_issues::msg::TestArrayComplex>>(
      rclcpp::NodeOptions());
  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}

It uses the following custom message for TestArrayComplex:

std_msgs/Header header
# test array of elements
TestElement[] elements

and the TestElement of the array is defined as:

uint16 x
uint16 y
builtin_interfaces/Time ts
bool polarity

Expected behavior

Under ROS1 I can publish 1000 msgs/sec with 100,000 elements per message and receive at a rate of 1000Hz with rostopic hz

Actual behavior

Under ROS2 (galactic), publishing already fails to keep up at a message size of 5,000 elements. Running the publisher with

RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ros2 run ros2_issues publisher_node --ros-args -p num_elements:=5000 -p rate:=1000

produces this output:

634242922.738549 [0] publisher_: using network interface wlo1 (udp/192.168.1.234) selected arbitrarily from: wlo1, virbr0, virbr1, docker0
[INFO] [1634242923.747148904] [test_publisher]: pub rate:   369.34
[INFO] [1634242924.749220049] [test_publisher]: pub rate:   373.22
...

So not even the publishing runs at full speed, without any subscriber to the topic. I see the publisher running at 100% CPU, so something about publishing is really heavyweight.
Worse, running ros2 topic hz shows a rate of about 30 msg/s.

RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ros2 topic hz -w 100 /test_publisher/array
1634243031.116776 [0]       ros2: using network interface wlo1 (udp/192.168.1.234) selected arbitrarily from: wlo1, virbr0, virbr1, docker0
average rate: 31.275
        min: 0.030s max: 0.041s std dev: 0.00314s window: 33
average rate: 30.502
        min: 0.030s max: 0.044s std dev: 0.00351s window: 63

This is what I get from ros2 topic bw. The size of the message (about 80 kB) agrees with what I computed by hand:

5.20 MB/s from 100 messages
        Message size mean: 0.08 MB min: 0.08 MB max: 0.08 MB

I tried sudo sysctl -w net.core.rmem_max=8388608 net.core.rmem_default=8388608 and was also able to restrict the interface to loopback (lo), but saw no improvement.

FastRTPS is a bit better: here I can send messages with up to 50,000 elements at full rate before it falls off, e.g. at 110,000 elements:

RMW_IMPLEMENTATION=rmw_fastrtps_cpp ros2 run ros2_issues publisher_node_complex --ros-args -p num_elements:=110000 -p rate:=1000                                       
[INFO] [1634243372.579736663] [test_publisher]: pub rate:   404.43
[INFO] [1634243373.581322637] [test_publisher]: pub rate:   391.37
[INFO] [1634243374.583210605] [test_publisher]: pub rate:   414.22

But if I send messages with 5,000 elements, ros2 topic hz also shows about 30 Hz, similar to rmw_cyclonedds_cpp.

Additional information

This is a showstopper for porting e.g. a driver for an event-based camera from ROS1 to ROS2, see here: https://github.com/berndpfrommer/metavision_ros_driver.

The hardware is an 8-core AMD Ryzen laptop, less than a year old, so definitely not a slow machine, and this all runs on a single machine with no network traffic.

To run the above code it is fastest to clone the very small repo linked above.

@berndpfrommer
Author

Just tried swapping in iceoryx for the RMW. The publishing rate is comparable to fastrtps, but the topics are not visible, so I cannot get ros2 topic hz working with it.

RMW_IMPLEMENTATION=rmw_iceoryx_cpp ros2 run ros2_issues publisher_node_complex --ros-args -p num_elements:=50000 -p rate:=1000                                          
Log level set to: [Warning]                                                                                                                                                                                        
2021-10-14 17:41:59.120 [Warning]: Requested queue capacity 1000 exceeds the maximum possible one for this subscriber, limiting from 1000 to 256                                                                   
[INFO] [1634247720.124649358] [test_publisher]: pub rate:   291.36                                                                                                                                                 
[INFO] [1634247721.125632766] [test_publisher]: pub rate:   343.66                                                                                                                                                 

@ros-discourse

This issue has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/ros2-speed/20162/18

@eboasson
Collaborator

Hi @berndpfrommer, it is always a pleasure when someone who runs into a problem is so careful in explaining exactly what they did, including providing a works-at-first-try reproducer. (Of course it is not a pleasure that you did run into this, please don't misunderstand me!)

There are several aspects to this. I'm pretty sure my observations qualitatively match yours, but definitely not quantitatively, as I did some experimentation on an Apple M1-based MBP, obviously running macOS instead of Linux. This is with the 5k elements from the copy-pasteable command line.

The first issue is that "ros2 topic hz" really cannot keep up, because it spends 97% of its time copying the data into Python. (This screenshot of my profiler is "focussed" on that point; the stack trace is so deep that it wouldn't otherwise fit on my screen.)
[profiler screenshot: ros2 topic hz spends ~97% of its time copying data into Python]

The second bit is that with the subscriber present, the publisher is not writing faster than just over 380 msg/s (similar to your observation):

[INFO] [1634305581.100708000] [test_publisher]: pub rate:   391.31
[INFO] [1634305582.103234000] [test_publisher]: pub rate:   385.03
[INFO] [1634305583.104760000] [test_publisher]: pub rate:   386.41
[INFO] [1634305584.104873000] [test_publisher]: pub rate:   385.96

but with the subscriber not present (and so the data doesn't even go over the network), it only goes up a little bit:

[INFO] [1634306300.465061000] [test_publisher]: pub rate:   419.97
[INFO] [1634306301.465243000] [test_publisher]: pub rate:   420.92
[INFO] [1634306302.467635000] [test_publisher]: pub rate:   418.00
[INFO] [1634306303.469414000] [test_publisher]: pub rate:   415.26

This is because nearly all the time is spent in the serializer:

[profiler screenshot: publisher time dominated by the rmw_cyclonedds_cpp serializer]

As you can see, almost all the time is spent in a custom serialiser for ROS messages that is part of rmw_cyclonedds_cpp, and which is this costly because it relies on the introspection type support. There is no deep reason for that; it is simply because I didn't know how the ROS-message-to-IDL compilation pipeline was supposed to work at the time — actually I still don't know those details — and also because most of the time the high-rate large messages that I have seen end up being arrays of primitive types (e.g., point clouds tend to be a small header followed by a huge array of bytes). In your case, however, it goes through all that type information, field-by-field, element-by-element. And that adds up ... (That said, it does look to me like there must be some low-hanging fruit there, given that it spends so little time in "put bytes", but you never really know when profiling release builds.)

In short, your message is one of those types where having such a simple serialiser really hurts. The Fast-RTPS RMW layer uses code generation, and I am willing to bet that makes all the difference and that doing the same in Cyclone's RMW layer would fix the problem. But that takes time ...

If it is the same to you, I expect performance will improve a lot if you replace the array-of-structs with a struct-of-arrays. This fits with a recent set of performance measurements we did with https://gitlab.com/ApexAI/performance_test/ on a machine comparable to yours, and with a quick run with ddsperf that says I get 35k msg/s ~ 22Gb/s over the loopback interface for "trivial" 80kB messages without tweaking the configuration.
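To illustrate, a struct-of-arrays variant of the test message could look like the sketch below (hypothetical field names; it carries the same information as TestArrayComplex, with entry i of each array describing element i):

std_msgs/Header header
# parallel arrays, entry i of each array describes element i
uint16[] x
uint16[] y
builtin_interfaces/Time[] ts
bool[] polarity

With this layout, x, y and polarity are arrays of primitives that a serialiser can copy in bulk; only ts still has to be handled element-by-element.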

Then some final remarks: raising the socket receive buffer like you did indeed usually helps if the messages become large, but with only 5k elements, a single message is a mere 80kB and that's not going to change anything (it would for the 100k elements). What would help a bit is raising the maximum UDP datagram size that Cyclone uses to almost 64kB (like Fast-RTPS does by default).
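As a hedged sketch of that last tweak: Cyclone DDS reads its configuration from the CYCLONEDDS_URI environment variable, which can point to an XML file or contain a configuration fragment directly. Something along these lines should raise the datagram size, though the option name and its default should be checked against the configuration documentation of the Cyclone DDS version in use:

CYCLONEDDS_URI='<General><MaxMessageSize>65500B</MaxMessageSize></General>' \
  RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ros2 run ros2_issues publisher_node --ros-args -p num_elements:=5000 -p rate:=1000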

Using the Iceoryx integration to make the data available to the subscriber without involving the network is of course best, but at the moment that still has a few requirements that would require some work on your side, primarily that the message must have a fixed size (we'll lift that restriction, but the work on that hasn't been completed yet). It is also important that all processes agree on the representation of the data, so mixing C++ and Python is a problem, but if all your actual application code is in C++ then that requirement is trivially met.

@berndpfrommer
Author

Hi @eboasson, thanks for the insightful reply!

Before I forget it... what profiler are you using there? That tool would have helped me a lot. Would I need to recompile the ROS2 comms layers with instrumentation code for this to work? I used valgrind in the past but often the runtime overhead is so large that the program can no longer be tested in a meaningful way.

I suspected it had to do with the complexity of the messages. You can see from my demo repo that I also tried simpler messages, and I noticed things were better then but still not at ROS1 level. I didn't mention this because I didn't understand it, since I was not aware of the element-by-element marshalling of the messages. I always thought this would be taken care of by some precompilation step, but today I learned otherwise. Thank you!

Morgan fixed the problem for me (see the discourse link above), so in a sense you can close this issue.
The underlying problem remains though: even as a fairly proficient ROS1 user I was struggling to get stuff to work on ROS2. I think some documentation in the ROS1 -> ROS2 transition guide would be helpful. Also, the fact that ros2 topic hz is so slow that it becomes useless as a monitoring tool for high-frequency data needs to be documented. In hindsight it is a bit embarrassing that I didn't catch this, because I did see it run at 100% CPU, but I didn't think that would be the reason for the rate to drop down to 30 Hz.

Will reply some more on discourse.

@eboasson
Collaborator

That was an interesting discussion on discourse, but I am still glad I put the effort into responding here, too.

Before I forget it... what profiler are you using there? That tool would have helped me a lot. Would I need to recompile the ROS2 comms layers with instrumentation code for this to work? I used valgrind in the past but often the runtime overhead is so large that the program can no longer be tested in a meaningful way.

That's Apple's profiler. On Linux, I'd use https://github.com/brendangregg/FlameGraph. Both work fine without recompiling anything: you can just profile a release build (well, RelWithDebInfo in CMake terms, or else you will get a nice profile but it'll be mostly numbers that are nearly impossible to make sense of).
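A minimal sketch of that FlameGraph workflow on Linux, assuming perf is installed and the FlameGraph repo is cloned into the current directory (package and executable names are taken from the reproducer above):

# build with symbols so the profile is readable
colcon build --packages-select ros2_issues --cmake-args -DCMAKE_BUILD_TYPE=RelWithDebInfo

# sample the publisher with call stacks (stop with Ctrl-C after ~30 s)
perf record -F 99 -g -- ros2 run ros2_issues publisher_node --ros-args -p num_elements:=5000 -p rate:=1000

# fold the stacks and render an interactive SVG
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > publisher_flame.svg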

@eboasson
Collaborator

Given that there's a lot of info on discourse and that the immediate problem has been addressed, I'll close this issue. Don't hesitate to open new ones (or even revive this one if that's appropriate) if you have more questions or — unfortunately — run into problems again.

@berndpfrommer
Author

Other users on discourse have requested that I create an open issue on this so that the performance issues are addressed, or at least documented that way:
https://discourse.ros.org/t/ros2-speed/20162/30?u=bernd_pfrommer
Would you mind simply re-opening this issue? That would save me the work of creating an almost identical one.
Also somebody asked to try iceoryx:
https://discourse.ros.org/t/ros2-speed/20162/31?u=bernd_pfrommer
Would iceoryx make much difference, given that the bottleneck seems to be serializing/deserializing?

@ros-discourse

This issue has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/ros2-speed/20162/32

@christophebedard
Member

christophebedard commented Nov 3, 2021

Instead of keeping an issue open for documentation (there might be other reasons to keep it open), we should find somewhere under the ROS 2 docs to put a summary of this: https://github.com/ros2/ros2_documentation. It looks like there's already a page for this: https://docs.ros.org/en/rolling/How-To-Guides/DDS-tuning.html

@berndpfrommer
Author

@christophebedard from what I understand this is not really DDS tuning. It would be a new section: how to work around slow ROS2 serialization? At this point I'm not even sure whether it's a ROS2 issue or just serialization slowness for all RMWs that I've tested (fastrtps and cyclonedds).
And I agree that for documentation a new issue should be created or, better, a PR with improved docs should be submitted. But my preference would be to have the issue fixed rather than documented further, so for that alone an issue should be kept open. Or is the serialization performance drop between ROS1 and ROS2 unfixable?

@berndpfrommer
Author

Never mind, I squeezed it under the DDS tuning section for now:
ros2/ros2_documentation#2096

@christophebedard
Member

It would be a new section: how to work around slow ROS2 serialization? At this point I'm not even sure it's a ROS2 issue or just serialization slowness for all RMW's that I've tested (fastrtps and cyclonedds).

Sorry, my point was mostly about documenting some of the practical tuning tips/workarounds that were mentioned here and over on Discourse. However, I see that most of it is actually already documented for Cyclone DDS on the page I linked to.

But my preference would be to have the issue fixed rather than document further so for that alone an issue should be kept open. Or is the serialization performance drop between ROS1 and ROS2 unfixable?

I can't comment on whether it can be fixed or not. I assume it can at least be improved, but not right away, so documenting it alongside the other DDS tuning/tips is a first step (which you have done, thank you).

Perhaps this issue should be re-named to specifically mention "high (de)serialization cost of custom messages with large variable-sized arrays of non-primitive types" and re-opened. Then the ROS 2 docs can link to it. I'll let the actual maintainers consider it.

@christophebedard
Member

Perhaps this issue should be re-named to specifically mention "high (de)serialization cost of custom messages with large variable-sized arrays of non-primitive types" and re-opened.

Although it's not a Cyclone-specific issue from what I understand.

@fake-name

fake-name commented Jun 22, 2022

"high (de)serialization cost of custom messages with large variable-sized arrays of non-primitive types"

It's also not specific to custom messages; it's true of any message with any non-primitive type over any DDS.

There is something seriously, seriously wrong with either the implementation or the design of the serialization for it to perform as badly as it does.

@eboasson
Collaborator

Isn't it simply a case of an array of non-primitive types mapping onto a list of objects in Python, so that you're running into the overhead of creating and deleting tons upon tons of objects?

I'm no Python expert and I really don't have the time or knowledge to try to improve the performance of the Python binding of ROS 2, but it is painful to see your disappointment. Thinking out loud: most data types tend to be numerical; maybe it'd be possible to map the data to numpy arrays? Would that be difficult?

@clalancette
Contributor

If we are talking about Python, serialization of large messages can still be slow, depending on the exact types of messages used. You can see some of the bug reports at ros2/rclpy#856, ros2/rclpy#836, and ros2/rclpy#630.

The good news is that post-Foxy, we've improved the Python serialization for certain types. But there are still types that are slow, and need someone to look at them.

Thinking out loud: most data types tend to be numerical, maybe it'd be possible to map the data to numpy arrays? Would that be difficult?

So we actually do that for certain types already. It's been a while since I looked at this, but I think that a) not all types are changed to numpy types, and b) even for the types that are, we are not always using the types in the most efficient way.

Anyway, if we are talking about Python, then the problem probably isn't CycloneDDS specific. If your issue is similar to one of the ones I listed above, I suggest you watch those (or even better, contribute a pull request to fix it up). If your issue is different, please feel free to open a new issue over on https://github.com/ros2/rclpy .

@fake-name

fake-name commented Jun 25, 2022

My issues so far have been entirely in C++.

From what I can tell, the issue basically boils down to serialization in (at least) rmw_cyclonedds being an extremely naïve implementation. For every single non-POD type, you incur a hashmap lookup and multiple function calls of indirection.

My problem here is that high-performance message serialization, when you know the messages at compile time, is basically a solved problem. Things like Google protobuf have been around since 2001. Did no one look at how it (or any competing message serialization library) worked?

How this should work is that the codegen step should generate a single function (or possibly several functions) that handles (de)serializing a complete message of <type> between the wire format and a C++/Python data structure. The entire dynamic-lookup parsing disaster in the middleware is a mistake, and it also has the side effect of preventing things like compiler inlining of functions.

ROS2 doesn't support defining new messages at runtime, so there should be no dynamic runtime message parsing at all.
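To make the contrast concrete, here is a minimal, self-contained C++ sketch (not the actual rmw_cyclonedds_cpp code; all names are made up, and CDR details such as alignment and endianness are ignored) of a table-driven serializer that dispatches on runtime field descriptors versus the kind of flat, inlinable function a codegen step could emit for a TestElement-like struct:

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

struct Time { int32_t sec; uint32_t nanosec; };
struct TestElement { uint16_t x; uint16_t y; Time ts; bool polarity; };

// approach 1: introspection-style, driven by a runtime table of field descriptors
enum class FieldType { U16, TIME, BOOL };
struct FieldDescriptor { FieldType type; size_t offset; };

static const FieldDescriptor kTestElementFields[] = {
  {FieldType::U16, offsetof(TestElement, x)},
  {FieldType::U16, offsetof(TestElement, y)},
  {FieldType::TIME, offsetof(TestElement, ts)},
  {FieldType::BOOL, offsetof(TestElement, polarity)},
};

void serialize_introspection(const TestElement * elems, size_t n, std::vector<uint8_t> & out)
{
  for (size_t i = 0; i < n; i++) {
    const uint8_t * base = reinterpret_cast<const uint8_t *>(&elems[i]);
    // per-field dispatch happens for every element of every message
    for (const FieldDescriptor & f : kTestElementFields) {
      switch (f.type) {
        case FieldType::U16:
          out.insert(out.end(), base + f.offset, base + f.offset + sizeof(uint16_t));
          break;
        case FieldType::TIME:
          out.insert(out.end(), base + f.offset, base + f.offset + sizeof(Time));
          break;
        case FieldType::BOOL:
          out.insert(out.end(), base + f.offset, base + f.offset + sizeof(bool));
          break;
      }
    }
  }
}

// approach 2: what a codegen step could emit: one flat function the compiler can optimize
void serialize_generated(const TestElement * elems, size_t n, std::vector<uint8_t> & out)
{
  for (size_t i = 0; i < n; i++) {
    const TestElement & e = elems[i];
    const uint8_t * p = reinterpret_cast<const uint8_t *>(&e.x);
    out.insert(out.end(), p, p + sizeof(e.x));
    p = reinterpret_cast<const uint8_t *>(&e.y);
    out.insert(out.end(), p, p + sizeof(e.y));
    p = reinterpret_cast<const uint8_t *>(&e.ts);
    out.insert(out.end(), p, p + sizeof(e.ts));
    p = reinterpret_cast<const uint8_t *>(&e.polarity);
    out.insert(out.end(), p, p + sizeof(e.polarity));
  }
}

int main()
{
  std::vector<TestElement> elems(5000);
  std::vector<uint8_t> buf1, buf2;
  serialize_introspection(elems.data(), elems.size(), buf1);
  serialize_generated(elems.data(), elems.size(), buf2);
  std::printf("introspection: %zu bytes, generated: %zu bytes\n", buf1.size(), buf2.size());
  return 0;
}

Both functions produce the same bytes here; the point is only that the second form removes the per-field, per-element dispatch and gives the compiler something it can inline and optimize.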

@ros-discourse

This issue has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/high-cpu-load-for-simple-python-nodes/28324/18

@daisukes

I had a similar issue here and found a solution for my environment.

  • humble, navigation2, cyclone dds
  • a large static map (3000 x 2000); the global costmap always publishes the full map (1 Hz)
  • rviz2 and ros2 bag record subscribe to the global costmap, velodyne_points and so on
  • velodyne_packets comes in at 10 Hz, but velodyne_points arrives at around 6 Hz and delayed

I checked some documents/comments about tuning.

The rmem settings alone did not work, but setting both rmem and wmem works like a charm!
In my case 16 MB did not fix the issue, but 64 MB works well.

sudo sysctl -w net.core.rmem_max=67108864 net.core.rmem_default=67108864
sudo sysctl -w net.core.wmem_max=67108864 net.core.wmem_default=67108864
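If these values help, they can be made persistent across reboots with a sysctl drop-in (a sketch; the file name is arbitrary):

# /etc/sysctl.d/60-ros2-network-buffers.conf
net.core.rmem_max=67108864
net.core.rmem_default=67108864
net.core.wmem_max=67108864
net.core.wmem_default=67108864

# apply without rebooting
sudo sysctl --system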

@jimmyshe

sudo sysctl -w net.core.rmem_max=67108864 net.core.rmem_default=67108864
sudo sysctl -w net.core.wmem_max=67108864 net.core.wmem_default=67108864

This is a working solution.
I tested it on Ubuntu 22.04 with Humble and Cyclone DDS.

@berndpfrommer
Author

Sorry @daisukes and @jimmyshe, I believe you are taking this issue off-topic. The original issue I opened arises with custom messages, is due to serialization, and cannot be fixed with network memory settings as your comments suggest. The issues you are mentioning likely have a different underlying root cause.
