Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Publishing large data is 30x-100x slower than for rclcpp #763

Closed
karl-schulz opened this issue Apr 8, 2021 · 12 comments
Closed

Publishing large data is 30x-100x slower than for rclcpp #763

karl-schulz opened this issue Apr 8, 2021 · 12 comments
Assignees

Comments

@karl-schulz
Copy link

karl-schulz commented Apr 8, 2021

TL;DR: Publishing a 10MB Pointcloud takes only 2.8 ms in rclypp but 92 ms in rclpy.

We observed significant latency problems when publishing large data like images or pointclouds via rclpy. Investigating the problem, we found that publishing large data of any kind is 30-100x slower in Python than in C++.

We found the Python Publisher.publish(msg) call takes especially long compared to rclcpp. Publishing a 640x480 Pointcloud of size ~10MB with 15 FPS is not possible with rclpy even on our quite powerful machine (Ryzen 3900X), let alone our robot (Jetson Xavier NX). The time taken for both seems to be about linear with the data size.

Reproduction: https://github.com/karl-schulz/ros2_latency

Bug report

  • Operating System:
    • Ubuntu 20.04 and Ubuntu 18.04
  • Installation type:
    • Both binaries and built from source
  • Version or commit hash:
    • (like in current binaries / rclpy foxy branch)
  • DDS implementation:
    • Both Fast-RTPS and CycloneDDS

Steps to reproduce issue

I created a package for easy reproduction:
https://github.com/karl-schulz/ros2_latency

Basically, it consists of:

  • A node source publishing a random pointcloud of X MB size, X is controllable over a parameter
  • Two repeaters for C++ and Python, respectively, which
    • Copy the pointcloud
    • Update the timestamp
    • Re-publish it on different topics
  • A node measure which subscribes to both topics and compares the timestamps with the current ROS time

Expected behavior

  • rclpy and rclcpp being close in performance, with slightly higher latency for rclpy
  • Both rclpy and rclcpp being able topublish a Pointcloud of a 640x480 RGB-D image with at least 30 FPS on a fast machine

Actual behavior

  • Even on a very fast desktop machine, rclpy takes ~100ms for publishing 10MB of pointcloud data
  • For comparison, rclcpp take only between 1/30 and 1/100 of the time
  • Using rclpy with high FPS is not possible

Additional information

Tested on:

  • A desktop PC with a Ryzen 3800X and Ubuntu 20
  • Another machine with a Rytzen 3700X and Ubuntu 18 with ROS2 Foxy built from source
  • A NVIDIA Jetson Xavier NX L4T (Ubuntu 18) and ROS2 Foxy built from source

Tried Solutions

We were pretty desperate and already tried:

  • Switching the RWM from FastRTPS to CycloneDDS
    (however the good performance on rclcpp indicates that it is not a middleware problem)
  • Tuning CycloneDDS with a custom config XML and a higher system package size, like described here:
    https://index.ros.org/doc/ros2/Troubleshooting/DDS-tuning/
    (but the same rationale applies)
  • Using PYTHONOPTIMIZE=0
  • Different QOS settings: "best effort"/"reliable", depth=1/depth=10
  • Publishing different message types, e.g. Image messages instead of Pointloud2

This problem is actually really critical for our application. In the worst case: Is this expected behavior? This would make the ROS2 Python API useless for high-framerate or low-latency image pipelines.

Help would be really appreciated!

@karl-schulz karl-schulz changed the title Publishing large data is 50x-100x slower than for rclcpp Publishing large data is 30x-100x slower than for rclcpp Apr 8, 2021
@sloretz sloretz self-assigned this Apr 9, 2021
@karl-schulz
Copy link
Author

@sloretz
Copy link
Contributor

sloretz commented Apr 9, 2021

Thanks for the report. I'm able to reproduce with the latest foxy binaries and your example. I can reproduce it in this smaller example too.

Just python MRE
#!/usr/bin/python3
import array
import builtin_interfaces.msg
import rclpy
from rclpy.node import Node
from rclpy.qos import QoSProfile
from sensor_msgs.msg import PointCloud2
from sensor_msgs.msg import PointField


class SelfCycler(Node):
    """Publish Point cloud to itself endlessly."""
    def __init__(self):
        super().__init__("self_cycle_py")

        pub_qos = QoSProfile(depth=10)
        sub_qos = QoSProfile(depth=10)
        self._pc_sub = self.create_subscription(PointCloud2, topic="pc_py", callback=self._handle_pc, qos_profile=sub_qos)
        self._pc_pub = self.create_publisher(PointCloud2, topic="pc_py", qos_profile=pub_qos)
        self._timer = self.create_timer(1.0, self.publish_msg)

    def publish_msg(self):
        msg = self.create_pc_msg()
        t1 = self.get_clock().now().to_msg()
        self._pc_pub.publish(msg)
        t2 = self.get_clock().now().to_msg()
        self.get_logger().info(f"publishing took {self.format_dt(t1, t2)}")

    def create_pc_msg(self):
        msg = PointCloud2()
        msg.header.stamp = self.get_clock().now().to_msg()
        msg.header.frame_id = "base_link"

        # 10 MB
        msg.height = 1
        msg.width = 327680

        x_field = PointField()
        x_field.name = "x"
        x_field.offset = 0
        x_field.datatype = PointField.FLOAT32
        x_field.count = 1

        y_field = PointField()
        y_field.name = "y"
        y_field.offset = 4
        y_field.datatype = PointField.FLOAT32
        y_field.count = 1

        z_field = PointField()
        z_field.name = "z"
        z_field.offset = 8
        z_field.datatype = PointField.FLOAT32
        z_field.count = 1

        rgb_field = PointField()
        rgb_field.name = "rgb"
        rgb_field.offset = 16
        rgb_field.datatype = PointField.FLOAT32
        rgb_field.count = 1
        msg.fields = [x_field, y_field, z_field, rgb_field]

        msg.is_bigendian = False
        msg.point_step = 32
        msg.row_step = 10485760
        msg.data = array.array('B', [123 for _ in range(10485760)])
        msg.is_dense = True

        return msg

    def format_dt(self, t1: builtin_interfaces.msg.Time, t2: builtin_interfaces.msg.Time):
        """ Helper for formatting the difference between two stamps in microseconds """
        us1 = t1.sec * 1e6 + t1.nanosec // 1e3
        us2 = t2.sec * 1e6 + t2.nanosec // 1e3
        return f"{int(us2 - us1):5} [us]"

    def _handle_pc(self, pc: PointCloud2):
        """ Callback for the PC subscriber. """
        self.get_logger().info("Got point cloud!")


def main(args=None):
    rclpy.init()
    node = SelfCycler()
    try:
        rclpy.spin(node)
    finally:
        node.destroy_node()


if __name__ == '__main__':
    main()

Command output using latest Foxy debians on Ubuntu Focal

$ python3 self_cycle.py
[INFO] [1617994828.072710265] [self_cycle_py]: publishing took 85728 [us]
[INFO] [1617994828.081663188] [self_cycle_py]: Got point cloud!
[INFO] [1617994829.062116464] [self_cycle_py]: publishing took 80831 [us]
[INFO] [1617994829.073127008] [self_cycle_py]: Got point cloud!
[INFO] [1617994830.057854277] [self_cycle_py]: publishing took 82257 [us]
[INFO] [1617994830.069011414] [self_cycle_py]: Got point cloud!
[INFO] [1617994831.058229997] [self_cycle_py]: publishing took 83989 [us]
[INFO] [1617994831.069128751] [self_cycle_py]: Got point cloud!

@clalancette
Copy link
Contributor

@sloretz Out of curiosity, how does Rolling/Galactic fair?

@sloretz
Copy link
Contributor

sloretz commented Apr 9, 2021

Output using cProfile

#...
    def publish_msg(self):
        msg = self.create_pc_msg()

        with cProfile.Profile() as pr:
            t1 = self.get_clock().now().to_msg()
            self._pc_pub.publish(msg)
            t2 = self.get_clock().now().to_msg()
            pr.print_stats()
    
        self.get_logger().info(f"publishing took {self.format_dt(t1, t2)}")
        sys.exit()
#...
cProfile output
         783 function calls in 0.082 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:103(release)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:143(__init__)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:147(__enter__)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:151(__exit__)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:157(_get_module_lock)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:176(cb)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
       22    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:222(_verbose_message)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:342(__init__)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:35(_new_module)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:376(cached)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:389(parent)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:397(has_location)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:477(_init_module_attrs)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:549(module_from_spec)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:58(__init__)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:650(_load_unlocked)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:725(find_spec)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:78(acquire)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:800(find_spec)
        3    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:863(__enter__)
        3    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:867(__exit__)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:890(_find_spec)
        1    0.000    0.000    0.001    0.001 <frozen importlib._bootstrap>:956(_find_and_load_unlocked)
        1    0.000    0.000    0.001    0.001 <frozen importlib._bootstrap>:986(_find_and_load)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:1010(path_stats)
        5    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:1265(_path_importer_cache)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:1302(_get_spec)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:1334(find_spec)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:1426(_get_spec)
        4    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:1431(find_spec)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:294(cache_from_source)
        4    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:40(_relax_case)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:424(_get_cached)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:456(_check_name_wrapper)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:493(_classify_pyc)
        3    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:51(_unpack_uint32)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:526(_validate_timestamp_pyc)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:578(_compile_bytecode)
       21    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:62(_path_join)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:629(spec_from_file_location)
       21    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:64(<listcomp>)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:68(_path_split)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:774(create_module)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:777(exec_module)
        6    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:80(_path_stat)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:849(get_code)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:90(_path_is_mode_type)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:939(__init__)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:964(get_filename)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:969(get_data)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:99(_path_isfile)
        1    0.000    0.000    0.000    0.000 _header.py:122(stamp)
        1    0.000    0.000    0.000    0.000 _header.py:136(frame_id)
        1    0.000    0.000    0.000    0.000 _point_cloud2.py:171(header)
        1    0.000    0.000    0.000    0.000 _point_cloud2.py:185(height)
        1    0.000    0.000    0.000    0.000 _point_cloud2.py:200(width)
        1    0.000    0.000    0.000    0.000 _point_cloud2.py:215(fields)
        1    0.000    0.000    0.000    0.000 _point_cloud2.py:239(is_bigendian)
        1    0.000    0.000    0.000    0.000 _point_cloud2.py:252(point_step)
        1    0.000    0.000    0.000    0.000 _point_cloud2.py:267(row_step)
        1    0.000    0.000    0.000    0.000 _point_cloud2.py:282(data)
        1    0.000    0.000    0.000    0.000 _point_cloud2.py:310(is_dense)
        4    0.000    0.000    0.000    0.000 _point_field.py:197(name)
        4    0.000    0.000    0.000    0.000 _point_field.py:210(offset)
        4    0.000    0.000    0.000    0.000 _point_field.py:225(datatype)
        4    0.000    0.000    0.000    0.000 _point_field.py:240(count)
        1    0.000    0.000    0.000    0.000 _time.py:117(sec)
        2    0.000    0.000    0.000    0.000 _time.py:122(sec)
        1    0.000    0.000    0.000    0.000 _time.py:132(nanosec)
        2    0.000    0.000    0.000    0.000 _time.py:137(nanosec)
        2    0.000    0.000    0.000    0.000 _time.py:70(__init__)
        6    0.000    0.000    0.000    0.000 _time.py:71(<genexpr>)
        1    0.000    0.000    0.001    0.001 cProfile.py:40(print_stats)
        1    0.000    0.000    0.000    0.000 cProfile.py:50(create_stats)
        2    0.000    0.000    0.000    0.000 clock.py:134(clock_type)
        2    0.000    0.000    0.000    0.000 clock.py:138(handle)
        2    0.000    0.000    0.000    0.000 clock.py:145(now)
        9    0.000    0.000    0.000    0.000 enum.py:12(_is_descriptor)
        1    0.000    0.000    0.000    0.000 enum.py:124(__prepare__)
        1    0.000    0.000    0.000    0.000 enum.py:134(__new__)
        1    0.000    0.000    0.000    0.000 enum.py:151(<dictcomp>)
        1    0.000    0.000    0.000    0.000 enum.py:176(<setcomp>)
        3    0.000    0.000    0.000    0.000 enum.py:197(<genexpr>)
       13    0.000    0.000    0.000    0.000 enum.py:20(_is_dunder)
       14    0.000    0.000    0.000    0.000 enum.py:28(_is_sunder)
       18    0.000    0.000    0.000    0.000 enum.py:373(__setattr__)
        2    0.000    0.000    0.000    0.000 enum.py:478(_get_mixins_)
        2    0.000    0.000    0.000    0.000 enum.py:489(_find_data_type)
        1    0.000    0.000    0.000    0.000 enum.py:510(_find_new_)
        1    0.000    0.000    0.000    0.000 enum.py:58(__init__)
       14    0.000    0.000    0.000    0.000 enum.py:65(__setitem__)
        3    0.000    0.000    0.000    0.000 handle.py:110(_return_capsule)
        3    0.000    0.000    0.000    0.000 handle.py:123(__enter__)
        3    0.000    0.000    0.000    0.000 handle.py:126(__exit__)
        3    0.000    0.000    0.000    0.000 handle.py:97(_get_capsule)
        2    0.000    0.000    0.000    0.000 node.py:313(get_clock)
        1    0.000    0.000    0.000    0.000 pstats.py:1(<module>)
        1    0.000    0.000    0.000    0.000 pstats.py:118(load_stats)
        1    0.000    0.000    0.000    0.000 pstats.py:34(SortKey)
        9    0.000    0.000    0.000    0.000 pstats.py:45(__new__)
        1    0.000    0.000    0.000    0.000 pstats.py:468(TupleComp)
        1    0.000    0.000    0.000    0.000 pstats.py:55(Stats)
        1    0.000    0.000    0.000    0.000 pstats.py:89(__init__)
        1    0.000    0.000    0.000    0.000 pstats.py:99(init)
        1    0.000    0.000    0.081    0.081 publisher.py:60(publish)
        1    0.000    0.000    0.000    0.000 publisher.py:86(handle)
        2    0.000    0.000    0.000    0.000 time.py:138(to_msg)
        2    0.000    0.000    0.000    0.000 time.py:26(__init__)
        2    0.000    0.000    0.000    0.000 time.py:42(nanoseconds)
        2    0.000    0.000    0.000    0.000 time.py:46(seconds_nanoseconds)
       10    0.000    0.000    0.000    0.000 {built-in method __new__ of type object at 0x907780}
        1    0.000    0.000    0.000    0.000 {built-in method _imp._fix_co_filename}
        5    0.000    0.000    0.000    0.000 {built-in method _imp.acquire_lock}
        1    0.000    0.000    0.000    0.000 {built-in method _imp.is_builtin}
        1    0.000    0.000    0.000    0.000 {built-in method _imp.is_frozen}
        5    0.000    0.000    0.000    0.000 {built-in method _imp.release_lock}
        2    0.000    0.000    0.000    0.000 {built-in method _thread.allocate_lock}
        2    0.000    0.000    0.000    0.000 {built-in method _thread.get_ident}
        3    0.000    0.000    0.000    0.000 {built-in method builtins.__build_class__}
        2    0.000    0.000    0.000    0.000 {built-in method builtins.all}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.any}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.exec}
       19    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
       43    0.000    0.000    0.000    0.000 {built-in method builtins.hasattr}
      153    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
        4    0.000    0.000    0.000    0.000 {built-in method builtins.issubclass}
       34    0.000    0.000    0.000    0.000 {built-in method builtins.len}
       13    0.000    0.000    0.000    0.000 {built-in method builtins.setattr}
        3    0.000    0.000    0.000    0.000 {built-in method from_bytes}
        1    0.000    0.000    0.000    0.000 {built-in method io.open_code}
        1    0.000    0.000    0.000    0.000 {built-in method marshal.loads}
        3    0.000    0.000    0.000    0.000 {built-in method posix.fspath}
        6    0.000    0.000    0.000    0.000 {built-in method posix.stat}
        2    0.000    0.000    0.000    0.000 {built-in method rclpy._rclpy.rclpy_clock_get_now}
        2    0.000    0.000    0.000    0.000 {built-in method rclpy._rclpy.rclpy_create_time_point}
        1    0.081    0.081    0.081    0.081 {built-in method rclpy._rclpy.rclpy_publish}
        4    0.000    0.000    0.000    0.000 {built-in method rclpy._rclpy.rclpy_time_point_get_nanoseconds}
       28    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'endswith' of 'str' objects}
        7    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
       18    0.000    0.000    0.000    0.000 {method 'get' of 'mappingproxy' objects}
        9    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}
        4    0.000    0.000    0.000    0.000 {method 'items' of 'mappingproxy' objects}
       23    0.000    0.000    0.000    0.000 {method 'join' of 'str' objects}
        2    0.000    0.000    0.000    0.000 {method 'keys' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {method 'mro' of 'type' objects}
        3    0.000    0.000    0.000    0.000 {method 'pop' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {method 'read' of '_io.BufferedReader' objects}
       10    0.000    0.000    0.000    0.000 {method 'rpartition' of 'str' objects}
       44    0.000    0.000    0.000    0.000 {method 'rstrip' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {method 'setdefault' of 'dict' objects}


[INFO] [1617995338.353630250] [self_cycle_py]: publishing took 81414 [us]

Seems to say all significant time spent is inside of rclpy._rclpy.rclpy_publish.

@sloretz
Copy link
Contributor

sloretz commented Apr 9, 2021

@sloretz Out of curiosity, how does Rolling/Galactic fair?

It's not the same comparison, but my Rolling workspace built in Debug mode is the same order of magnitude for Python. I haven't tried the example comparing it with C++.

Singularity> python3 src/self_cycle.py
[INFO] [1617995587.216129384] [self_cycle_py]: publishing took 98169 [us]
[INFO] [1617995587.225600170] [self_cycle_py]: Got point cloud!
[INFO] [1617995588.200063262] [self_cycle_py]: publishing took 96978 [us]
[INFO] [1617995588.205213989] [self_cycle_py]: Got point cloud!
[INFO] [1617995589.214581427] [self_cycle_py]: publishing took 104143 [us]
[INFO] [1617995589.221240638] [self_cycle_py]: Got point cloud!

@clalancette
Copy link
Contributor

It's not the same comparison, but my Rolling workspace built in Debug mode is the same order of magnitude for Python. I haven't tried the example comparing it with C++.

Thanks. Then I'll leave this on the Galactic board to look at.

@sloretz
Copy link
Contributor

sloretz commented Apr 9, 2021

Using callgrind after building with RelWithDebInfo seems to say the problem is in sensor_msgs__msg__point_cloud2__convert_from_py where it calls PyLong_AsUnsignedLong a bunch of times, which is almost certainly where it's converting from the data field. Maybe there's a faster way to do this if the Python type is already an array?

@hidmic you called the location :)

Template for generated code https://github.com/ros2/rosidl_python/blob/5b9fe9cad6876e877cbbcf17950c47a9e753c6e6/rosidl_generator_py/resource/_msg_support.c.em#L158

Generated code in file build/sensor_msgs/rosidl_generator_py/sensor_msgs/msg/_point_cloud2_s.c

Generated code for above function
ROSIDL_GENERATOR_C_EXPORT
bool sensor_msgs__msg__point_cloud2__convert_from_py(PyObject * _pymsg, void * _ros_message)
{
  // check that the passed message is of the expected Python class
  {
    char full_classname_dest[42];
    {
      char * class_name = NULL;
      char * module_name = NULL;
      {
        PyObject * class_attr = PyObject_GetAttrString(_pymsg, "__class__");
        if (class_attr) {
          PyObject * name_attr = PyObject_GetAttrString(class_attr, "__name__");
          if (name_attr) {
            class_name = (char *)PyUnicode_1BYTE_DATA(name_attr);
            Py_DECREF(name_attr);
          }
          PyObject * module_attr = PyObject_GetAttrString(class_attr, "__module__");
          if (module_attr) {
            module_name = (char *)PyUnicode_1BYTE_DATA(module_attr);
            Py_DECREF(module_attr);
          }
          Py_DECREF(class_attr);
        }
      }
      if (!class_name || !module_name) {
        return false;
      }
      snprintf(full_classname_dest, sizeof(full_classname_dest), "%s.%s", module_name, class_name);
    }
    assert(strncmp("sensor_msgs.msg._point_cloud2.PointCloud2", full_classname_dest, 41) == 0);
  }
  sensor_msgs__msg__PointCloud2 * ros_message = _ros_message;
  {  // header
    PyObject * field = PyObject_GetAttrString(_pymsg, "header");
    if (!field) {
      return false;
    }
    if (!std_msgs__msg__header__convert_from_py(field, &ros_message->header)) {
      Py_DECREF(field);
      return false;
    }
    Py_DECREF(field);
  }
  {  // height
    PyObject * field = PyObject_GetAttrString(_pymsg, "height");
    if (!field) {
      return false;
    }
    assert(PyLong_Check(field));
    ros_message->height = PyLong_AsUnsignedLong(field);
    Py_DECREF(field);
  }
  {  // width
    PyObject * field = PyObject_GetAttrString(_pymsg, "width");
    if (!field) {
      return false;
    }
    assert(PyLong_Check(field));
    ros_message->width = PyLong_AsUnsignedLong(field);
    Py_DECREF(field);
  }
  {  // fields
    PyObject * field = PyObject_GetAttrString(_pymsg, "fields");
    if (!field) {
      return false;
    }
    PyObject * seq_field = PySequence_Fast(field, "expected a sequence in 'fields'");
    if (!seq_field) {
      Py_DECREF(field);
      return false;
    }
    Py_ssize_t size = PySequence_Size(field);
    if (-1 == size) {
      Py_DECREF(seq_field);
      Py_DECREF(field);
      return false;
    }
    if (!sensor_msgs__msg__PointField__Sequence__init(&(ros_message->fields), size)) {
      PyErr_SetString(PyExc_RuntimeError, "unable to create sensor_msgs__msg__PointField__Sequence ros_message");
      Py_DECREF(seq_field);
      Py_DECREF(field);
      return false;
    }
    sensor_msgs__msg__PointField * dest = ros_message->fields.data;
    for (Py_ssize_t i = 0; i < size; ++i) {
      if (!sensor_msgs__msg__point_field__convert_from_py(PySequence_Fast_GET_ITEM(seq_field, i), &dest[i])) {
        Py_DECREF(seq_field);
        Py_DECREF(field);
        return false;
      }
    }
    Py_DECREF(seq_field);
    Py_DECREF(field);
  }
  {  // is_bigendian
    PyObject * field = PyObject_GetAttrString(_pymsg, "is_bigendian");
    if (!field) {
      return false;
    }
    assert(PyBool_Check(field));
    ros_message->is_bigendian = (Py_True == field);
    Py_DECREF(field);
  }
  {  // point_step
    PyObject * field = PyObject_GetAttrString(_pymsg, "point_step");
    if (!field) {
      return false;
    }
    assert(PyLong_Check(field));
    ros_message->point_step = PyLong_AsUnsignedLong(field);
    Py_DECREF(field);
  }
  {  // row_step
    PyObject * field = PyObject_GetAttrString(_pymsg, "row_step");
    if (!field) {
      return false;
    }
    assert(PyLong_Check(field));
    ros_message->row_step = PyLong_AsUnsignedLong(field);
    Py_DECREF(field);
  }
  {  // data
    PyObject * field = PyObject_GetAttrString(_pymsg, "data");
    if (!field) {
      return false;
    }
    PyObject * seq_field = PySequence_Fast(field, "expected a sequence in 'data'");
    if (!seq_field) {
      Py_DECREF(field);
      return false;
    }
    Py_ssize_t size = PySequence_Size(field);
    if (-1 == size) {
      Py_DECREF(seq_field);
      Py_DECREF(field);
      return false;
    }
    if (!rosidl_runtime_c__uint8__Sequence__init(&(ros_message->data), size)) {
      PyErr_SetString(PyExc_RuntimeError, "unable to create uint8__Sequence ros_message");
      Py_DECREF(seq_field);
      Py_DECREF(field);
      return false;
    }
    uint8_t * dest = ros_message->data.data;
    for (Py_ssize_t i = 0; i < size; ++i) {
      PyObject * item = PySequence_Fast_GET_ITEM(seq_field, i);
      if (!item) {
        Py_DECREF(seq_field);
        Py_DECREF(field);
        return false;
      }
      assert(PyLong_Check(item));
      uint8_t tmp = (uint8_t)PyLong_AsUnsignedLong(item);

      memcpy(&dest[i], &tmp, sizeof(uint8_t));
    }
    Py_DECREF(seq_field);
    Py_DECREF(field);
  }
  {  // is_dense
    PyObject * field = PyObject_GetAttrString(_pymsg, "is_dense");
    if (!field) {
      return false;
    }
    assert(PyBool_Check(field));
    ros_message->is_dense = (Py_True == field);
    Py_DECREF(field);
  }

  return true;
}

@clalancette
Copy link
Contributor

Potentially https://docs.python.org/3/c-api/buffer.html , along with a single memcpy into the destination buffer? Whether that will work depends on the type of the Python object. While that is still a copy, it should be much faster than the loop we currently have.

@gavanderhoorn

This comment has been minimized.

@karl-schulz
Copy link
Author

karl-schulz commented Apr 10, 2021

Hi, first of all thanks to @sloretz and @clalancette for narrowing down the problem so fast!!

@gavanderhoorn I think the questions you are referring to also talk about performance problems, but with different reasons:

  • Middleware problems where messages were dropped
  • More severe performance issues (multiple seconds latency) which were fixed before Foxy
  • Conversion of messages from cv2 or issues of the data setter of the rclpy message

We really spent a lot of time searching for questions where the problem is really only the Python API. That's why we also spent the time trying out all solutions from the existing questions you mentioned (see "Tried Solutions" above) and isolating the issue as much as possible in the reproduction package.

I would also say that is is not really relevant if the problem might be old, because it stays a problem if rclpy can't keep up with the ROS1 Python API in terms of performance.

Looking at the section @sloretz found to be the issue, I think it seems that every byte of the 10MB data array is iterated over with casts and single memcpy, which I think is a really promising lead :)

@gavanderhoorn

This comment has been minimized.

@sloretz
Copy link
Contributor

sloretz commented Jun 23, 2021

I think this can be closed now that ros2/rosidl_python#129 has been merged!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants