[ROS2][Humble desktop] Services seldomly not getting responses #628

Closed
MarcoMatteoBassa opened this issue Jun 24, 2022 · 10 comments

@MarcoMatteoBassa

Hi guys, I'm writing because we are running into an issue where our ROS service clients seldomly fail to get a reply from the servers when using the humble-desktop image. A detailed description of the issue and the code to reproduce it can be found at:

https://github.com/MarcoMatteoBassa/reproduce_client_bug

I opened the issue here because we can't reproduce the bug when installing humble directly on our system, but let me know if I should move it to the ros2 issues. Thanks in advance for looking into it!

@gavanderhoorn

clients seldomly fail

Pedantic, but this means they almost always succeed.

You mean they rarely succeed, right?

Also: is this a cross-post of [ROS2][Humble] async_send_request service result seldomly never returns using executors on ROS Answers?

@MarcoMatteoBassa
Author

MarcoMatteoBassa commented Jun 24, 2022

Hi @gavanderhoorn, thanks for replying. The calls usually succeed when running on a fast system with no heavy load, and quickly start failing on slower systems or when the CPU is loaded with other work. Further testing showed that the issue seems to be reproducible only when running in a humble Docker image (so I closed https://answers.ros.org/question/402780/ros2humble-async_send_request-service-result-seldomly-never-returns-using-executors/ and moved this here). When running in an older image (galactic), or when using a normal ros2 installation on Ubuntu 22, the issue disappears. The issue persists with cyclone-dds.
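For anyone hitting the same symptom, here is a minimal sketch of bounding the wait on a reply, so that a lost response shows up as a timeout instead of a hang. It is not the code from the reproduce repository and it does not use rclcpp: a plain `std::future` stands in for the future a service client hands back, the names `fake_service_call` and `wait_for_reply` are hypothetical, and the "lost reply" is simulated by a promise that is never fulfilled. It assumes C++17 for `std::optional`.

```cpp
#include <chrono>
#include <future>
#include <optional>
#include <thread>
#include <vector>

// Hypothetical stand-in for a service call that adds two integers.
// With drop_reply == true the promise is parked and never fulfilled,
// which mimics a response that never reaches the client.
// Not thread-safe: the parked list is only meant for single-threaded demos.
std::future<int> fake_service_call(int a, int b, bool drop_reply)
{
  if (drop_reply) {
    static std::vector<std::promise<int>> parked;  // keeps promises alive
    parked.emplace_back();
    return parked.back().get_future();
  }
  // Normal case: the "server" answers after a short delay on another thread.
  return std::async(std::launch::async, [a, b]() {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    return a + b;
  });
}

// Wait for the reply for at most `timeout`.
// Returns the value if it arrived in time, std::nullopt otherwise.
std::optional<int> wait_for_reply(
  std::future<int> & reply, std::chrono::milliseconds timeout)
{
  if (reply.wait_for(timeout) == std::future_status::ready) {
    return reply.get();
  }
  return std::nullopt;
}
```

Calling `wait_for_reply` on the result of `fake_service_call(2, 3, true)` returns an empty optional once the timeout expires, while the non-dropped call yields the sum. This only makes the symptom visible, it does not address whatever causes the reply to go missing. In an rclcpp node the analogous step would be waiting on the client's future with a timeout rather than indefinitely.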

@ruffsl
Member

ruffsl commented Jun 24, 2022

At first I was going to inquire about how you were launching the docker containers, specifically which network interfaces you chose to bind to the container. I was about to suggest that the default bridge network used in docker could be dropping network packets and causing the client to miss server responses.

https://github.com/MarcoMatteoBassa/reproduce_client_bug/blob/main/docker/run_docker_humble.sh

However, from your run script, it appears you are already binding the host's network interfaces directly to the container, as well as the IPC namespace (but notably not the PID space; not sure that would matter here), so that appears to rule out that theory.

https://answers.ros.org/question/296828/ros2-connectivity-across-docker-containers-via-host-driver/

For ROS2, what is the default QoS for the DDS topics that shuttle the server's replies to service requests? Is it best effort by chance, or reliable? In particular, I am asking about this exact equivalent DDS topic:

https://github.com/ros2/sros2/blob/c40c70635773e3d2c9d2965c9e4008cf38f6a069/sros2/test/policies/permissions/add_two_ints/permissions.xml#L21

@MarcoMatteoBassa
Author

Hi @ruffsl, thanks for taking a look. Both the clients and the servers were created using

rmw_qos_profile_t rmw_qos_profile_services_default =
{
  RMW_QOS_POLICY_HISTORY_KEEP_LAST,           // history
  10,                                         // depth
  RMW_QOS_POLICY_RELIABILITY_RELIABLE,        // reliability
  RMW_QOS_POLICY_DURABILITY_VOLATILE,         // durability
  RMW_QOS_DEADLINE_DEFAULT,                   // deadline
  RMW_QOS_LIFESPAN_DEFAULT,                   // lifespan
  RMW_QOS_POLICY_LIVELINESS_SYSTEM_DEFAULT,   // liveliness
  RMW_QOS_LIVELINESS_LEASE_DURATION_DEFAULT,  // liveliness_lease_duration
  false                                       // avoid_ros_namespace_conventions
};

So I would expect it to be reliable. Is there any way I can introspect the actual underlying DDS implementation too?

@ruffsl
Member

ruffsl commented Jun 24, 2022

is there any way I can introspect the actual underlying dds implementation too?

I know RTI has some nice debugging tools, but they do require a licence (could try the free trial period). Another alternative that I haven't got around to exploring yet is eProsima's Fast DDS Monitor (Free!), which looks quite promising:

https://www.eprosima.com/index.php/products-all/eprosima-fast-dds-monitor
https://github.com/eProsima/Fast-DDS-monitor

@MarcoMatteoBassa
Author

is there any way I can introspect the actual underlying dds implementation too?

I know RTI has some nice debugging tools, but they do require a licence (could try the free trial period). Another alternative that I haven't got around to exploring yet is eProsima's Fast DDS Monitor (Free!), which looks quite promising:

https://www.eprosima.com/index.php/products-all/eprosima-fast-dds-monitor https://github.com/eProsima/Fast-DDS-monitor

Thanks for the tip. I played a bit with the Fast-DDS-monitor. Unfortunately it doesn't allow me to introspect the properties of the topics, but I also couldn't see anything strange in the information it does provide :( I would assume that reliable is used anyway.

@ruffsl
Member

ruffsl commented Jun 27, 2022

I have confirmed this using docker, with a simplified example using ros:humble:

MarcoMatteoBassa/reproduce_client_bug#1

I have not yet confirmed without docker though.

@MarcoMatteoBassa
Author

MarcoMatteoBassa commented Jul 7, 2022

FYI, I verified that it still occurs on the latest osrf/ros2:nightly, even after ros2/rmw_fastrtps#616.

@MarcoMatteoBassa
Author

MarcoMatteoBassa commented Jan 2, 2023

This is still reproducible on ros:humble and ros:rolling, but is apparently not happening anymore on osrf/ros2:nightly.
I guess that this was fixed here (https://github.com/ros2/rclcpp/pull/2044/files) or by something else in release 18.0.
I'll close as soon as it passes on ros:humble.

@MarcoMatteoBassa
Author

Verified that
https://github.com/ros2/rclcpp/pull/2044/files,
integrated in the latest image release, fixed the issue.
