Improve service discovery #392

hidmic · 2020-05-28T13:51:47Z

Feature request

Feature description

Follow-up issue after #390. Prior to that PR, services' discovery would rely on built-in request/reply topics discovery, which can and did lead to races (see ros2/ros2#922). A short-term workaround was then introduced by #390, but not without caveats i.e. blocking behavior of rmw_send_response, non-standard (ab)use of sample identity data.

OMG DDS-RPC 1.0 spec describes both the issues associated with independent discovery of request/reply topics (see Basic Service Mapping under section 7.6.1) and an algorithm to prevent them (see Enhanced Service Mapping under section 7.6.2). Services' implementation in rmw_fastrtps*_cpp could and probably should move in that direction.

The text was updated successfully, but these errors were encountered:

sloretz · 2020-06-05T16:44:19Z

@richiware Friendly FYI. Does eProsima have any plans to work on this one?

richiware · 2020-06-08T11:33:27Z

We were talking this morning about this. We have in mind to support deeply in the future the DDS-RPC standard on FastDDS. But right now the service discovery issue will be addressed by #390 solution. There are other features/issues that need our focus.

hidmic · 2020-06-08T14:34:33Z

@richiware #390 is not a solution, it is a workaround at best and seemingly not an effective one for all cases (see ros2/rclcpp#1152 and ros2/ros2#931). To the extent that it is possible, please, do consider addressing this issue at some point in the near future.

ivanpauno · 2020-06-08T17:20:05Z

I agree with @hidmic that #390 is a workaround.

Though a full implementation of DDS-RPC standard would be great, that's a lot of effort.
Support of an enhanced discovery mechanism to avoid the race condition should be enough to solve the problem.

richiware · 2020-06-12T06:12:48Z

We've added your suggestion on our roadmap.

SteveMacenski · 2020-06-18T19:21:14Z

To summarize issues from ros2/ros2#931 with respect to navigation reported by @daisukes

Upgrading to 2.x.x from 1.x.x results in services failing to function (<5% successful calls), which is a safety issue with fast-dds. Rolling back fixes this.

dirk-thomas · 2020-06-18T19:22:11Z

@richiware With the report mention in the previous comment I think there is a clear regression which should be addressed asap.

richiware · 2020-07-02T09:42:32Z

@MiguelCompany could you help in this?

SteveMacenski · 2020-07-07T19:09:00Z

@JaimeMartin can we get an update on this? This is a safety critical issue and we may have to switch primary support and instructions for Navigation2 to Cyclone if we can't reliably call services in Foxy. It could genuinely cause complete system failure.

MiguelCompany · 2020-07-08T06:48:51Z

@SteveMacenski Looking at ros2/ros2#931 I see all the analysis was done before #390 was merged. I know it is a workaround and it has not made its way to a Foxy patch release (it is only in master), but have you checked if it improves the behavior?

Regarding the commit that seems to be responsible, it is the merge commit of eProsima/Fast-DDS#1190. Among other things, that PR fixes a wrong behavior with volatile late-joiners. Before the PR, volatile late-joiners were (incorrectly) receiving old samples in some cases, and fixing that to be compliant with RTPS standard, makes the service discovery race more likely to happen. Commits affecting volatile late-joiners are eProsima/Fast-DDS@a1c35b5 and eProsima/Fast-DDS@30f8f6f

Nevertheless, we have just started analyzing ros-navigation/navigation2#1772 and, at the same time, designing a proper service discovery on rmw_fastrtps.

SteveMacenski · 2020-07-08T19:52:06Z

We also had to drop some unit tests as well because of this issue causing them to fail with Fast-RTPS only ros-navigation/navigation2#1856 and reports of various elements of navigation2 not working (recovery behaviors) due to services not working.

Have you released this update so that it will be in ROS2 foxy after the sync Friday (https://discourse.ros.org/t/preparing-for-foxy-sync-and-patch-release-2020-07-10/15265)?

jacobperron · 2020-07-13T23:38:12Z

Have you released this update so that it will be in ROS2 foxy after the sync Friday (https://discourse.ros.org/t/preparing-for-foxy-sync-and-patch-release-2020-07-10/15265)?

The workaround in #390 dropped from my radar. I've added it to the project board and will make sure it gets into the next Foxy sync.

MiguelCompany · 2020-07-20T06:20:57Z

@dirk-thomas @jacobperron As you can see here the issue has been solved by eProsima/Fast-DDS#1295. I hope this can be released in foxy ASAP.

jacobperron · 2020-07-20T21:39:14Z

@MiguelCompany With eProsima/Fast-DDS#1295, does it mean that the changes introduced in #390 are no longer required (or should #390 still be backported to Foxy)?

MiguelCompany · 2020-07-21T07:29:23Z

With eProsima/Fast-DDS#1295, does it mean that the changes introduced in #390 are no longer required (or should #390 still be backported to Foxy)?

@jacobperron No, the changes in #390 are still necessary, as there are still chances of asymmetric discovery of endpoints. Nevertheless, we are preparing a design for a different approach to how the service discovery race is handled. Although the new mechanism will be inspired by the DDS-RPC standard, it will not strictly follow it.

We will open a PR in a few days with the proposed design.

jacobperron · 2020-07-21T16:47:06Z

@MiguelCompany If you can create a new release tag for the Fast-DDS 2.x.x branch, I can help bloom a new release into rosdistro for Foxy.

dirk-thomas · 2020-07-21T17:36:11Z

I would suggest to wait until eProsima/Fast-DDS#1304 has landed too (which should hopefully be soon - CI is running as we speak).

MiguelCompany · 2020-07-23T05:43:59Z

@dirk-thomas @jacobperron The warnings related PRs have been merged, and I have created tag 2.0.1-rc, which will most likely become release v2.0.1 after we proceed with release testing.

IkerLuengo · 2020-07-31T08:44:28Z

We have done the design to address the problem with the service discovery. You can find it documented on PR #418

aditya2592 · 2022-06-22T00:00:30Z

Hi, I am facing several issues with service reliability when using ROS2 Galactic with Fast DDS over SHM. Also using discovery server. Most common occurrences:

Request is sent by client and response isn't received : This happens even with RELIABLE and KEEP_ALL QOS settings. However sending another request after a timeout works fine.
Client fails to discover service during startup when using wait_for_service, even though the service has started up fine.

Is there anything on the ROS2 roadmap that will help address service reliability in the near term? We just switched to ROS2 and this issue is hurting our ability to build a stack around it, because being able to have a service driven infrastructure is critical to efficient use of CPU on low compute platforms.

MarcoMatteoBassa · 2022-06-24T07:40:03Z

HI @aditya2592 , are you guys running into these issues while using docker or also on a normal installation?
We could reproduce the first issue (see https://github.com/MarcoMatteoBassa/reproduce_client_bug) but only while running on a docker container (not specifically with fastrtps, happens with cyclone too), so we are wondering if there's any docker-specific problem.

aditya2592 · 2022-06-24T15:46:04Z

Hi @MarcoMatteoBassa, yes we are also running through a docker, thats awesome that you were able to reproduce it. I have been having trouble doing that because it is indeed effected by system load and low compute platforms. On top of that, everything we have so far is on the same system ie on localhost itself. Because of these issues on the same machine, we havent been able to expand to ROS2 across machines. We also switched to using the Fast DDS discovery server with below config as suggested in the tutorials, though I am still unsure why discovery server would be needed to discover services on the same machine

<?xml version="1.0" encoding="UTF-8" ?>
<dds>
    <profiles xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
        <participant profile_name="super_client_profile" is_default_profile="true">
            <rtps>
                <sendSocketBufferSize>1048576</sendSocketBufferSize>
                <listenSocketBufferSize>4194304</listenSocketBufferSize>
                <builtin>
                    <discovery_config>
                        <discoveryProtocol>SUPER_CLIENT</discoveryProtocol>
                        <discoveryServersList>
                            <RemoteServer prefix="44.53.00.5f.45.50.52.4f.53.49.4d.41">
                                <metatrafficUnicastLocatorList>
                                    <locator>
                                        <udpv4>
                                            <address>127.0.0.1</address>
                                            <port>11811</port>
                                        </udpv4>
                                    </locator>
                                </metatrafficUnicastLocatorList>
                            </RemoteServer>
                        </discoveryServersList>
                    </discovery_config>
                </builtin>
            </rtps>
        </participant>
    </profiles>
</dds>

MarcoMatteoBassa · 2022-06-24T16:10:30Z

@aditya2592 did you run into those issues even when running on cyclone-DDS (without the discovery server) on galactic?
For Humble apparently something is moving https://discourse.ros.org/t/nav2-issues-with-humble-binaries-due-to-fast-dds-rmw-regression/26128/2

fujitatomoya · 2022-06-24T16:40:24Z

@MiguelCompany this could be related to #616?

aditya2592 · 2022-06-24T17:00:17Z

@aditya2592 did you run into those issues even when running on cyclone-DDS (without the discovery server) on galactic? For Humble apparently something is moving https://discourse.ros.org/t/nav2-issues-with-humble-binaries-due-to-fast-dds-rmw-regression/26128/2

We had a lot more issues with cyclone so we stopped experimenting with that early on. There were memory leaks and general latency was not good with cyclone. But I have seen issues on both galactic and humble using Fast DDS with/without discovery server, with all services and clients on the same machine

aditya2592 · 2022-07-05T21:36:38Z

@aditya2592 did you run into those issues even when running on cyclone-DDS (without the discovery server) on galactic? For Humble apparently something is moving https://discourse.ros.org/t/nav2-issues-with-humble-binaries-due-to-fast-dds-rmw-regression/26128/2

@MarcoMatteoBassa Were you able to run your experiment with the fix in this MR - #616 ?

MarcoMatteoBassa · 2022-07-07T13:05:41Z

Hi @aditya2592 @fujitatomoya , I just tried to use the latest version of osrf/ros2:nightly (verified that the specified fix is there) as base for my test in https://github.com/MarcoMatteoBassa/reproduce_client_bug but I can still reproduce the issue

fujitatomoya · 2022-07-07T16:06:30Z

@MarcoMatteoBassa thanks for checking, i cannot reproduce with your test but technically there is a race condition for service discovery as described on this thread.

MarcoMatteoBassa · 2022-07-07T16:15:13Z

@fujitatomoya thanks for trying, JFI, here is the corresponding issue (opend on the docker images repo because I can only reproduce it on docker) osrf/docker_images#628

clalancette · 2022-07-07T16:38:07Z

The change in #616 has been merged, but not released yet. That means it is definitely not in the docker images yet. We'll have to do a new release of rmw_fastrtps into Rolling, and then rebuild the docker images before the test is valid.

In the meantime, @MarcoMatteoBassa is there any chance you can test from sources instead? You can follow the instructions at https://docs.ros.org/en/rolling/Installation/Alternatives/Ubuntu-Development-Setup.html to do that.

aditya2592 · 2022-07-07T18:51:30Z

@MarcoMatteoBassa I incorporated the fix into my Docker image like below by only building rmw_fastrtps from source on the fixed branch. Rest of ROS2 is still from the existing docker image on Dockerhub. @clalancette This approach should work right?

ARG ROS_DISTRO=galactic
FROM ros:${ROS_DISTRO}-ros-base

WORKDIR /root/ws_ros
# RMW Fast DDS
# NOTE: this branch is used to incorporate fix for services
RUN git clone https://github.com/ros2/rmw_fastrtps/ -b mergify/bp/${ROS_DISTRO}/pr-616 src/rmw_fastrtps
RUN . /opt/ros/${ROS_DISTRO}/setup.sh && \
    rosdep update && \
    apt-get -q update && \
    DEBIAN_FRONTEND=noninteractive \
    # Install RMW Fast RTPS deps
    rosdep install -y --from-paths src/rmw_fastrtps --ignore-src --rosdistro ${ROS_DISTRO} --as-root=apt:false && \
    colcon build \
            --cmake-args -DCMAKE_BUILD_TYPE=RelWithDebInfo -DBUILD_TESTING=OFF -DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
            --ament-cmake-args -DCMAKE_BUILD_TYPE=RelWithDebInfo \
            --event-handlers desktop_notification- status- && \
    # Remove the source code from this container
    rm -rf src build && \
    ccache -s

Also another thing I am trying is to limit transports to only shared memory. This for sure reduced service request/response getting dropped on one of our PCs but they still happen occasionally:

<?xml version="1.0" encoding="UTF-8" ?>
<dds>
        <profiles xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
        <transport_descriptors>
            <!-- Create a descriptor for the new transport -->
            <transport_descriptor>
                <transport_id>shm_transport</transport_id>
                <type>SHM</type>
            </transport_descriptor>
        </transport_descriptors>

        <participant profile_name="super_client_profile" is_default_profile="true">
            <rtps>
                <!-- Link the Transport Layer to the Participant -->
                <userTransports>
                    <transport_id>shm_transport</transport_id>
                </userTransports>
                <useBuiltinTransports>false</useBuiltinTransports>
            </rtps>
        </participant>
    </profiles>
</dds>

MarcoMatteoBassa · 2022-07-08T07:13:45Z

@clalancette I verified that on osrf/ros2:nightly the installed rmw_fastrtps_shared_cpp/custom_client_info.hpp and rmw_fastrtps_shared_cpp/custom_service_info.hpp were modified as in the fix https://github.com/ros2/rmw_fastrtps/pull/616/files.
But sure, in the coming days I can also try it from source. @aditya2592 Thanks for the dockerfile!

MarcoMatteoBassa · 2022-07-18T09:22:53Z

Hi @clalancette , just tested with a full src installation, but this is unfortunately still reproducible

aditya2592 · 2022-07-18T15:35:01Z

Update from my end on this, used the following config to completely eliminate any transport except SHM and also assigned a 75MB segment size. I also had to make all service callbacks fall in the Reentrant callback group even if they don't require to be called in parallel and have only one client requesting at a time. This based on experiments so far removed service drops on Humble. It also seems to allow discovery without using discovery server. Other things I tried that didn't work :

SHM only transport but no segment_size increase. This caused an enormous amount of drops.
export ROS_LOCALHOST_ONLY=1

Even though this works for our use case right now, services should be reliable for any transport on ROS ideally.

<?xml version="1.0" encoding="UTF-8" ?>
<dds>
    <profiles xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
        <transport_descriptors>
            <!-- SHM transport should be the only one used for IPC on same machine -->
            <transport_descriptor>
                <transport_id>shm_transport</transport_id>
                <type>SHM</type>
                <!-- 75MB is assigned to each participant which means every process, one participant is equivalent to one process -->
                <!-- Multiply this by total processes in system to get overall SHM consumption -->
                <segment_size>78643200</segment_size>
                <maxMessageSize>78643200</maxMessageSize>
            </transport_descriptor>
        </transport_descriptors>

        <participant profile_name="super_client_profile" is_default_profile="true">
            <rtps>
                <!-- SHM only is enabled to enforce SHM use because currently all ROS2 comms happen on same machine -->
                <userTransports>
                    <transport_id>shm_transport</transport_id>
                </userTransports>
                <!-- Prevent builtin which automatically chooses transport -->
                <useBuiltinTransports>false</useBuiltinTransports>
            </rtps>
        </participant>
    </profiles>
</dds>

Jayden-F · 2023-05-10T00:52:36Z

#691

Is this planning to be fixed in Humble?

This was referenced May 28, 2020

Make service wait for response reader #390

Merged

Improve service discovery ros2/rmw_cyclonedds#191

Open

ivanpauno mentioned this issue Jun 3, 2020

Fast-RPTS 2.0.x might introduce service performance issue ros2/ros2#931

Closed

This was referenced Jun 18, 2020

Unreliable services ros2/rclcpp#1152

Closed

Flaky services performance ros2/ros2#922

Closed

fujitatomoya mentioned this issue Jun 24, 2020

"dynamic" service becomes unresponsive after being destroyed and created again ros2/rclcpp#1175

Closed

MiguelCompany mentioned this issue Jul 27, 2020

Fix sending GAPs to late joiners [8880] eProsima/Fast-DDS#1295

Merged

IkerLuengo mentioned this issue Jul 31, 2020

Fast-DDS service discovery redesign #418

Open

rob-clarke mentioned this issue Feb 27, 2022

ROS2 sys_status plugin services callbacks not executing mavlink/mavros#1588

Open

fujitatomoya mentioned this issue Mar 21, 2023

ROS2 examples_rclcpp_minimal_client test fail #677

Open

MichaelOrlov mentioned this issue Jan 16, 2024

👨‍🌾 Regression in test_play_{timing,services}__rmw_{rmw_vendor} on the buildfarm jobs ros2/rosbag2#862

Open

Improve service discovery #392

Improve service discovery #392

Comments

hidmic commented May 28, 2020

Feature request

Feature description

sloretz commented Jun 5, 2020

richiware commented Jun 8, 2020

hidmic commented Jun 8, 2020

ivanpauno commented Jun 8, 2020

richiware commented Jun 12, 2020

SteveMacenski commented Jun 18, 2020 • edited Loading

dirk-thomas commented Jun 18, 2020 • edited Loading

richiware commented Jul 2, 2020

SteveMacenski commented Jul 7, 2020

MiguelCompany commented Jul 8, 2020 • edited Loading

SteveMacenski commented Jul 8, 2020

jacobperron commented Jul 13, 2020 • edited Loading

MiguelCompany commented Jul 20, 2020

jacobperron commented Jul 20, 2020

MiguelCompany commented Jul 21, 2020

jacobperron commented Jul 21, 2020

dirk-thomas commented Jul 21, 2020

MiguelCompany commented Jul 23, 2020

IkerLuengo commented Jul 31, 2020

aditya2592 commented Jun 22, 2022 • edited Loading

MarcoMatteoBassa commented Jun 24, 2022 • edited Loading

aditya2592 commented Jun 24, 2022 • edited Loading

MarcoMatteoBassa commented Jun 24, 2022

fujitatomoya commented Jun 24, 2022

aditya2592 commented Jun 24, 2022

aditya2592 commented Jul 5, 2022

MarcoMatteoBassa commented Jul 7, 2022 • edited Loading

fujitatomoya commented Jul 7, 2022

MarcoMatteoBassa commented Jul 7, 2022

clalancette commented Jul 7, 2022

aditya2592 commented Jul 7, 2022 • edited Loading

MarcoMatteoBassa commented Jul 8, 2022

MarcoMatteoBassa commented Jul 18, 2022

aditya2592 commented Jul 18, 2022 • edited Loading

Jayden-F commented May 10, 2023

SteveMacenski commented Jun 18, 2020 •

edited

Loading

dirk-thomas commented Jun 18, 2020 •

edited

Loading

MiguelCompany commented Jul 8, 2020 •

edited

Loading

jacobperron commented Jul 13, 2020 •

edited

Loading

aditya2592 commented Jun 22, 2022 •

edited

Loading

MarcoMatteoBassa commented Jun 24, 2022 •

edited

Loading

aditya2592 commented Jun 24, 2022 •

edited

Loading

MarcoMatteoBassa commented Jul 7, 2022 •

edited

Loading

aditya2592 commented Jul 7, 2022 •

edited

Loading

aditya2592 commented Jul 18, 2022 •

edited

Loading