Improve service discovery #392

Open · hidmic opened this issue May 28, 2020 · 35 comments

@hidmic
Contributor

hidmic commented May 28, 2020

Feature request

Feature description

Follow-up issue to #390. Prior to that PR, service discovery relied on the built-in discovery of the request/reply topics, which can and did lead to races (see ros2/ros2#922). A short-term workaround was then introduced by #390, but not without caveats, i.e. the blocking behavior of rmw_send_response and the non-standard (ab)use of sample identity data.

The OMG DDS-RPC 1.0 spec describes both the issues associated with independent discovery of the request/reply topics (see Basic Service Mapping under section 7.6.1) and an algorithm to prevent them (see Enhanced Service Mapping under section 7.6.2). The services implementation in rmw_fastrtps*_cpp could and probably should move in that direction.
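
To make the race concrete, below is a minimal rclpy sketch (not from this issue; the node name, service name, and Trigger type are illustrative). wait_for_service() only reflects what the client itself has discovered; the server may not yet have matched the client's reply reader when the request arrives, and with volatile durability a response published in that window is never delivered, so the very first call can hang.

import rclpy
from rclpy.node import Node
from std_srvs.srv import Trigger

def main():
    rclpy.init()
    node = Node('race_demo_client')
    client = node.create_client(Trigger, 'some_service')

    # This only proves that *this client* has discovered the server's endpoints.
    if not client.wait_for_service(timeout_sec=5.0):
        raise RuntimeError('service not available')

    future = client.call_async(Trigger.Request())
    rclpy.spin_until_future_complete(node, future, timeout_sec=5.0)
    if not future.done():
        # The symptom discussed here: the request was handled, but the response never arrived.
        node.get_logger().error('no response received (possible discovery race)')
    node.destroy_node()
    rclpy.shutdown()

if __name__ == '__main__':
    main()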

@sloretz
Contributor

sloretz commented Jun 5, 2020

@richiware Friendly FYI. Does eProsima have any plans to work on this one?

@richiware
Contributor

We were talking about this this morning. We intend to fully support the DDS-RPC standard in Fast DDS in the future, but right now the service discovery issue will be addressed by the #390 solution. There are other features/issues that need our focus.

@hidmic
Contributor Author

hidmic commented Jun 8, 2020

@richiware #390 is not a solution, it is a workaround at best and seemingly not an effective one for all cases (see ros2/rclcpp#1152 and ros2/ros2#931). To the extent that it is possible, please do consider addressing this issue at some point in the near future.

@ivanpauno
Member

I agree with @hidmic that #390 is a workaround.

Though a full implementation of the DDS-RPC standard would be great, that's a lot of effort.
Support for an enhanced discovery mechanism that avoids the race condition should be enough to solve the problem.

@richiware
Contributor

We've added your suggestion to our roadmap.

@SteveMacenski

SteveMacenski commented Jun 18, 2020

To summarize the issues from ros2/ros2#931 with respect to navigation, reported by @daisukes:

Upgrading from 1.x.x to 2.x.x results in services failing to function (<5% of calls succeed), which is a safety issue with Fast DDS. Rolling back fixes this.

@dirk-thomas
Member

dirk-thomas commented Jun 18, 2020

@richiware With the report mentioned in the previous comment I think there is a clear regression which should be addressed ASAP.

@richiware
Contributor

@MiguelCompany could you help with this?

@SteveMacenski

@JaimeMartin can we get an update on this? This is a safety-critical issue, and we may have to switch primary support and instructions for Navigation2 to Cyclone if we can't reliably call services in Foxy. It could genuinely cause complete system failure.

@MiguelCompany
Collaborator

MiguelCompany commented Jul 8, 2020

@SteveMacenski Looking at ros2/ros2#931 I see all the analysis was done before #390 was merged. I know it is a workaround and it has not made its way to a Foxy patch release (it is only in master), but have you checked if it improves the behavior?

Regarding the commit that seems to be responsible, it is the merge commit of eProsima/Fast-DDS#1190. Among other things, that PR fixes incorrect behavior with volatile late-joiners. Before the PR, volatile late-joiners were (incorrectly) receiving old samples in some cases; fixing that to be compliant with the RTPS standard makes the service discovery race more likely to happen. The commits affecting volatile late-joiners are eProsima/Fast-DDS@a1c35b5 and eProsima/Fast-DDS@30f8f6f.

Nevertheless, we have just started analyzing ros-navigation/navigation2#1772 and, at the same time, designing a proper service discovery mechanism for rmw_fastrtps.
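
As background on the volatile late-joiner behavior described above, the following sketch (not from this issue; it assumes std_msgs and uses made-up topic/node names) illustrates the general DDS rule that makes the lost-response race possible: a reader that matches a VOLATILE writer only after a sample was published never receives that sample, while TRANSIENT_LOCAL redelivers it.

import rclpy
from rclpy.qos import QoSProfile, ReliabilityPolicy, DurabilityPolicy
from std_msgs.msg import String

def run(durability):
    node = rclpy.create_node('late_joiner_demo')
    qos = QoSProfile(depth=1,
                     reliability=ReliabilityPolicy.RELIABLE,
                     durability=durability)
    pub = node.create_publisher(String, 'demo_topic', qos)
    pub.publish(String(data='early sample'))  # published before any reader exists

    received = []
    node.create_subscription(String, 'demo_topic',
                             lambda msg: received.append(msg.data), qos)
    # Give discovery (and, for TRANSIENT_LOCAL, historical delivery) a moment.
    for _ in range(30):
        rclpy.spin_once(node, timeout_sec=0.1)
        if received:
            break
    node.destroy_node()
    return received

def main():
    rclpy.init()
    print('VOLATILE        ->', run(DurabilityPolicy.VOLATILE))         # expected: []
    print('TRANSIENT_LOCAL ->', run(DurabilityPolicy.TRANSIENT_LOCAL))  # expected: ['early sample']
    rclpy.shutdown()

if __name__ == '__main__':
    main()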

@SteveMacenski

We also had to drop some unit tests because this issue caused them to fail with Fast-RTPS only (ros-navigation/navigation2#1856), and there are reports of various elements of navigation2 (recovery behaviors) not working due to services not working.

Have you released this update so that it will be in ROS 2 Foxy after the sync on Friday (https://discourse.ros.org/t/preparing-for-foxy-sync-and-patch-release-2020-07-10/15265)?

@jacobperron
Member

jacobperron commented Jul 13, 2020

> Have you released this update so that it will be in ROS 2 Foxy after the sync on Friday (https://discourse.ros.org/t/preparing-for-foxy-sync-and-patch-release-2020-07-10/15265)?

The workaround in #390 dropped off my radar. I've added it to the project board and will make sure it gets into the next Foxy sync.

@MiguelCompany
Collaborator

@dirk-thomas @jacobperron As you can see here, the issue has been solved by eProsima/Fast-DDS#1295. I hope this can be released in Foxy ASAP.

@jacobperron
Member

@MiguelCompany With eProsima/Fast-DDS#1295, does it mean that the changes introduced in #390 are no longer required (or should #390 still be backported to Foxy)?

@MiguelCompany
Collaborator

> With eProsima/Fast-DDS#1295, does it mean that the changes introduced in #390 are no longer required (or should #390 still be backported to Foxy)?

@jacobperron No, the changes in #390 are still necessary, as there are still chances of asymmetric discovery of endpoints. Nevertheless, we are preparing a design for a different approach to how the service discovery race is handled. Although the new mechanism will be inspired by the DDS-RPC standard, it will not strictly follow it.

We will open a PR in a few days with the proposed design.

@jacobperron
Member

@MiguelCompany If you can create a new release tag for the Fast-DDS 2.x.x branch, I can help bloom a new release into rosdistro for Foxy.

@dirk-thomas
Member

I would suggest waiting until eProsima/Fast-DDS#1304 has landed too (which should hopefully be soon; CI is running as we speak).

@MiguelCompany
Collaborator

@dirk-thomas @jacobperron The warnings-related PRs have been merged, and I have created tag 2.0.1-rc, which will most likely become release v2.0.1 after we proceed with release testing.

@IkerLuengo
Contributor

We have finished the design to address the service discovery problem. You can find it documented in PR #418.

@aditya2592

aditya2592 commented Jun 22, 2022

Hi, I am facing several issues with service reliability when using ROS 2 Galactic with Fast DDS over SHM. We are also using the discovery server. The most common occurrences:

  1. A request is sent by the client and the response isn't received: this happens even with RELIABLE and KEEP_ALL QoS settings. However, sending another request after a timeout works fine (see the retry sketch below).
  2. The client fails to discover the service during startup when using wait_for_service, even though the service has started up fine.

Is there anything on the ROS 2 roadmap that will help address service reliability in the near term? We just switched to ROS 2, and this issue is hurting our ability to build a stack around it, because a service-driven infrastructure is critical to efficient use of CPU on low-compute platforms.
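
A defensive wrapper along the lines of the timeout-and-retry behavior reported above could look like the sketch below (not from this comment; the helper name, attempt count, and timeouts are illustrative). It simply abandons a request that gets no response within the timeout and re-sends it.

import rclpy

def call_with_retry(node, client, request, attempts=3, timeout_sec=2.0):
    """Re-send a service request if no response arrives within timeout_sec."""
    for attempt in range(attempts):
        if not client.wait_for_service(timeout_sec=timeout_sec):
            node.get_logger().warn('service not available yet, retrying discovery')
            continue
        future = client.call_async(request)
        rclpy.spin_until_future_complete(node, future, timeout_sec=timeout_sec)
        if future.done():
            return future.result()
        # The response never arrived (possibly lost to the discovery race);
        # abandon this future and try again with a fresh request.
        node.get_logger().warn('no response on attempt %d, retrying' % (attempt + 1))
    return None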

@MarcoMatteoBassa

MarcoMatteoBassa commented Jun 24, 2022

Hi @aditya2592, are you running into these issues while using Docker, or also on a normal installation?
We could reproduce the first issue (see https://github.com/MarcoMatteoBassa/reproduce_client_bug), but only while running in a Docker container (not specifically with fastrtps; it happens with cyclone too), so we are wondering if there's any Docker-specific problem.

@aditya2592

aditya2592 commented Jun 24, 2022

Hi @MarcoMatteoBassa, yes, we are also running through Docker. That's awesome that you were able to reproduce it; I have been having trouble doing that because it is indeed affected by system load and low-compute platforms. On top of that, everything we have so far is on the same system, i.e. on localhost itself. Because of these issues on the same machine, we haven't been able to expand to ROS 2 across machines. We also switched to using the Fast DDS discovery server with the config below, as suggested in the tutorials, though I am still unsure why a discovery server would be needed to discover services on the same machine.

<?xml version="1.0" encoding="UTF-8" ?>
<dds>
    <profiles xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
        <participant profile_name="super_client_profile" is_default_profile="true">
            <rtps>
                <sendSocketBufferSize>1048576</sendSocketBufferSize>
                <listenSocketBufferSize>4194304</listenSocketBufferSize>
                <builtin>
                    <discovery_config>
                        <discoveryProtocol>SUPER_CLIENT</discoveryProtocol>
                        <discoveryServersList>
                            <RemoteServer prefix="44.53.00.5f.45.50.52.4f.53.49.4d.41">
                                <metatrafficUnicastLocatorList>
                                    <locator>
                                        <udpv4>
                                            <address>127.0.0.1</address>
                                            <port>11811</port>
                                        </udpv4>
                                    </locator>
                                </metatrafficUnicastLocatorList>
                            </RemoteServer>
                        </discoveryServersList>
                    </discovery_config>
                </builtin>
            </rtps>
        </participant>
    </profiles>
</dds>
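
For reference, a profile like the one above is normally picked up by exporting FASTRTPS_DEFAULT_PROFILES_FILE before any node starts, since Fast DDS reads it when the DDS participant is created (at context init). The sketch below sets it from Python purely to keep the examples in one language; the path is illustrative.

import os
import rclpy

# Fast DDS reads this when the participant is created, so it must be set
# before rclpy.init() (normally via `export` in the shell).
os.environ['FASTRTPS_DEFAULT_PROFILES_FILE'] = '/path/to/super_client_profile.xml'

rclpy.init()  # participants created from here on pick up the super-client profile
node = rclpy.create_node('discovery_server_demo')
# ... create clients, services, etc. as usual ...
node.destroy_node()
rclpy.shutdown()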

@MarcoMatteoBassa

@aditya2592 did you run into those issues even when running on cyclone-DDS (without the discovery server) on galactic?
For Humble apparently something is moving https://discourse.ros.org/t/nav2-issues-with-humble-binaries-due-to-fast-dds-rmw-regression/26128/2

@fujitatomoya
Collaborator

@MiguelCompany could this be related to #616?

@aditya2592

> @aditya2592 did you run into those issues even when running on cyclone-DDS (without the discovery server) on galactic? For Humble apparently something is moving https://discourse.ros.org/t/nav2-issues-with-humble-binaries-due-to-fast-dds-rmw-regression/26128/2

We had a lot more issues with Cyclone, so we stopped experimenting with it early on. There were memory leaks, and general latency was not good with Cyclone. But I have seen issues on both Galactic and Humble using Fast DDS with/without the discovery server, with all services and clients on the same machine.

@aditya2592

> @aditya2592 did you run into those issues even when running on cyclone-DDS (without the discovery server) on galactic? For Humble apparently something is moving https://discourse.ros.org/t/nav2-issues-with-humble-binaries-due-to-fast-dds-rmw-regression/26128/2

@MarcoMatteoBassa Were you able to run your experiment with the fix in this PR, #616?

@MarcoMatteoBassa

MarcoMatteoBassa commented Jul 7, 2022

Hi @aditya2592 @fujitatomoya, I just tried using the latest version of osrf/ros2:nightly (verified that the specified fix is there) as the base for my test in https://github.com/MarcoMatteoBassa/reproduce_client_bug, but I can still reproduce the issue.

@fujitatomoya
Collaborator

@MarcoMatteoBassa thanks for checking. I cannot reproduce it with your test, but technically there is a race condition in service discovery, as described in this thread.

@MarcoMatteoBassa

@fujitatomoya thanks for trying. JFI, here is the corresponding issue (opened on the Docker images repo because I can only reproduce it on Docker): osrf/docker_images#628

@clalancette
Contributor

The change in #616 has been merged, but not released yet. That means it is definitely not in the docker images yet. We'll have to do a new release of rmw_fastrtps into Rolling, and then rebuild the docker images before the test is valid.

In the meantime, @MarcoMatteoBassa is there any chance you can test from sources instead? You can follow the instructions at https://docs.ros.org/en/rolling/Installation/Alternatives/Ubuntu-Development-Setup.html to do that.

@aditya2592

aditya2592 commented Jul 7, 2022

@MarcoMatteoBassa I incorporated the fix into my Docker image as shown below, building only rmw_fastrtps from source on the fixed branch. The rest of ROS 2 is still from the existing Docker image on Docker Hub. @clalancette, this approach should work, right?

ARG ROS_DISTRO=galactic
FROM ros:${ROS_DISTRO}-ros-base

WORKDIR /root/ws_ros
# RMW Fast DDS
# NOTE: this branch is used to incorporate fix for services
RUN git clone https://github.com/ros2/rmw_fastrtps/ -b mergify/bp/${ROS_DISTRO}/pr-616 src/rmw_fastrtps
RUN . /opt/ros/${ROS_DISTRO}/setup.sh && \
    rosdep update && \
    apt-get -q update && \
    DEBIAN_FRONTEND=noninteractive \
    # Install RMW Fast RTPS deps
    rosdep install -y --from-paths src/rmw_fastrtps --ignore-src --rosdistro ${ROS_DISTRO} --as-root=apt:false && \
    colcon build \
            --cmake-args -DCMAKE_BUILD_TYPE=RelWithDebInfo -DBUILD_TESTING=OFF -DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
            --ament-cmake-args -DCMAKE_BUILD_TYPE=RelWithDebInfo \
            --event-handlers desktop_notification- status- && \
    # Remove the source code from this container
    rm -rf src build && \
    ccache -s

Another thing I am trying is to limit transports to shared memory only. This definitely reduced the number of service requests/responses getting dropped on one of our PCs, but drops still happen occasionally:

<?xml version="1.0" encoding="UTF-8" ?>
<dds>
    <profiles xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
        <transport_descriptors>
            <!-- Create a descriptor for the new transport -->
            <transport_descriptor>
                <transport_id>shm_transport</transport_id>
                <type>SHM</type>
            </transport_descriptor>
        </transport_descriptors>

        <participant profile_name="super_client_profile" is_default_profile="true">
            <rtps>
                <!-- Link the Transport Layer to the Participant -->
                <userTransports>
                    <transport_id>shm_transport</transport_id>
                </userTransports>
                <useBuiltinTransports>false</useBuiltinTransports>
            </rtps>
        </participant>
    </profiles>
</dds>

@MarcoMatteoBassa

@clalancette I verified that on osrf/ros2:nightly the installed rmw_fastrtps_shared_cpp/custom_client_info.hpp and rmw_fastrtps_shared_cpp/custom_service_info.hpp were modified as in the fix https://github.com/ros2/rmw_fastrtps/pull/616/files.
But sure, in the coming days I can also try it from source. @aditya2592 Thanks for the dockerfile!

@MarcoMatteoBassa

Hi @clalancette, I just tested with a full source installation, but this is unfortunately still reproducible.

@aditya2592

aditya2592 commented Jul 18, 2022

Update from my end on this: I used the following config to completely eliminate any transport except SHM and also assigned a 75 MB segment size. I also had to make all service callbacks fall into a Reentrant callback group (see the sketch after the config below), even though they don't need to be called in parallel and have only one client requesting at a time. Based on experiments so far, this removed service drops on Humble. It also seems to allow discovery without using the discovery server. Other things I tried that didn't work:

  • SHM-only transport but no segment_size increase. This caused an enormous number of drops.
  • export ROS_LOCALHOST_ONLY=1

Even though this works for our use case right now, ideally services should be reliable with any transport in ROS.

<?xml version="1.0" encoding="UTF-8" ?>
<dds>
    <profiles xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
        <transport_descriptors>
            <!-- SHM transport should be the only one used for IPC on same machine -->
            <transport_descriptor>
                <transport_id>shm_transport</transport_id>
                <type>SHM</type>
                <!-- 75MB is assigned to each participant; one participant is equivalent to one process -->
                <!-- Multiply this by the total number of processes in the system to get the overall SHM consumption -->
                <segment_size>78643200</segment_size>
                <maxMessageSize>78643200</maxMessageSize>
            </transport_descriptor>
        </transport_descriptors>

        <participant profile_name="super_client_profile" is_default_profile="true">
            <rtps>
                <!-- SHM only is enabled to enforce SHM use because currently all ROS2 comms happen on same machine -->
                <userTransports>
                    <transport_id>shm_transport</transport_id>
                </userTransports>
                <!-- Prevent builtin which automatically chooses transport -->
                <useBuiltinTransports>false</useBuiltinTransports>
            </rtps>
        </participant>
    </profiles>
</dds>
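
A sketch of the Reentrant callback group arrangement mentioned above (not from this comment; the node name, service name, and Trigger type are made up) could look roughly like this, with a MultiThreadedExecutor so that reentrant callbacks can actually run concurrently:

import rclpy
from rclpy.node import Node
from rclpy.callback_groups import ReentrantCallbackGroup
from rclpy.executors import MultiThreadedExecutor
from std_srvs.srv import Trigger

class ReentrantServiceNode(Node):
    def __init__(self):
        super().__init__('reentrant_service_node')
        self._group = ReentrantCallbackGroup()
        # Every service on this node shares the reentrant group, so a blocked
        # callback cannot starve the others.
        self.create_service(Trigger, 'do_thing', self._handle, callback_group=self._group)

    def _handle(self, request, response):
        response.success = True
        response.message = 'ok'
        return response

def main():
    rclpy.init()
    node = ReentrantServiceNode()
    executor = MultiThreadedExecutor()
    executor.add_node(node)
    try:
        executor.spin()
    finally:
        rclpy.shutdown()

if __name__ == '__main__':
    main()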

@Jayden-F

#691

Is this planned to be fixed in Humble?
