-
Notifications
You must be signed in to change notification settings - Fork 117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve service discovery #392
Comments
@richiware Friendly FYI. Does eProsima have any plans to work on this one? |
We were talking this morning about this. We have in mind to support deeply in the future the DDS-RPC standard on FastDDS. But right now the service discovery issue will be addressed by #390 solution. There are other features/issues that need our focus. |
@richiware #390 is not a solution, it is a workaround at best and seemingly not an effective one for all cases (see ros2/rclcpp#1152 and ros2/ros2#931). To the extent that it is possible, please, do consider addressing this issue at some point in the near future. |
We've added your suggestion on our roadmap. |
To summarize issues from ros2/ros2#931 with respect to navigation reported by @daisukes Upgrading to 2.x.x from 1.x.x results in services failing to function (<5% successful calls), which is a safety issue with fast-dds. Rolling back fixes this. |
@richiware With the report mention in the previous comment I think there is a clear regression which should be addressed asap. |
@MiguelCompany could you help in this? |
@JaimeMartin can we get an update on this? This is a safety critical issue and we may have to switch primary support and instructions for Navigation2 to Cyclone if we can't reliably call services in Foxy. It could genuinely cause complete system failure. |
@SteveMacenski Looking at ros2/ros2#931 I see all the analysis was done before #390 was merged. I know it is a workaround and it has not made its way to a Foxy patch release (it is only in master), but have you checked if it improves the behavior? Regarding the commit that seems to be responsible, it is the merge commit of eProsima/Fast-DDS#1190. Among other things, that PR fixes a wrong behavior with volatile late-joiners. Before the PR, volatile late-joiners were (incorrectly) receiving old samples in some cases, and fixing that to be compliant with RTPS standard, makes the service discovery race more likely to happen. Commits affecting volatile late-joiners are eProsima/Fast-DDS@a1c35b5 and eProsima/Fast-DDS@30f8f6f Nevertheless, we have just started analyzing ros-navigation/navigation2#1772 and, at the same time, designing a proper service discovery on rmw_fastrtps. |
We also had to drop some unit tests as well because of this issue causing them to fail with Fast-RTPS only ros-navigation/navigation2#1856 and reports of various elements of navigation2 not working (recovery behaviors) due to services not working. Have you released this update so that it will be in ROS2 foxy after the sync Friday (https://discourse.ros.org/t/preparing-for-foxy-sync-and-patch-release-2020-07-10/15265)? |
The workaround in #390 dropped from my radar. I've added it to the project board and will make sure it gets into the next Foxy sync. |
@dirk-thomas @jacobperron As you can see here the issue has been solved by eProsima/Fast-DDS#1295. I hope this can be released in foxy ASAP. |
@MiguelCompany With eProsima/Fast-DDS#1295, does it mean that the changes introduced in #390 are no longer required (or should #390 still be backported to Foxy)? |
@jacobperron No, the changes in #390 are still necessary, as there are still chances of asymmetric discovery of endpoints. Nevertheless, we are preparing a design for a different approach to how the service discovery race is handled. Although the new mechanism will be inspired by the DDS-RPC standard, it will not strictly follow it. We will open a PR in a few days with the proposed design. |
@MiguelCompany If you can create a new release tag for the Fast-DDS |
I would suggest to wait until eProsima/Fast-DDS#1304 has landed too (which should hopefully be soon - CI is running as we speak). |
@dirk-thomas @jacobperron The warnings related PRs have been merged, and I have created tag |
We have done the design to address the problem with the service discovery. You can find it documented on PR #418 |
Hi, I am facing several issues with service reliability when using ROS2 Galactic with Fast DDS over SHM. Also using discovery server. Most common occurrences:
Is there anything on the ROS2 roadmap that will help address service reliability in the near term? We just switched to ROS2 and this issue is hurting our ability to build a stack around it, because being able to have a service driven infrastructure is critical to efficient use of CPU on low compute platforms. |
HI @aditya2592 , are you guys running into these issues while using docker or also on a normal installation? |
Hi @MarcoMatteoBassa, yes we are also running through a docker, thats awesome that you were able to reproduce it. I have been having trouble doing that because it is indeed effected by system load and low compute platforms. On top of that, everything we have so far is on the same system ie on localhost itself. Because of these issues on the same machine, we havent been able to expand to ROS2 across machines. We also switched to using the Fast DDS discovery server with below config as suggested in the tutorials, though I am still unsure why discovery server would be needed to discover services on the same machine
|
@aditya2592 did you run into those issues even when running on cyclone-DDS (without the discovery server) on galactic? |
@MiguelCompany this could be related to #616? |
We had a lot more issues with cyclone so we stopped experimenting with that early on. There were memory leaks and general latency was not good with cyclone. But I have seen issues on both galactic and humble using Fast DDS with/without discovery server, with all services and clients on the same machine |
@MarcoMatteoBassa Were you able to run your experiment with the fix in this MR - #616 ? |
Hi @aditya2592 @fujitatomoya , I just tried to use the latest version of osrf/ros2:nightly (verified that the specified fix is there) as base for my test in https://github.com/MarcoMatteoBassa/reproduce_client_bug but I can still reproduce the issue |
@MarcoMatteoBassa thanks for checking, i cannot reproduce with your test but technically there is a race condition for service discovery as described on this thread. |
@fujitatomoya thanks for trying, JFI, here is the corresponding issue (opend on the docker images repo because I can only reproduce it on docker) osrf/docker_images#628 |
The change in #616 has been merged, but not released yet. That means it is definitely not in the docker images yet. We'll have to do a new release of In the meantime, @MarcoMatteoBassa is there any chance you can test from sources instead? You can follow the instructions at https://docs.ros.org/en/rolling/Installation/Alternatives/Ubuntu-Development-Setup.html to do that. |
@MarcoMatteoBassa I incorporated the fix into my Docker image like below by only building
Also another thing I am trying is to limit transports to only shared memory. This for sure reduced service request/response getting dropped on one of our PCs but they still happen occasionally:
|
@clalancette I verified that on osrf/ros2:nightly the installed rmw_fastrtps_shared_cpp/custom_client_info.hpp and rmw_fastrtps_shared_cpp/custom_service_info.hpp were modified as in the fix https://github.com/ros2/rmw_fastrtps/pull/616/files. |
Hi @clalancette , just tested with a full src installation, but this is unfortunately still reproducible |
Update from my end on this, used the following config to completely eliminate any transport except SHM and also assigned a 75MB segment size. I also had to make all service callbacks fall in the
Even though this works for our use case right now, services should be reliable for any transport on ROS ideally.
|
Is this planning to be fixed in Humble? |
Feature request
Feature description
Follow-up issue after #390. Prior to that PR, services' discovery would rely on built-in request/reply topics discovery, which can and did lead to races (see ros2/ros2#922). A short-term workaround was then introduced by #390, but not without caveats i.e. blocking behavior of
rmw_send_response
, non-standard (ab)use of sample identity data.OMG DDS-RPC 1.0 spec describes both the issues associated with independent discovery of request/reply topics (see Basic Service Mapping under section 7.6.1) and an algorithm to prevent them (see Enhanced Service Mapping under section 7.6.2). Services' implementation in
rmw_fastrtps*_cpp
could and probably should move in that direction.The text was updated successfully, but these errors were encountered: