Fast-RTPS services and network discovery regression (Local costmap not appearing or clear costmap not called) #1772

Closed
dkuenster opened this issue May 27, 2020 · 28 comments
Labels
1 - High High Priority bug Something isn't working

Comments

@dkuenster
Contributor

Bug report

Required Info:

  • Operating System:
    • Debian 10 Buster
  • Version or commit hash:
  • DDS implementation:
    • Fast-RTPS

Steps to reproduce issue

Use tb3_simulation_launch.py to start the Gazebo simulation and the nav2 stack. Then use "2D Pose Estimate" to localize the robot.

Expected behavior

Local Costmap should show every time, e.g:
correct_launch

Actual behavior

In more than 50% of all initializations the Local Costmap doesn't show, e.g.:
faulty_launch

Echoing /local_costmap/costmap shows that the costmap contains all zeroes despite the robot being in the same position as in the working case, where the costmap contains actual values.
Rviz doesn't report any issues with the topics.
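
A minimal sketch of the "all zeroes" check described above, so it does not have to be done by eye on echoed output. It assumes the costmap cells arrive as a flat integer array, as in the `data` field of a `nav_msgs/msg/OccupancyGrid` message; the function names `summarize_costmap` and `looks_empty` are hypothetical helpers, not part of nav2.

```python
from typing import Dict, Iterable


def summarize_costmap(data: Iterable[int]) -> Dict[str, int]:
    """Count total and non-zero cells in a flat costmap data array."""
    cells = list(data)
    nonzero = sum(1 for value in cells if value != 0)
    return {"cells": len(cells), "nonzero": nonzero}


def looks_empty(data: Iterable[int]) -> bool:
    """Return True if the costmap has cells and every one of them is zero."""
    summary = summarize_costmap(data)
    return summary["cells"] > 0 and summary["nonzero"] == 0
```

In a subscriber callback this could be called as `looks_empty(msg.data)` to flag the faulty start-ups automatically.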

Additional information

console_output_empty_costmap.txt
console_output_working_costmap.txt

I can't find any differences or errors in the console output. Can anyone reproduce the issue, or does anyone have an idea what is happening?

@SteveMacenski
Member

SteveMacenski commented May 27, 2020

Is this the same rviz window or a new one on a new navigation launch? Can you verify that if you toggle the rviz display types for the costmap (or relaunch rviz) that it appears? I think what you're seeing has nothing to do with navigation but rather a failure in the visualization tools.

Even 0s would show up here with a boundary because of the changes in transparency between the 2 costmap settings. I think you would see 0s on the costmap in your pictures if the costmap were actually being shown.

Not sure it relates, but Buster is also not a Tier 1 supported OS, so it may be that the DDS vendors / RMW layers don't do detection properly on it. Not sure it's related, but it certainly could be.

I was 4/4 in launching them just now - so might want to take a second look and make sure what's happening is what you think is happening.

@naiveHobo
Contributor

> Not sure it relates, but Buster is also not a Tier 1 supported OS

I face this issue too on Ubuntu 18.04 so it shouldn't be related to Buster.

@naiveHobo
Contributor

I looked into this a bit more, the local costmap is set to all 0s for some reason.

As @dkuenster said, this happens more than 50% of the time when starting up the simulation. When it happens, only the static layer seems to be working for the global costmap as well. If a static layer is added to the local costmap, the local costmap also has non-zero values and shows up every time.

I'm not completely sure what the error with the rest of the layers is exactly, trying to look into it.

Screenshot from 2020-05-28 02-12-40

@SteveMacenski
Member

Your image makes it seem like the laser is up and running, given that I see some red in the center pole that is off the map (from robot localization quality and laserscans). I get your point if that happens, but this image isn't a good example of it.

Let me know what you find out.

@dkuenster
Contributor Author

> Is this the same rviz window or a new one on a new navigation launch? Can you verify that if you toggle the rviz display types for the costmap (or relaunch rviz) that it appears? I think what you're seeing has nothing to do with navigation but rather a failure in the visualization tools.

It was a new window with a new navigation launch. When switching the visualization, I get the same result that can be seen in the screenshot of @naiveHobo.

@dkuenster
Contributor Author

dkuenster commented May 28, 2020

I also found that each time the Local Costmap doesn't appear, the Controller only gets 0 as the initial velocity in the twist message of the computeVelocityCommand, despite the robot moving and the odom topic containing the correct velocities. On a start where the Local Costmap comes up correctly, on the other hand, the actual current velocity gets passed to the controller.

The pose parameter however works correctly in both cases.

@SteveMacenski SteveMacenski added 1 - High High Priority bug Something isn't working labels May 28, 2020
@dkuenster
Contributor Author

While echo constantly shows msgs on the "odom" topic in both cases, the OdomSubscriber in the Controller gets messages on some starts, and on others its callback method never gets called. Each time it doesn't get messages, we also get the problem with the local costmap plugins as soon as we set the initial pose.
I don't know how it is related, but something seems to go wrong before we even set an initial pose.

@dkuenster
Contributor Author

> While echo constantly shows msgs on the "odom" topic in both cases, the OdomSubscriber in the Controller gets messages on some starts, and on others its callback method never gets called. Each time it doesn't get messages, we also get the problem with the local costmap plugins as soon as we set the initial pose.
> I don't know how it is related, but something seems to go wrong before we even set an initial pose.

Same problem with the LaserScanSubscriber in the Obstacle Layer. On the starts where the OdomSubscriber callback never gets called, the callback in the LaserScan subscriber also doesn't get called despite echo showing messages on "scan".
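
A minimal sketch of how the "callback never gets called" symptom could be detected at runtime rather than by adding log statements. `CallbackWatchdog` is a hypothetical helper, not part of nav2 or rclpy; it assumes each subscriber callback calls `mark()` with its topic name, and the clock is injectable so the logic can be exercised without a running ROS 2 graph.

```python
import time
from typing import Callable, Dict, List, Optional


class CallbackWatchdog:
    """Track when each watched topic last triggered its callback."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # None means the callback has never fired for that topic.
        self._last_seen: Dict[str, Optional[float]] = {}

    def watch(self, topic: str) -> None:
        """Register a topic that is expected to deliver messages."""
        self._last_seen.setdefault(topic, None)

    def mark(self, topic: str) -> None:
        """Record that a callback fired; call this from the subscriber callback."""
        self._last_seen[topic] = self._clock()

    def silent_topics(self, max_age: float) -> List[str]:
        """Return topics whose callback never fired or last fired over max_age seconds ago."""
        now = self._clock()
        return sorted(
            topic
            for topic, seen in self._last_seen.items()
            if seen is None or now - seen > max_age
        )
```

On a faulty start, `silent_topics()` would be expected to list both "odom" and "scan" even though echoing those topics shows messages.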

@SteveMacenski
Member

Just to verify, what you're describing are specific instances of topics that are being published that have not yet connected to the costmaps, correct?

Can you try seeing if switching DDS vendors to Cyclone DDS resolves those issues? I'm wondering if there was a regression or an issue with the local discovery in Fast-RTPS. What version of ROS2 are you on right now (eloquent, master, foxy, etc.)?

@dkuenster
Contributor Author

> Just to verify, what you're describing are specific instances of topics that are being published that have not yet connected to the costmaps, correct?

Yes.

> Can you try seeing if switching DDS vendors to Cyclone DDS resolves those issues? I'm wondering if there was a regression or an issue with the local discovery with Fast-RTPS. What version of ROS2 are you on right now (eloquent, master, foxy, etc)

Switching to Cyclone DDS indeed solves this problem.
Switching back to version v1.10.0 of Fast-RTPS, as suggested in #1788, also solves the problem.
So it seems to be an issue introduced in newer versions of Fast-RTPS.
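
A minimal sketch of switching DDS vendors for a test run. It relies on the `RMW_IMPLEMENTATION` environment variable and the `ros2 launch nav2_bringup tb3_simulation_launch.py` command, both of which appear elsewhere in this thread; the helper name `launch_with_rmw` is hypothetical. The function only builds the command and environment, so it can be inspected without ROS 2 installed.

```python
import os
from typing import Dict, List, Mapping, Optional, Tuple


def launch_with_rmw(
    rmw: Optional[str],
    base_env: Optional[Mapping[str, str]] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Build the launch command and an environment selecting the given RMW.

    Passing rmw=None removes RMW_IMPLEMENTATION so the default vendor is used.
    """
    env = dict(os.environ if base_env is None else base_env)
    if rmw is None:
        env.pop("RMW_IMPLEMENTATION", None)
    else:
        env["RMW_IMPLEMENTATION"] = rmw
    cmd = ["ros2", "launch", "nav2_bringup", "tb3_simulation_launch.py"]
    return cmd, env
```

The result can be handed to `subprocess.run(cmd, env=env, check=True)`, for example with `rmw="rmw_cyclonedds_cpp"` for Cyclone DDS or `rmw="rmw_fastrtps_cpp"` for Fast-RTPS.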

@SteveMacenski
Member

Ah ok, yeah, that appears to be the same issue as #1788 and ros2/ros2#931. Can you quickly verify that the commit eProsima/Fast-DDS@a9bd1a9 is the offender? If so, we can merge these 2 tickets together and track them.

@dkuenster
Contributor Author

Yes, it works right until commit eProsima/Fast-DDS@d5c9d6b (the commit right before eProsima/Fast-DDS@a9bd1a9) and then breaks on eProsima/Fast-DDS@a9bd1a9
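
The thread does not say how the offending commit was located; a generic way to narrow a regression down to one commit is a binary search over the history, which is what `git bisect` automates. A minimal sketch, assuming a list of commits ordered oldest to newest with the first known good and the last known bad; `first_bad_commit` and the `is_good` predicate (which would build and test one commit) are hypothetical.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_good: Callable[[str], bool]) -> str:
    """Return the earliest bad commit.

    Assumes commits[0] is good, commits[-1] is bad, and that every commit
    after the first bad one is also bad.
    """
    if len(commits) < 2:
        raise ValueError("need at least one good and one bad commit")
    lo, hi = 0, len(commits) - 1  # lo is always good, hi is always bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(commits[mid]):
            lo = mid
        else:
            hi = mid
    return commits[hi]
```

With the two hashes reported above, a search over a history containing d5c9d6b (good) immediately followed by a9bd1a9 (bad) ends at a9bd1a9.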

@SteveMacenski
Member

I'm rolling the scope of #1788 into this one so we have 1 ticket per issue, and renaming this issue to Fast-RTPS services and network discovery regression. We should track that upstream issue but also potentially move to Cyclone DDS for development since that doesn't exhibit the issue.

@SteveMacenski SteveMacenski changed the title Local Costmap not showing on every start Fast-RTPS services and network discovery regression (Local costmap not appearing or clear costmap not called) Jun 8, 2020
@SteveMacenski SteveMacenski added this to the Galactic Milestone milestone Jul 1, 2020
@MiguelCompany

MiguelCompany commented Jul 10, 2020

I checked this using commit 69977cd + current ros2 master, and running several experiments. For each experiment I followed this procedure:

  1. Start wireshark capture
  2. run RMW_IMPLEMENTATION=<impl> ros2 launch nav2_bringup tb3_simulation_launch.py 2>&1 | tee console.txt
  3. Wait for everything to start (including gazebo showing the turtlebot waffle)
  4. Use 2D Pose Estimate button
  5. Wait for local_costmap status showing increasing reception counts
  6. Use navigation goal
  7. Wait for navigation to complete
  8. Close rviz
  9. Stop and export wireshark capture
  10. Move files from ~/.ros/log into /ros-log

As I work with Windows, I ran the experiments using VirtualBox to run Ubuntu Focal on a virtual machine.

I have checked with rmw_cyclonedds_cpp and rmw_fastrtps_cpp. For the latter, I have checked with eProsima/Fast-DDS@b710b1f (current head of the 2.0.x branch) as well as with eProsima/Fast-DDS@d5c9d6b.

I have never been able to see the expected image. Sometimes rviz crashed. Other times I could navigate correctly, but the local costmap was not shown. A summary of the results so far:

| ROS 2 repos file | rmw implementation | result | result files |
| --- | --- | --- | --- |
| master | rmw_cyclonedds_cpp | rviz crashed after step 4 | here |
| master | rmw_cyclonedds_cpp | navigation complete. local costmap not shown | here |
| master | rmw_fastrtps_cpp | rviz crashed after step 4 | here |
| master | rmw_fastrtps_cpp | navigation complete. local costmap not shown | here |
| Fast-DDS-d5c9d6bcd | rmw_fastrtps_cpp | navigation complete. local costmap not shown | here |
| Fast-DDS-d5c9d6bcd | rmw_fastrtps_cpp | navigation complete. local costmap not shown | here |

My impression is that now that both implementations have workarounds to make services more reliable, this issue is always reproduced, so maybe there is something wrong in navigation2 that is now reproducibly failing.

NB: It would be nice if someone could check this with RTI Connext.

@SteveMacenski
Member

SteveMacenski commented Jul 10, 2020

[rviz2-4] what(): InternalErrorException: Cannot create GL vertex buffer in GLHardwareVertexBuffer::GLHardwareVertexBuffer at /home/miguel/ros2_master/build/rviz_ogre_vendor/ogre-v1.12.1-prefix/src/ogre-v1.12.1/RenderSystems/GL/src/OgreGLHardwareVertexBuffer.cpp (line 46)

For rviz crashing, I can't help you on that unless it's a result of the navigation2 plugins, but I don't think that's the case. If you run with debug symbols and it's our fault, I'll look into it, but I think that's rviz.

Keep in mind it's not just about the costmap showing up. The issue we're talking about is services, which those experiments don't do anything to measure. Services can be trivially tested without the navigation stack with some simple call-response nodes.

@daisukes thoughts? I'm not read up on or tracking Fast-RTPS commits, so those hashes or the specific changes don't mean much to me (I'm an expert in robotics, not DDS/networking). Have you reproduced the service problem at all from the reports? That's the best starting point, one that I have also experienced and that we still see in the navigation2 CI. Once you've reproduced the problem, I think it will be clearer to show that those changes actually fix the underlying problem.

@daisukes
Contributor

@SteveMacenski

As I investigated the commits of Fast-DDS, it worked fine until this commit.
I tested with this simple service test code
ros2/ros2#931 (comment)

terminal 1 $ ros2 launch nav2_bringup tb3_simulation_launch.py     # and give an initial position
terminal 2 $ ros2 run service_test service_test

RMW_IMPLEMENTATION=rmw_cyclonedds_cpp 
[INFO] [1594423870.718509112] [rclcpp]: 0 Successed
[INFO] [1594423872.919848288] [rclcpp]: 0 Successed
[INFO] [1594423874.913237309] [rclcpp]: 0 Successed
...

unset RMW_IMPLEMENTATION (default Fast-RTPS)
[INFO] [1594423963.004730778] [rclcpp]: 0 service not available.
[INFO] [1594423968.228786391] [rclcpp]: 0 service not available.
[ERROR] [1594423974.496116010] [rclcpp]: 0 Failed
[INFO] [1594423979.727908774] [rclcpp]: 0 service not available
...
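
A minimal sketch for tallying output like the above when comparing RMW implementations over longer runs. It assumes the log line layout shown here (`[LEVEL] [timestamp] [logger]: message`) and the three message texts printed by the service test; `tally_service_results` is a hypothetical helper, not part of the test code linked above.

```python
import re
from collections import Counter
from typing import Iterable

# Matches lines such as: [INFO] [1594423870.718509112] [rclcpp]: 0 Successed
LINE_RE = re.compile(
    r"^\[(?P<level>[A-Z]+)\] \[(?P<stamp>[0-9.]+)\] \[(?P<logger>[^\]]+)\]: (?P<text>.*)$"
)


def tally_service_results(lines: Iterable[str]) -> Counter:
    """Count succeeded, failed, and unavailable service calls in log lines."""
    counts: Counter = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip prompts, blank lines, and "..." elisions
        text = match.group("text").lower()
        if "not available" in text:
            counts["unavailable"] += 1
        elif "failed" in text:
            counts["failed"] += 1
        elif "successed" in text:  # spelling as printed by the test program
            counts["succeeded"] += 1
    return counts
```

Feeding it the console output of one run per RMW implementation gives a quick side-by-side count instead of a visual scan of the log.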

We also had rviz2 crash when using the latest binary (after June 25th), so we use the source build with rviz2 v8.1.1, not v8.2.0.
I'm not sure if it is a v8.2.0 problem or a binary problem.
ros2/ros2@fc010c9#diff-215a2eb6c7ad8b20796a9fceb48f8cc7

@SteveMacenski
Member

SteveMacenski commented Jul 11, 2020

Can you file a ticket on rviz2 for that if one doesn't exist? Make sure someone knows there's a problem.

Thanks for the experiment and specification. That will definitely help clear things up.

@daisukes
Contributor

FYI: I made a ticket ros2/rviz#574

@MiguelCompany

MiguelCompany commented Jul 14, 2020

@SteveMacenski @daisukes It seems we found the issue. Could you give eProsima/Fast-DDS#1295 a try?

@daisukes
Contributor

@MiguelCompany I have built the branch and confirmed that the service_test works well and also my own simulation works well with RMW_IMPLEMENTATION=rmw_fastrtps_cpp. Thank you!

@MiguelCompany

@SteveMacenski As eProsima/Fast-DDS#1295 has been merged, and @daisukes checked correct behavior, I think this issue can be closed?

@SteveMacenski
Member

SteveMacenski commented Jul 17, 2020

@MiguelCompany has it been released into foxy?

@SteveMacenski SteveMacenski reopened this Jul 17, 2020
@MiguelCompany

> @MiguelCompany has it been released into foxy?

I don't think so, but I think we should ask @jacobperron about it.

@SteveMacenski
Member

@naiveHobo there's been a foxy sync so this might be OK now

@jacobperron
Contributor

Fast-DDS 2.0.0 is currently the version in Foxy. Once a 2.0.1 tag exists, we can make a new release containing eProsima/Fast-DDS#1295.

@MiguelCompany

@jacobperron v2.0.1 has been released, please go ahead 😉

@MiguelCompany

@SteveMacenski @daisukes v2.0.1 has long ago been released into foxy. This and related issues should have been solved.

@SteveMacenski
Member

I confirmed it's been released now - closing.
