Skip to content
This repository has been archived by the owner on Feb 4, 2021. It is now read-only.

test_cli.test_params_yaml flaky #166

Closed
sloretz opened this issue Feb 21, 2019 · 7 comments · Fixed by ros2/system_tests#369
Closed

test_cli.test_params_yaml flaky #166

sloretz opened this issue Feb 21, 2019 · 7 comments · Fixed by ros2/system_tests#369
Labels
duplicate This issue or pull request already exists

Comments

@sloretz
Copy link

sloretz commented Feb 21, 2019

Failures in nightly_linux-aarch64_debug #734, nightly_linux-aarch64_extra_rmw_release #266, nightly_linux-aarch64_release #694, nightly_linux_debug #1109, nightly_linux_extra_rmw_release #262, nightly_linux_release #1091, nightly_osx_debug #1165, nightly_osx_extra_rmw_release #303, nightly_osx_release #1172, nightly_win_deb #1165, nightly_win_extra_rmw_rel #269, nightly_win_rel #1104, nightly_xenial_linux-aarch64_release #264, nightly_xenial_linux_release #255

projectroot.test_client
rclcpp_action.test_client.gtest.missing_result

Failures in nightly_linux-aarch64_extra_rmw_release #266, nightly_linux-aarch64_release #694, nightly_linux_extra_rmw_release #262, nightly_linux_release #1091, nightly_win_extra_rmw_rel #269, nightly_win_rel #1104, nightly_xenial_linux-aarch64_release #264, nightly_xenial_linux_release #255

projectroot.test_params_yaml
test_cli.test_params_yaml.xunit.missing_result

@dirk-thomas narrowed it down to something in FastRTPS 1.7.1 @richiprosima FYI
Build Status Fast-RTPS f2516be21df726baac4da22c39a6395f06164594
Build Status Fast-RTPS b48ce9d2fba6fc94e756da01c58b72f2ad848238
Build Status Fast-RTPS 846307b9baaaff49ad5d3aee68b9cb509851e4f3
Build Status Fast-RTPS d2ec94e7a665d689b24e8ae71c9b83c9efec45db
Build Status Fast-RTPS bc61fa3fab63ced48416db86ebb8df4e0f0cebdf
Build Status Fast-RTPS 92f2e37ef682988e5cfc4f320de4712630893009
Build Status Fast-RTPS 5827bb94408f3d5d2284d2e77b482ff84738ef3f
Build Status Fast-RTPS cb0a6e237eee4985ef8b1ffe6afbebd8e4116d04

@sloretz
Copy link
Author

sloretz commented Feb 21, 2019

Looks like the tests pass before eProsima/Fast-DDS#411 was merged

@richiware
Copy link

I was debugging the problem. @sloretz, as your said, it is related to the commit where the default values of reliable communications for discovery endpoints have changed. Right now in the simple scenario of one publisher and one subscriber, the discovery take ~100milliseconds more. But the performance improves according to the number of publisher and subscriber increase.

Your test creates a client and immediately send the request. Now this request is sent before the discovery of publisher and subscriber. Sniffing the RTPS messages I saw the publisher is VOLATILE. Then after it sends the request to no one, it removes internally it. When the subscriber is discovered, the publisher doesn't have the request.

@sloretz
Copy link
Author

sloretz commented Feb 22, 2019

@richiware Thanks for looking into it. Sounds like this is a problem with the tests rather than Fast-RTPS.

@sloretz
Copy link
Author

sloretz commented Feb 27, 2019

It looks like the test_cli test is waiting for the server before sending a request, so double checking to see if that failure was exposed by the change in fastrtps or just a coincidence

  • Build Status Fast-RTPS bc61fa3fab63ced48416db86ebb8df4e0f0cebdf
  • Build Status Fast-RTPS master

@wjwwood
Copy link
Member

wjwwood commented Mar 7, 2019

I just got a failure in test_cli which I think is not related to the pr that was being tested, though it's hard to say if that's the case or not:

ros2/launch#193 (comment)

Was this just a case of flakyness or a consistent issue? It looks like it is in the nightly repeated, so I'm guessing flaky:

https://ci.ros2.org/view/nightly/job/nightly_linux_repeated/1314/testReport/

@sloretz
Copy link
Author

sloretz commented Mar 7, 2019

@wjwwood it's been failing since eProsima/Fast-DDS#411. I tried to root cause it, but couldn't figure it out. It appears to be a race condition in the test itself rather than a fastrtps issue.

@dirk-thomas
Copy link
Member

The test seems to fail with FastRTPS only.

Repeat-until-pass builds:

  • Connext Build Status
  • FastRTPS Build Status
  • OpenSplice Build Status

Repeat-until-fail builds:

  • Connext Build Status
  • FastRTPS Build Status
  • OpenSplice Build Status

The failing test is waiting for the client to get a response on the sent request for the parameters while the service side never receives the request. The most likely reason I can see is that the client sends the request before the subscription for the request from the server has been matched (basically sending the request to /dev/null).

I thought the following snippet would already make sure that the publisher / subscriber on the request topic have actually been matched before proceeding with sending the request: https://github.com/ros2/rmw_fastrtps/blob/ea4fb7cfaa671a2118a7d54203c786d890d8ed38/rmw_fastrtps_shared_cpp/src/rmw_service_server_is_available.cpp#L107-L114 This logic was added in ros2/rmw_fastrtps/pull/238. I

Comparing that logic with the code in the publisher/subscriber listeners which was added in ros2/rmw_fastrtps#234 the logic using a set of endpoints seems more robust in case events are happening multiple times.

With the patch in ros2/rmw_fastrtps/pull/262 the test seem to be at least less flaky:

Repeat-until-pass builds:

  • FastRTPS Build Status

Repeat-until-fail builds:

  • FastRTPS Build Status

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
duplicate This issue or pull request already exists
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants