rclcpp_components: test_component_manager_api consistently times out #863

mjcarroll · 2019-09-20T21:07:05Z

Bug report

Required Info:

Operating System:
- Linux
Installation type:
- Source
Version or commit hash:
- Dashing
- Eloquent
DDS implementation:
- Fast-RTPS
Client library (if applicable):
- rclcpp

Steps to reproduce issue

Execute test_component_manager_api

Expected behavior

Test does not time out

Actual behavior

Test times out

Additional information

This was originally reported as a build_cop issue: ros2/build_farmer#184

It's recently become more flaky, causing issues with PR builders and CI jobs (#857) with an attempted fix here #862

The text was updated successfully, but these errors were encountered:

hidmic · 2019-09-24T21:57:51Z

So, it's reported for Fast-RTPS and I cannot reproduce it locally using any other rmw implementation i.e. RTI Connext and ADLink Opensplice work just fine. I suspect it has to do with the time it takes for Fast-RTPS to bring up / tear down node state, but I have to dig in further.

hidmic · 2019-09-25T18:10:24Z

Alright, some debugging suggests it's a race. Running just two tests, the first one passes and the second gets as far as sending a service request. The service server never gets that request, spinning times out and the test ends up waiting for an std::future that never comes (no pun intended).

I went into Fast-RTPS, looking for global state that could explain an interaction between consecutively run tests but I couldn't find any. Maybe eProsima folks can shed some light.

hidmic · 2019-09-25T18:15:44Z

@richiware maybe?

richiware · 2019-09-29T20:42:32Z

We are backporting fixed issues on FastRTPS 1.9 to 1.8. ROS2 Dashing is using FastRTPS 1.8.x releases. Can you test with this branch? It will be next 1.8.2 release. In case it still fails, I will try to reproduce the problem and take a look.

hidmic · 2019-10-01T18:17:40Z

Thanks for reaching out @richiware! But it seems to me you're suggesting that this is specific to Fast-RTPS 1.8.x and, by extension, to Dashing. This is also happening on Eloquent. I just noticed the issue was not clear in this regard -- it's updated now.

hidmic · 2019-10-07T13:53:09Z

FYI no other flakes have happened after #876. We could close this issue BUT I think it'd be good to track the underlying race which we are simply working around now.

richiware · 2019-10-09T05:44:28Z

Yesterday I was able to reproduce it. I will try to figure out what's going on.

richiware · 2019-10-09T13:35:27Z

I was testing modifying the default Durability QoS for services. Testing with TRANSIENT_LOCAL instead of VOLATILE makes pass all tests.
Therefore the service's request is sent before services are discovered between them (at DDS level). When the discovered is complete, due to Durability QoS is VOLATILE, there is no request to be sent again.

richiware · 2019-10-09T14:01:37Z

It is weird because looking in the test's source code I've found wait_for_service() is being used. I'm not familiar with ROS services API, but this function should wait until the discovery is complete between client and server.

hidmic · 2019-10-09T14:02:32Z

Therefore the service's request is sent before services are discovered between them (at DDS level).

Well, to prevent this from ever happening in the first place, each test does check for service server availability before moving forward e.g.:

rclcpp/rclcpp_components/test/test_component_manager_api.cpp

Lines 48 to 53 in b8f7237

    
           auto client = node->create_client<composition_interfaces::srv::LoadNode>( 
        
             "/ComponentManager/_container/load_node"); 
        
           if (!client->wait_for_service(20s)) { 
        
             ASSERT_TRUE(false) << "service not available after waiting"; 
        
           }

I can confirm that (1) it does pass that sanity check right before losing the request and (2) the entire issue goes away if enough time is spent between node tear down and bring up.

All of this seems to suggest that graph updates are driven solely by network discovery. In other words, there's no intraprocess optimization. Thus the graph is essentially out-of-date until a full discovery is performed. Is that reasoning sound to you @richiware ?

richiware · 2019-10-09T14:18:22Z

Sounds plausible. If the graph doesn't have time to be informed about the teardown of the service, maybe wait_for_service() is returning immediately, when it shouldn't.

dirk-thomas · 2019-10-09T15:35:09Z

From a quick look at both tests in rclcpp_components I noticed that they only call rclcpp::init. How about adding the corresponding rclcpp::shutdown call?

If either of you was able to reproduce the test failure before please consider trying it again with the patch proposed in #885.

hidmic · 2019-10-09T15:57:13Z

@dirk-thomas SetUpTestCase() and TearDownTestCase() get called once per test case, where test case in gtest means all the tests associated with a ::testing::TestCase fixture subclass (not the same as SetUp() and TearDown()). Thus, in a setting like the one we had before #876, it won't make a difference. We could init() and shutdown() before and after each test though.

I'm concerned about the underlying issue though. If

All of this seems to suggest that graph updates are driven solely by network discovery. In other words, there's no intraprocess optimization. Thus the graph is essentially out-of-date until a full discovery is performed.

is true, it breaks wait_for_*() methods' contract (or the expectation at least).

wjwwood · 2019-10-09T21:31:22Z

I agree with @hidmic, I would be surprised if shutdown in TearDownTestCase() made a difference, but we can still try it out.

I think it's more likely to be a discovery race condition, no? Like services from previous runs are still around at the beginning of the next test (that would explain why sleeping between tests fixes it). Unless fast rtps does something to "flush" things out in shutdown, I don't think even init/shutdown would fix this, but I could be wrong.

hidmic · 2019-10-09T21:36:01Z

Like services from previous runs are still around at the beginning of the next test (that would explain why sleeping between tests fixes it)

That's what I thought, but after browsing Fast-RTPS code for a while I can't realize how's that possible. And if that's the case, I think it'd be really inconvenient if there's no way to establish a "reliable" link.

dirk-thomas · 2019-10-09T21:45:53Z

Like services from previous runs are still around at the beginning of the next test

That exact case would benefit from this patch. In the repeated CI builds we run a test N times. So when the test itself doesn't shutdown correctly and doesn't unadvertise itself it might affect the re-run of itself.

wjwwood · 2019-10-09T21:48:53Z

That exact case would benefit from this patch. In the repeated CI builds we run a test N times. So when the test itself doesn't shutdown correctly and doesn't unadvertise itself it might affect the re-run of itself.

But I don't think shutdown does anything additional that destroying the node doesn't do. I agree something needs to be done, but I don't think shutdown will do anything. But again we can certainly test it.

wjwwood · 2019-10-09T21:50:48Z

In #885 (comment) @dirk-thomas said:

Without a shutdown the node won't communicate to other entities that it is going down.

Destroying the node is all that is needed. Shutdown (as far as I know) does nothing additional to signal to other nodes that something no longer exists.

hidmic · 2019-10-17T14:32:33Z

@richiware is Domain::removeParticipant supposed to complete the task in full, graph pruning included, before returning? If not (as I suspect), how can we make it so? That is, make sure the Participant has actually been teared down and no trace of it remains.

jacobperron · 2020-06-04T16:42:07Z

This bug sounds related to the more general issues we have with services, see ros2/ros2#922 and ros2/ros2#931.

Since the workaround (#876) seems to have fixed this particular test failure, and we have other issues tracking the issues related to services, I'm going to close this. Feel free to comment and re-open if you think this is a mistake.

* Skip the test on CentOS. Instead of trying to fix the test on CentOS, just skip it. This relies on a file called /etc/os-release being available (which it is on CentOS and Ubuntu). If we see that file, we parse it and find the ID. If we see that the ID is "centos" (including the quotes), we skip compiling and running the test_node tests. Signed-off-by: Chris Lalancette <clalancette@openrobotics.org>

… out (ros2#863) Signed-off-by: Emerson Knapp <eknapp@amazon.com>

mjcarroll added the bug Something isn't working label Sep 20, 2019

mjcarroll self-assigned this Sep 20, 2019

mjcarroll mentioned this issue Sep 20, 2019

test_component_manager_api failed on Linux repeated ros2/build_farmer#184

Closed

clalancette changed the title ~~rclcpp_compnents: test_api consistently times out~~ rclcpp_components: test_api consistently times out Sep 20, 2019

mjcarroll mentioned this issue Sep 24, 2019

[rclcpp_components] Enable intra-process comm via LoadNode request. #871

Merged

hidmic mentioned this issue Sep 26, 2019

Aggregate all component manager API tests. #876

Merged

zmichaels11 mentioned this issue Sep 26, 2019

Remove timeout on tests #862

Closed

dirk-thomas changed the title ~~rclcpp_components: test_api consistently times out~~ rclcpp_components: test_component_manager_api consistently times out Oct 9, 2019

dirk-thomas mentioned this issue Oct 9, 2019

call rclcpp::shutdown() after tests #885

Closed

jacobperron closed this as completed Jun 4, 2020

DensoADAS pushed a commit to DensoADAS/rclcpp that referenced this issue Aug 5, 2022

Restructure test_play_timing to one test per case, to see which times…

307bdc0

… out (ros2#863) Signed-off-by: Emerson Knapp <eknapp@amazon.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rclcpp_components: test_component_manager_api consistently times out #863

rclcpp_components: test_component_manager_api consistently times out #863

mjcarroll commented Sep 20, 2019 •

edited by hidmic

Loading

hidmic commented Sep 24, 2019

hidmic commented Sep 25, 2019

hidmic commented Sep 25, 2019

richiware commented Sep 29, 2019

hidmic commented Oct 1, 2019

hidmic commented Oct 7, 2019

richiware commented Oct 9, 2019

richiware commented Oct 9, 2019

richiware commented Oct 9, 2019

hidmic commented Oct 9, 2019 •

edited

Loading

richiware commented Oct 9, 2019

dirk-thomas commented Oct 9, 2019

hidmic commented Oct 9, 2019

wjwwood commented Oct 9, 2019

hidmic commented Oct 9, 2019

dirk-thomas commented Oct 9, 2019

wjwwood commented Oct 9, 2019

wjwwood commented Oct 9, 2019

hidmic commented Oct 17, 2019 •

edited

Loading

jacobperron commented Jun 4, 2020

rclcpp_components: test_component_manager_api consistently times out #863

rclcpp_components: test_component_manager_api consistently times out #863

Comments

mjcarroll commented Sep 20, 2019 • edited by hidmic Loading

Bug report

Steps to reproduce issue

Expected behavior

Actual behavior

Additional information

hidmic commented Sep 24, 2019

hidmic commented Sep 25, 2019

hidmic commented Sep 25, 2019

richiware commented Sep 29, 2019

hidmic commented Oct 1, 2019

hidmic commented Oct 7, 2019

richiware commented Oct 9, 2019

richiware commented Oct 9, 2019

richiware commented Oct 9, 2019

hidmic commented Oct 9, 2019 • edited Loading

richiware commented Oct 9, 2019

dirk-thomas commented Oct 9, 2019

hidmic commented Oct 9, 2019

wjwwood commented Oct 9, 2019

hidmic commented Oct 9, 2019

dirk-thomas commented Oct 9, 2019

wjwwood commented Oct 9, 2019

wjwwood commented Oct 9, 2019

hidmic commented Oct 17, 2019 • edited Loading

jacobperron commented Jun 4, 2020

mjcarroll commented Sep 20, 2019 •

edited by hidmic

Loading

hidmic commented Oct 9, 2019 •

edited

Loading

hidmic commented Oct 17, 2019 •

edited

Loading