
rclcpp_components: test_component_manager_api consistently times out #863

Closed
mjcarroll opened this issue Sep 20, 2019 · 20 comments
mjcarroll commented Sep 20, 2019

Bug report

Required Info:

  • Operating System:
    • Linux
  • Installation type:
    • Source
  • Version or commit hash:
    • Dashing
    • Eloquent
  • DDS implementation:
    • Fast-RTPS
  • Client library (if applicable):
    • rclcpp

Steps to reproduce issue

  • Execute test_component_manager_api

Expected behavior

Test does not time out

Actual behavior

Test times out

Additional information

This was originally reported as a build_cop issue: ros2/build_farmer#184

It has recently become flakier, causing issues with PR builders and CI jobs (#857), with an attempted fix in #862.

@mjcarroll mjcarroll added the bug Something isn't working label Sep 20, 2019
@mjcarroll mjcarroll self-assigned this Sep 20, 2019
@clalancette clalancette changed the title rclcpp_compnents: test_api consistently times out rclcpp_components: test_api consistently times out Sep 20, 2019
@hidmic (Contributor) commented Sep 24, 2019

So, it's reported for Fast-RTPS, and I cannot reproduce it locally using any other rmw implementation; both RTI Connext and ADLink OpenSplice work just fine. I suspect it has to do with the time it takes Fast-RTPS to bring up / tear down node state, but I have to dig in further.

@hidmic (Contributor) commented Sep 25, 2019

Alright, some debugging suggests it's a race. Running just two tests, the first one passes and the second gets as far as sending a service request. The service server never receives that request, spinning times out, and the test ends up waiting for a std::future that never comes (no pun intended).

I went into Fast-RTPS, looking for global state that could explain an interaction between consecutively run tests but I couldn't find any. Maybe eProsima folks can shed some light.

@hidmic (Contributor) commented Sep 25, 2019

@richiware maybe?

@richiware

We are backporting fixes from Fast-RTPS 1.9 to the 1.8 series. ROS 2 Dashing uses Fast-RTPS 1.8.x releases. Can you test with this branch? It will become the next 1.8.2 release. In case it still fails, I will try to reproduce the problem and take a look.

@hidmic (Contributor) commented Oct 1, 2019

Thanks for reaching out @richiware! But it seems to me you're suggesting that this is specific to Fast-RTPS 1.8.x and, by extension, to Dashing. This is also happening on Eloquent. I just noticed the issue was not clear in this regard -- it's updated now.

@hidmic (Contributor) commented Oct 7, 2019

FYI, no other flakes have happened since #876. We could close this issue, but I think it'd be good to keep tracking the underlying race, which we are simply working around for now.

@richiware

Yesterday I was able to reproduce it. I will try to figure out what's going on.

@richiware

I tried modifying the default Durability QoS for services. With TRANSIENT_LOCAL instead of VOLATILE, all tests pass.
This suggests the service request is sent before the client and server have discovered each other (at the DDS level). Once discovery completes, because the Durability QoS is VOLATILE, the request is not sent again.

@richiware

It is weird, because looking at the test's source code I found that wait_for_service() is being used. I'm not familiar with the ROS services API, but that function should wait until discovery between client and server is complete.

@hidmic (Contributor) commented Oct 9, 2019

Therefore the service's request is sent before services are discovered between them (at DDS level).

Well, to prevent this from ever happening in the first place, each test does check for service server availability before moving forward e.g.:

auto client = node->create_client<composition_interfaces::srv::LoadNode>(
  "/ComponentManager/_container/load_node");
if (!client->wait_for_service(20s)) {
  ASSERT_TRUE(false) << "service not available after waiting";
}

I can confirm that (1) it does pass that sanity check right before losing the request and (2) the entire issue goes away if enough time is spent between node tear down and bring up.


All of this seems to suggest that graph updates are driven solely by network discovery. In other words, there's no intraprocess optimization, so the graph is essentially out of date until a full discovery pass completes. Is that reasoning sound to you, @richiware?

@richiware

Sounds plausible. If the graph doesn't have time to be informed about the teardown of the service, maybe wait_for_service() is returning immediately, when it shouldn't.

@dirk-thomas dirk-thomas changed the title rclcpp_components: test_api consistently times out rclcpp_components: test_component_manager_api consistently times out Oct 9, 2019
@dirk-thomas (Member)

From a quick look at both tests in rclcpp_components I noticed that they only call rclcpp::init. How about adding the corresponding rclcpp::shutdown call?

If either of you was able to reproduce the test failure before please consider trying it again with the patch proposed in #885.

@hidmic (Contributor) commented Oct 9, 2019

@dirk-thomas SetUpTestCase() and TearDownTestCase() get called once per test case, where a test case in gtest means all the tests associated with a given ::testing::Test fixture subclass (unlike SetUp() and TearDown(), which run for each individual test). Thus, in a setting like the one we had before #876, it won't make a difference. We could init() and shutdown() before and after each test, though.

I'm concerned about the underlying issue though. If

All of this seems to suggest that graph updates are driven solely by network discovery. In other words, there's no intraprocess optimization. Thus the graph is essentially out-of-date until a full discovery is performed.

is true, it breaks wait_for_*() methods' contract (or the expectation at least).

@wjwwood (Member) commented Oct 9, 2019

I agree with @hidmic, I would be surprised if shutdown in TearDownTestCase() made a difference, but we can still try it out.

I think it's more likely to be a discovery race condition, no? Like services from previous runs are still around at the beginning of the next test (that would explain why sleeping between tests fixes it). Unless Fast-RTPS does something to "flush" things out in shutdown, I don't think even init/shutdown would fix this, but I could be wrong.

@hidmic (Contributor) commented Oct 9, 2019

Like services from previous runs are still around at the beginning of the next test (that would explain why sleeping between tests fixes it)

That's what I thought, but after browsing the Fast-RTPS code for a while I can't see how that's possible. And if that's the case, it'd be really inconvenient if there were no way to establish a "reliable" link.

@dirk-thomas (Member)

Like services from previous runs are still around at the beginning of the next test

That exact case would benefit from this patch. In repeated CI builds we run a test N times, so when the test doesn't shut down correctly and doesn't unadvertise itself, it might affect its own re-run.

@wjwwood (Member) commented Oct 9, 2019

That exact case would benefit from this patch. In the repeated CI builds we run a test N times. So when the test itself doesn't shutdown correctly and doesn't unadvertise itself it might affect the re-run of itself.

But I don't think shutdown does anything additional that destroying the node doesn't do. I agree something needs to be done, but I don't think shutdown will do anything. But again we can certainly test it.

@wjwwood (Member) commented Oct 9, 2019

In #885 (comment) @dirk-thomas said:

Without a shutdown the node won't communicate to other entities that it is going down.

Destroying the node is all that is needed. Shutdown (as far as I know) does nothing additional to signal to other nodes that something no longer exists.

@hidmic (Contributor) commented Oct 17, 2019

@richiware is Domain::removeParticipant supposed to complete the task in full, graph pruning included, before returning? If not (as I suspect), how can we make it so? That is, how can we make sure the Participant has actually been torn down and no trace of it remains?

@jacobperron (Member)

This bug sounds related to the more general issues we have with services, see ros2/ros2#922 and ros2/ros2#931.

Since the workaround (#876) seems to have fixed this particular test failure, and we have other issues tracking the issues related to services, I'm going to close this. Feel free to comment and re-open if you think this is a mistake.

nnmm pushed a commit to ApexAI/rclcpp that referenced this issue Jul 9, 2022
* Skip the test on CentOS.

Instead of trying to fix the test on CentOS, just skip it.
This relies on a file called /etc/os-release being available
(which it is on CentOS and Ubuntu).  If we see that file, we
parse it and find the ID.  If we see that the ID is "centos"
(including the quotes), we skip compiling and running the
test_node tests.

Signed-off-by: Chris Lalancette <clalancette@openrobotics.org>
DensoADAS pushed a commit to DensoADAS/rclcpp that referenced this issue Aug 5, 2022
… out (ros2#863)

Signed-off-by: Emerson Knapp <eknapp@amazon.com>