rclcpp_components: test_component_manager_api consistently times out #863
Comments
So, it's reported for Fast-RTPS and I cannot reproduce it locally using any other |
Alright, some debugging suggests it's a race. Running just two tests, the first one passes and the second gets as far as sending a service request. The service server never gets that request, spinning times out and the test ends up waiting for an I went into Fast-RTPS, looking for global state that could explain an interaction between consecutively run tests but I couldn't find any. Maybe eProsima folks can shed some light. |
@richiware maybe? |
Thanks for reaching out @richiware! But it seems to me you're suggesting that this is specific to Fast-RTPS 1.8.x and, by extension, to Dashing. This is also happening on Eloquent. I just noticed the issue was not clear in this regard -- it's updated now. |
FYI no other flakes have happened after #876. We could close this issue BUT I think it'd be good to track the underlying race which we are simply working around now. |
Yesterday I was able to reproduce it. I will try to figure out what's going on. |
I tried changing the default Durability QoS for services. With TRANSIENT_LOCAL instead of VOLATILE, all tests pass. |
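To illustrate why the durability setting could mask the problem, here is a minimal toy model of the two policies. It is not Fast-RTPS code: `Writer`, `Reader`, `match` and `samples_seen_by_late_reader` are hypothetical names, and the assumption is that the request is written before the server's reader has been matched with the client's writer.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy model of DDS durability. Hypothetical names, not Fast-RTPS code.
enum class Durability { Volatile, TransientLocal };

struct Reader {
  std::vector<std::string> received;
};

class Writer {
public:
  explicit Writer(Durability durability) : durability_(durability) {}

  // Deliver to every reader matched so far; keep a copy only if durable.
  void publish(const std::string & sample) {
    for (Reader * reader : matched_) {
      reader->received.push_back(sample);
    }
    if (durability_ == Durability::TransientLocal) {
      history_.push_back(sample);
    }
  }

  // Called when discovery matches a reader with this writer.
  // A durable writer replays its stored samples to the late joiner.
  void match(Reader & reader) {
    for (const std::string & sample : history_) {
      reader.received.push_back(sample);
    }
    matched_.push_back(&reader);
  }

private:
  Durability durability_;
  std::vector<Reader *> matched_;
  std::vector<std::string> history_;
};

// Publish one request before the reader is matched, then match it.
// Returns how many samples the late reader ends up with.
std::size_t samples_seen_by_late_reader(Durability durability) {
  Writer writer(durability);
  Reader reader;
  writer.publish("request");
  writer.match(reader);
  return reader.received.size();
}
```

In this model a volatile writer drops the early request, while a transient-local writer hands it over once the match happens. That would hide a late match rather than explain it, so the underlying race would still be there.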
It is weird because looking in the test's source code I've found |
Well, to prevent this from ever happening in the first place, each test does check for service server availability before moving forward e.g.: rclcpp/rclcpp_components/test/test_component_manager_api.cpp Lines 48 to 53 in b8f7237
I can confirm that (1) it does pass that sanity check right before losing the request and (2) the entire issue goes away if enough time is spent between node tear down and bring up. All of this seems to suggest that graph updates are driven solely by network discovery. In other words, there's no intraprocess optimization. Thus the graph is essentially out-of-date until a full discovery is performed. Is that reasoning sound to you @richiware ? |
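The hypothesis above can be sketched as a toy model. This is not rclcpp or Fast-RTPS code: `ToyGraph`, `run_discovery`, `service_is_ready`, `send_request` and the service name `toy_service` are all hypothetical, and the model simply assumes that the client's view of the graph changes only when discovery runs.

```cpp
#include <map>
#include <string>

// Toy model of a graph that is refreshed only by discovery.
// Hypothetical names, not rclcpp or Fast-RTPS code.
class ToyGraph {
public:
  // Bring up a server; each instance gets a fresh endpoint id.
  void create_server(const std::string & name) { live_[name] = next_id_++; }

  // Tear down a server; the client's view is NOT updated here.
  void destroy_server(const std::string & name) { live_.erase(name); }

  // Discovery is the only thing that refreshes the client's view.
  void run_discovery() { known_ = live_; }

  // Stand-in for the availability check each test performs.
  bool service_is_ready(const std::string & name) const {
    return known_.count(name) > 0;
  }

  // A request arrives only if the endpoint the client knows is the live one.
  bool send_request(const std::string & name) const {
    auto known = known_.find(name);
    auto live = live_.find(name);
    return known != known_.end() && live != live_.end() &&
           known->second == live->second;
  }

private:
  std::map<std::string, int> live_;   // what actually exists
  std::map<std::string, int> known_;  // what the client believes exists
  int next_id_ = 1;
};

// Two back-to-back "tests" that reuse one service name.
// Returns whether the second test's request reaches its server.
bool second_request_delivered(bool pause_after_teardown) {
  const std::string name = "toy_service";
  ToyGraph graph;

  graph.create_server(name);  // first test: bring up
  graph.run_discovery();
  graph.destroy_server(name);  // first test: tear down
  if (pause_after_teardown) {
    graph.run_discovery();  // the graph learns the old server is gone
  }

  graph.create_server(name);  // second test: bring up
  if (!graph.service_is_ready(name)) {
    graph.run_discovery();  // waiting for the service lets discovery catch up
  }
  return graph.send_request(name);
}
```

Without the pause, the stale entry from the first test satisfies the availability check, so the second test sends its request before the new server is known and the request is lost. With the pause, the check fails at first, discovery catches up, and the request is delivered.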
Sounds plausible. If the graph doesn't have time to be informed about the teardown of the service, maybe |
From a quick look at both tests in If either of you was able to reproduce the test failure before please consider trying it again with the patch proposed in #885. |
@dirk-thomas I'm concerned about the underlying issue though. If
is true, it breaks |
I agree with @hidmic, I would be surprised if shutdown in I think it's more likely to be a discovery race condition, no? Like services from previous runs are still around at the beginning of the next test (that would explain why sleeping between tests fixes it). Unless fast rtps does something to "flush" things out in shutdown, I don't think even init/shutdown would fix this, but I could be wrong. |
That's what I thought, but after browsing the Fast-RTPS code for a while I can't figure out how that's possible. And if that's the case, I think it'd be really inconvenient if there's no way to establish a "reliable" link. |
That exact case would benefit from this patch. In the repeated CI builds we run a test N times. So when the test itself doesn't shut down correctly and doesn't unadvertise itself, it might affect its own re-run. |
But I don't think shutdown does anything beyond what destroying the node already does. I agree something needs to be done, but I don't think shutdown will help. But again, we can certainly test it. |
In #885 (comment) @dirk-thomas said:
Destroying the node is all that is needed. Shutdown (as far as I know) does nothing additional to signal to other nodes that something no longer exists. |
@richiware is |
This bug sounds related to the more general issues we have with services, see ros2/ros2#922 and ros2/ros2#931. Since the workaround (#876) seems to have fixed this particular test failure, and we have other issues tracking the issues related to services, I'm going to close this. Feel free to comment and re-open if you think this is a mistake. |
* Skip the test on CentOS. Instead of trying to fix the test on CentOS, just skip it. This relies on a file called /etc/os-release being available (which it is on CentOS and Ubuntu). If we see that file, we parse it and find the ID. If we see that the ID is "centos" (including the quotes), we skip compiling and running the test_node tests. Signed-off-by: Chris Lalancette <clalancette@openrobotics.org>
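The check described in that commit message can be sketched as follows. This is only an illustration in C++ of the parsing logic, not the commit's actual implementation (which is not shown here); `os_release_id` and `should_skip_test` are hypothetical names, and the caller is assumed to have already read the contents of /etc/os-release into a string.

```cpp
#include <sstream>
#include <string>

// Return the raw value of the ID key from os-release style text,
// quotes included, or an empty string if the key is absent.
std::string os_release_id(const std::string & contents) {
  std::istringstream lines(contents);
  std::string line;
  const std::string key = "ID=";
  while (std::getline(lines, line)) {
    // Match only lines that start with "ID=", so keys such as
    // ID_LIKE= or VERSION_ID= are ignored.
    if (line.compare(0, key.size(), key) == 0) {
      return line.substr(key.size());
    }
  }
  return "";
}

// The comparison keeps the quotes, as the commit message describes.
bool should_skip_test(const std::string & contents) {
  return os_release_id(contents) == "\"centos\"";
}
```

An unquoted value such as `ID=ubuntu` does not match, so the tests would still be built and run there.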
… out (ros2#863) Signed-off-by: Emerson Knapp <eknapp@amazon.com>
Bug report
Required Info:
Steps to reproduce issue
test_component_manager_api
Expected behavior
Test does not time out
Actual behavior
Test times out
Additional information
This was originally reported as a build_cop issue: ros2/build_farmer#184
It has recently become flakier, causing issues with PR builders and CI jobs (#857). An attempted fix is in #862.