Nightly failure on aarch64 rcl.TestTwoExecutablesAfterShutdown.test_services (from rcl.test_services__rmw_cyclonedds_cpp_Release.launch_tests) #611
Comments
I have now found a repeatable failure, but only at roughly the one-in-20 level. With Docker installed there is extra console output that makes it into neither our build console logs nor the test output. However, this output also appears in the passing tests.
Looping this script, it first failed on the 22nd run in the first loop, the 56th in the second, and the 43rd in the third. First-failure iterations across all loops: 22, 56, 43, 38, 31
This shows a moderately high correlation with stress, but not always. We appear to be dealing with a race condition of some sort. Looking at the logs with timestamps, it appears that neither the client nor the server finishes.
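For anyone trying to reproduce this kind of intermittent failure, the looping approach described above can be sketched roughly as follows. This is not the script used in this thread; `run_command`, `first_failure_iteration`, and the stand-in command are hypothetical names chosen for illustration, and the real test invocation would have to be substituted.

```python
import subprocess
import sys
from typing import Callable, Optional, Sequence


def run_command(cmd: Sequence[str], timeout_s: float = 120.0) -> bool:
    """Return True if the command exits with status 0 within the timeout."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode == 0
    except subprocess.TimeoutExpired:
        # A hang counts as a failure, which matters for suspected races.
        return False


def first_failure_iteration(run_once: Callable[[], bool], max_runs: int) -> Optional[int]:
    """Call run_once up to max_runs times.

    Returns the 1-based iteration of the first failure, or None if every
    run passed.
    """
    for iteration in range(1, max_runs + 1):
        if not run_once():
            return iteration
    return None


if __name__ == "__main__":
    # Hypothetical stand-in command that always succeeds; replace it with
    # the actual test invocation being investigated.
    cmd = [sys.executable, "-c", "import sys; sys.exit(0)"]
    print(first_failure_iteration(lambda: run_command(cmd), max_runs=3))
```

Recording the iteration of the first failure across several loops gives a rough estimate of the failure rate, which is how figures like "one in 20" can be sanity-checked.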
Debugging #611 these timeouts were so long that the overall test timeout was being triggered before these internal timeouts were triggered. 1000 retries at 100ms -> 10 seconds each. I cut it down to 1 second for each to establish the connection. Signed-off-by: Tully Foote <tfoote@osrfoundation.org>
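The relationship between the retry count, the polling period, and the overall test timeout can be illustrated with a minimal sketch. It is not the rcl test fixture code; `wait_until_ready` and `worst_case_wait_ms` are hypothetical names, and the 10 tries at 100 ms shown in the comments are only one assumed way to arrive at a 1 second budget.

```python
import time
from typing import Callable


def worst_case_wait_ms(max_tries: int, period_ms: int) -> int:
    """Upper bound on time spent sleeping if the condition never holds."""
    return max_tries * period_ms


def wait_until_ready(
    is_ready: Callable[[], bool],
    max_tries: int,
    period_ms: int,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_ready up to max_tries times, sleeping period_ms after each miss.

    Returns True as soon as is_ready() holds, False once the tries run out.
    """
    for _ in range(max_tries):
        if is_ready():
            return True
        sleep(period_ms / 1000.0)
    return False


# Assumed example values: 10 tries at 100 ms gives a 1000 ms budget.
BUDGET_MS = worst_case_wait_ms(max_tries=10, period_ms=100)
```

The inner budget only produces a useful diagnostic if it is shorter than the outer test timeout; otherwise the harness kills the test first and the specific wait that failed is never reported.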
With the stricter limits in #613 I can now narrow the issue down to the client becoming ready:
This means that the call to this function is timing out: rcl/rcl/test/rcl/client_fixture.cpp Line 59 in 6ca6545
It's reproducible in CI https://ci.ros2.org/job/ci_linux-aarch64/5645/consoleFull
Possibly related: ros2/system_tests#420 I've seen issues in several places related to services, in particular clients talking to servers.
It doesn't look like the test has failed since ros2/system_tests#420 was resolved. Closing this out.
This has failed two nightlies in a row:
https://ci.ros2.org/view/nightly/job/nightly_linux-aarch64_release/1126/testReport/rcl/TestTwoExecutablesAfterShutdown/test_services/
I've been unable to reproduce this on an AWS aarch64 Bionic machine, but have been able to reproduce it in CI on Focal:
https://ci.ros2.org/job/ci_linux-aarch64/5630
Using this gist: https://gist.github.com/tfoote/dc6c33491124e863f554531b87b0492e
And reducing the build and test to rcl.
And I've validated that it also fails on Bionic on CI: https://ci.ros2.org/job/ci_linux-aarch64/5636/console