
Failing test in CI: test_rclcpp gtest_subscription__rmw_connextdds #136

clalancette opened this issue Nov 13, 2023 · 4 comments
clalancette commented Nov 13, 2023

System Info

  • OS
    • Ubuntu 22.04
  • ROS version and installation type
    • rolling/source
  • RTI Connext DDS version and installation type
    • Connext Pro 6.0.1
  • RMW version or commit hash
    • rolling branch (a6053be as of this writing)

Bug Description

If you run the gtest_subscription__rmw_connextdds test from the test_rclcpp package with no load on the machine, it "always" passes, and it "always" seems to pass in CI as well. However, if you put load on the machine, it fails very often, roughly 75% of the time.

Expected Behavior

The test always passes, even with a lot of load on the machine.

How to Reproduce

In terminal 1, put a lot of stress on the machine. In my case:

$ stress --cpu 16

In terminal 2, run the test:

$ colcon test --event-handlers console_direct+ --packages-select test_rclcpp --ctest-args -R gtest_subscription__rmw_connextdds

You may have to adjust the amount of stress on the machine, and you may have to run the test a few times, but it should fail fairly quickly.

Workarounds

Mark this test as xfail.

Additional context

This same test works fine on Fast-DDS (gtest_subscription__rmw_fastrtps_cpp) and Cyclone DDS (gtest_subscription__rmw_cyclonedds_cpp), even with load on the machine.

This has come up now because we are about to merge ros2/rclcpp#2142, which seems to exacerbate this problem. However, I can reproduce it with today's rolling packages alone, so it is not the fault of that PR.

I did some additional debugging to try to track this down. When it fails, the executor is blocked in rmw_wait, waiting for new data on a condition variable:

timedout = !this->condition.wait_for(lock, n, on_condition_active);

The failure occurs because the condition variable is never notified, so the wait times out. In turn, this suggests that the condition is never triggered.
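For reference, that wait reduces to the standard predicate-guarded wait sketched below. This is a minimal sketch, not the actual rmw_connextdds code; the mutex, flag, and parameter names are placeholders chosen to mirror the quoted line.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Sketch of the executor/rmw_wait side: block until the "condition active"
// flag becomes true or the timeout expires. wait_for returns false on
// timeout, i.e. when nothing notified the condition variable (or the
// predicate never became true) within `n`.
bool wait_for_data(
  std::mutex & mutex,
  std::condition_variable & condition,
  bool & data_available,
  std::chrono::nanoseconds n)
{
  std::unique_lock<std::mutex> lock(mutex);
  bool timedout = !condition.wait_for(lock, n, [&] {return data_available;});
  return !timedout;
}
```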

That condition variable should be triggered when new data comes in via the RMW_Connext_DataReaderListener_on_data_available callback in the subscriber. I've verified that this callback is installed on the listener.
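The listener side has to perform the matching half of that pattern, roughly as below. Again, this is only a sketch of the generic pattern, not the rmw_connextdds source; the point is that the flag must be updated under the same mutex before notifying, otherwise a wakeup can be lost.

```cpp
#include <condition_variable>
#include <mutex>

// Sketch of the on_data_available side: mark the condition active and wake
// any thread blocked in the wait shown above.
void on_data_available_sketch(
  std::mutex & mutex,
  std::condition_variable & condition,
  bool & data_available)
{
  {
    std::lock_guard<std::mutex> guard(mutex);
    data_available = true;  // set under the lock so the waiter's predicate sees it
  }
  condition.notify_one();
}
```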

Finally, I've verified that the publisher side is indeed writing the data out via DDS_DataWriter_write_untypedI.

So as far as I can tell, those are all of the pieces necessary to get this working, and it does work sometimes. But under load, it seems to fail. I could use some advice on how to debug this further.

In the meantime, I'm going to propose a PR to mark that particular test as xfail so we can make progress on the other PRs.

@clalancette

I realized that due to a recent refactoring, we were no longer marking the test as xfail. ros2/system_tests#544 should fix that.

@fujitatomoya

I did not get to the root cause, but https://github.com/ros2/rosbag2/blob/4882d30fc2d1e5b9305e5d46b8460466f9280d27/rosbag2_transport/test/rosbag2_transport/test_burst.cpp#L404-L408 looks like the same issue. Under transient workload, the subscription sometimes misses data, which leads to the same kind of test failure there.
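One common way to harden this class of test against transient load is to spin with a generous deadline until the expected message count arrives, rather than asserting after a fixed number of spin iterations. The sketch below is only an illustration; the function name, counter, and deadline are hypothetical, and the existing tests may already do a variant of this.

```cpp
#include <atomic>
#include <chrono>
#include <cstddef>

#include "rclcpp/rclcpp.hpp"

// Sketch: spin until `expected` messages have been counted by the
// subscription callback or `deadline` elapses. Under heavy load the
// deadline path is where these tests currently fail.
bool spin_until_received(
  rclcpp::Node::SharedPtr node,
  const std::atomic<size_t> & received_count,
  size_t expected,
  std::chrono::seconds deadline)
{
  rclcpp::executors::SingleThreadedExecutor executor;
  executor.add_node(node);
  const auto start = std::chrono::steady_clock::now();
  while (rclcpp::ok() && received_count < expected) {
    if (std::chrono::steady_clock::now() - start > deadline) {
      return false;  // timed out waiting for the remaining messages
    }
    executor.spin_some(std::chrono::milliseconds(100));
  }
  return received_count >= expected;
}
```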

@Crola1702

I think this issue is happening consistently on the Windows repeated nightly job, and it's also flaky on the Windows debug and RHEL repeated jobs.

It's a timeout of this test.

Reference builds:

Test regression:

Log output:

  Start 12: gtest_multithreaded__rmw_connextdds

12: Test command: C:\Python38\python.exe "-u" "C:/ci/ws/install/share/ament_cmake_test/cmake/run_test.py" "C:/ci/ws/build/test_rclcpp/test_results/test_rclcpp/gtest_multithreaded__rmw_connextdds.gtest.xml" "--package-name" "test_rclcpp" "--output-file" "C:/ci/ws/build/test_rclcpp/ament_cmake_gtest/gtest_multithreaded__rmw_connextdds.txt" "--env" "RMW_IMPLEMENTATION=rmw_connextdds" "--append-env" "PATH=C:/ci/ws/build/test_rclcpp/Release" "--command" "C:/ci/ws/build/test_rclcpp/Release/gtest_multithreaded.exe" "--gtest_output=xml:C:/ci/ws/build/test_rclcpp/test_results/test_rclcpp/gtest_multithreaded__rmw_connextdds.gtest.xml"
12: Working Directory: C:/ci/ws/build/test_rclcpp
12: Test timeout computed to be: 90
12: -- run_test.py: extra environment variables:
12:  - RMW_IMPLEMENTATION=rmw_connextdds
12: -- run_test.py: extra environment variables to append:
12:  - PATH+=C:/ci/ws/build/test_rclcpp/Release
12: -- run_test.py: invoking following command in 'C:\ci\ws\build\test_rclcpp':
12:  - C:/ci/ws/build/test_rclcpp/Release/gtest_multithreaded.exe --gtest_output=xml:C:/ci/ws/build/test_rclcpp/test_results/test_rclcpp/gtest_multithreaded__rmw_connextdds.gtest.xml
12: Running main() from C:\ci\ws\install\src\gtest_vendor\src\gtest_main.cc
12: [==========] Running 5 tests from 1 test suite.
12: [----------] Global test environment set-up.
12: [----------] 5 tests from test_multithreaded
12: [ RUN      ] test_multithreaded.multi_consumer_single_producer
    Test #12: gtest_multithreaded__rmw_connextdds .............................***Timeout  90.01 sec

Flaky ratio:

| job_name | last_fail | first_fail | build_count | failure_count | failure_percentage |
| --- | --- | --- | --- | --- | --- |
| nightly_win_rep | 2024-08-21 | 2024-08-06 | 13 | 13 | 100.0 |
| nightly_win_deb | 2024-08-20 | 2024-08-20 | 14 | 1 | 7.14 |
| nightly_linux-rhel_repeated | 2024-08-18 | 2024-08-14 | 16 | 2 | 12.5 |
| nightly_linux_coverage | 2024-08-07 | 2024-08-07 | 15 | 1 | 6.67 |

WDYT @clalancette?

@Crola1702

Reference build: https://ci.ros2.org/job/nightly_win_rep/3504/

Failing test: test_multithreaded.multi_access_publisher with ConnextDDS

Log output
[ RUN      ] test_multithreaded.multi_access_publisher
[WARN] [1728461640.564957000] [rclcpp]: logging was initialized more than once
Waited 0 microseconds for the subscriber to connect to topic 'multi_access_publisher'
Publishing message 0
Publishing message 1
Publishing message 2
Publishing message 3
Publishing message 4
Publishing message 5
Publishing message 6
Publishing message 7
Publishing message 8
Publishing message 9
Publishing message 10
Publishing message 11
Subscription callback 0
callback() 0 with message data 7
Publishing message 12
Subscription callback 1
callback() 1 with message data 8
Subscription callback 2
callback() 2 with message data 9
Publishing message 13
Subscription callback 3
callback() 3 with message data 10
Publishing message 14
Subscription callback 4
callback() 4 with message data 11
Subscription callback 5
callback() 5 with message data 12
Subscription callback 6
callback() 6 with message data 13
Publishing message 15
Subscription callback 7
callback() 7 with message data 14
Publishing message 16
Subscription callback 8
callback() 8 with message data 15
Publishing message 17
Subscription callback 9
callback() 9 with message data 16
Publishing message 18
Subscription callback 10
callback() 10 with message data 17
Subscription callback 11
callback() 11 with message data 18
Publishing message 19
Subscription callback 12
callback() 12 with message data 19
C:\ci\ws\src\ros2\system_tests\test_rclcpp\test\test_multithreaded.cpp(421): error: Expected equality of these values:
  num_messages
    Which is: 20
  subscription_counter
    Which is: 13

[  FAILED  ] test_multithreaded.multi_access_publisher (5521 ms)

I see this error is happening mostly on Windows. These started failing a lot more since testing parallelization was enabled on Feb 25th. Check:

Flaky ratio in the last 15 days:

| job_name | last_fail | first_fail | build_count | failure_count | failure_percentage |
| --- | --- | --- | --- | --- | --- |
| nightly_win_rep | 2024-10-09 | 2024-09-25 | 10 | 10 | 100.0 |
| nightly_win_deb | 2024-10-09 | 2024-09-27 | 14 | 7 | 50.0 |
| nightly_linux_repeated | 2024-10-05 | 2024-10-05 | 13 | 1 | 7.69 |
| nightly_win_rel | 2024-09-28 | 2024-09-28 | 13 | 1 | 7.69 |

Flaky ratio since parallelized testing (277 days as of Oct 09):

| job_name | last_fail | first_fail | build_count | failure_count | failure_percentage |
| --- | --- | --- | --- | --- | --- |
| nightly_win_rep | 2024-10-09 | 2024-02-25 | 229 | 167 | 72.93 |
| nightly_win_deb | 2024-10-09 | 2024-02-25 | 209 | 86 | 41.15 |
| nightly_linux_repeated | 2024-10-05 | 2024-04-09 | 246 | 7 | 2.85 |
| nightly_win_rel | 2024-09-28 | 2024-03-04 | 262 | 11 | 4.2 |
| nightly_linux-rhel_repeated | 2024-08-18 | 2024-03-15 | 265 | 7 | 2.64 |
| nightly_linux_coverage | 2024-08-07 | 2024-03-29 | 224 | 12 | 5.36 |
| nightly_linux_debug | 2024-04-07 | 2024-04-01 | 247 | 5 | 2.02 |
