Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[eloquent] Ensure rcl_clock accesses are thread-safe (ABI-safe version). #1056

Closed
wants to merge 4 commits into from

Conversation

clalancette
Copy link
Contributor

This is an alternative PR to #999
The code in this one is less nice, but it is ABI-safe. Thus,
it can be used as a short-term workaround for users hurt by
the problem now, and/or used as the solution for Eloquent.

Signed-off-by: Chris Lalancette clalancette@openrobotics.org

I'm still testing this, so consider this a request for comment on the approach. Once I run a full CI on it, I'll be more confident in it.

This is an alternative PR to #999
The code in this one is less nice, but it is ABI-safe.  Thus,
it can be used as a short-term workaround for users hurt by
the problem now, and/or used as the solution for Eloquent.

Signed-off-by: Chris Lalancette <clalancette@openrobotics.org>
@clalancette
Copy link
Contributor Author

CI on Linux (all packages, all tests, Eloquent/bionic): Build Status

@dirk-thomas dirk-thomas changed the title Ensure rcl_clock accesses are thread-safe (ABI-safe version). [eloquent] Ensure rcl_clock accesses are thread-safe (ABI-safe version). Apr 9, 2020
Copy link
Collaborator

@fujitatomoya fujitatomoya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@wjwwood wjwwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, with one issue raised.

rclcpp/src/rclcpp/clock.cpp Outdated Show resolved Hide resolved
rclcpp/src/rclcpp/clock.cpp Outdated Show resolved Hide resolved
Signed-off-by: Chris Lalancette <clalancette@openrobotics.org>
@clalancette clalancette marked this pull request as ready for review April 10, 2020 15:26
rclcpp/src/rclcpp/clock.cpp Outdated Show resolved Hide resolved
Signed-off-by: Chris Lalancette <clalancette@openrobotics.org>
@clalancette
Copy link
Contributor Author

clalancette commented Apr 13, 2020

CI on Linux (all packages, all tests, Eloquent/bionic): Build Status

For comparison, a build before this PR (all packages, all tests, Eloquent/bionic): Build Status

@clalancette
Copy link
Contributor Author

@ros-pull-request-builder retest this please

Copy link
Contributor

@hidmic hidmic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM and CI's happy. I think this one's ready to go.

@clalancette
Copy link
Contributor Author

@mjcarroll correctly pointed out that https://ci.ros2.org/job/ci_linux/10124/testReport/junit/(root)/test_rclcpp/gtest_spin__rmw_connext_cpp_gtest_missing_result/ looks suspiciously like a regression from this patch. I wasn't able to reproduce locally, but I think it needs more looking into.

@clalancette
Copy link
Contributor Author

After doing a bit of debugging here, it turns out that there are some accesses to the TimerBase object that don't go through the Clock constructor, so they never get a new entry in the global mutex map assigned. I'm not sure how that can happen (the TimerBase constructor takes a rclcpp::Clock::SharedPtr), but it seems to be the cause of the crash we are seeing in CI. Needs more investigation.

Signed-off-by: Chris Lalancette <clalancette@openrobotics.org>
@clalancette
Copy link
Contributor Author

Thanks to some help from @hidmic, I think we've solved the crash. Here is another CI on Linux (all packages, all tests, Eloquent/Bionic), to be sure: Build Status

@clalancette
Copy link
Contributor Author

There are 194 failures in the Linux build. If I filter out all of the flake8 and pep257 ones, that leaves me with the following list:

    projectroot.test_publisher_subscriber__rclpy__rmw_connext_cpp__rmw_cyclonedds_cpp
    projectroot.test_publisher_subscriber__rclpy__rclcpp__rmw_connext_cpp__rmw_cyclonedds_cpp
    projectroot.test_publisher_subscriber__rclcpp__rclpy__rmw_connext_cpp__rmw_cyclonedds_cpp
    projectroot.test_publisher_subscriber__rclcpp__rmw_connext_cpp__rmw_cyclonedds_cpp
    projectroot.test_events__rmw_cyclonedds_cpp
    rcl.TestEventFixture__rmw_cyclonedds_cpp.test_pubsub_deadline_missed
    rcl.TestEventFixture__rmw_cyclonedds_cpp.test_pubsub_liveliness_kill_pub
    test_communication.TestPublisherSubscriberAfterShutdown.test_processes_finished_gracefully[WStrings]
    test_communication.TestPublisherSubscriberAfterShutdown.test_processes_finished_gracefully[WStrings]
    test_communication.TestPublisherSubscriberAfterShutdown.test_processes_finished_gracefully[WStrings]
    test_communication.TestPublisherSubscriberAfterShutdown.test_processes_finished_gracefully[WStrings]

I believe that the WString interoperability with Connext is a known issue, so we can ignore those. That leaves us with:

    rcl.TestEventFixture__rmw_cyclonedds_cpp.test_pubsub_deadline_missed
    rcl.TestEventFixture__rmw_cyclonedds_cpp.test_pubsub_liveliness_kill_pub

But if memory serves correctly, the Cyclone version in Eloquent didn't support liveliness or deadline. So overall, I think we're pretty good to go here. I'm going to run one last CI on the other platforms, then go ahead and merge this.

@clalancette
Copy link
Contributor Author

clalancette commented Jun 17, 2020

  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@clalancette
Copy link
Contributor Author

Windows "failed" because it had too many test failures. But I honestly don't know what I should expect for failures there. Here's a Windows CI run without this patch:

  • Windows Build Status

@clalancette
Copy link
Contributor Author

While we could potentially slip this into Eloquent, I think it is too risky given that Eloquent is going EOL in 4-6 weeks. I'm just going to close this out; the fix is in Foxy and later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants