Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Fix topic creation race condition #442

Merged
merged 5 commits into from
Jul 20, 2020

Conversation

ivanpauno
Copy link
Member

We were checking if the topic was already creating and then creating it if not.
There's a TOCTTOU race condition here.

To avoid the problem, I've added a mutex for each participant, and locking it when creating a topic.

The race condition could be experienced when launching gazebo:

ros2 launch gazebo_ros gzserver.launch.py

though it was extremely hard to reproduce (I was never able to reproduce it, but @jacobperron was able to).

The failure looked like:

  • publisher failing
    [gzserver-1] terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
    [gzserver-1]   what():  could not create publisher: failed to create topic '/clock' for node namespace='/' name='gazebo', at /home/jacob/ws/ros/gazebo_ros/src/ros2/rmw_connext/rmw_connext_cpp/src/rmw_publisher.cpp:208, at /home/jacob/ws/ros/gazebo_ros/src/ros2/rcl/rcl/src/rcl/publisher.c:172
    [gzserver-1] Aborted (core dumped)
  • subscription failing
[gzserver-1] terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
[gzserver-1]   what():  could not create subscription: failed to create topic '/clock' for node namespace='/' name='gazebo', at /home/jacob/ws/ros/gazebo_ros/src/ros2/rmw_connext/rmw_connext_cpp/src/rmw_subscription.cpp:202, at /home/jacob/ws/ros/gazebo_ros/src/ros2/rcl/rcl/src/rcl/subscription.c:168

I haven't confirmed that gazebo is trying to create the publisher and subscription from different threads, but it sounds like that was the case.

Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
@ivanpauno ivanpauno added bug Something isn't working in review Waiting for review (Kanban column) labels Jul 15, 2020
@ivanpauno ivanpauno self-assigned this Jul 15, 2020
@ivanpauno
Copy link
Member Author

ivanpauno commented Jul 15, 2020

I was curious if there was a similar race condition for services and clients.
That code lives here and here, and there doesn't seem to be any race condition in that case 😃 .

Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
@ivanpauno ivanpauno closed this Jul 15, 2020
@ivanpauno ivanpauno reopened this Jul 15, 2020
@ivanpauno
Copy link
Member Author

@ros-pull-request-builder retest this please

@ivanpauno
Copy link
Member Author

  • Linux Build Status
  • macOS Build Status
  • Windows Build Status

Copy link
Member

@jacobperron jacobperron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears to have fixed the issue for me 👍

rmw_connext_cpp/src/rmw_publisher.cpp Show resolved Hide resolved
rmw_connext_cpp/src/rmw_subscription.cpp Show resolved Hide resolved
rmw_connext_shared_cpp/src/create_topic.cpp Show resolved Hide resolved
Copy link
Contributor

@clalancette clalancette left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need this same fix in the dynamic version of the code as well:

topic_description = participant->lookup_topicdescription(topic_str);
and
topic_description = participant->lookup_topicdescription(topic_str);

I know we don't really build or test it, but since we are in here fixing this we may as well do the same there.

rmw_connext_shared_cpp/src/create_topic.cpp Show resolved Hide resolved
Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
@ivanpauno
Copy link
Member Author

I think we need this same fix in the dynamic version of the code as well:

Done in 79aa945.

I really don't think we should continue updating rmw_connext_dynamic, it is extremely outdated.

IMO, if we want to support it again, it will be easier to rewrite than update it (the only piece of useful code there is how to create a DDS type code from a ROS typesupport and how to convert between ROS and DDS messages with the introspection typesupport).

@clalancette
Copy link
Contributor

I really don't think we should continue updating rmw_connext_dynamic, it is extremely outdated.

IMO, if we want to support it again, it will be easier to rewrite than update it (the only piece of useful code there is how to create a DDS type code from a ROS typesupport and how to convert between ROS and DDS messages with the introspection typesupport).

Yeah, I basically agree here. Personally I'd just remove the code from the repository altogether. If anyone wants to revive it, it would be in git history to serve as a starting point. I'll bring it up at the next meeting.

Copy link
Contributor

@clalancette clalancette left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I'll suggest letting @jacobperron take one more look at it before merging.

Copy link
Member

@jacobperron jacobperron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ivanpauno ivanpauno merged commit dc3f59a into master Jul 20, 2020
@delete-merged-branch delete-merged-branch bot deleted the ivanpauno/topic-creation-race-condition branch July 20, 2020 13:22
@jacobperron
Copy link
Member

This change (and the follow-up fix in #444) are not ABI compatible due to changes to publicly accessible structs. I'm not sure of a nice way to make them backwards compatible so I'm removing from the Foxy board.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working in review Waiting for review (Kanban column)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants