Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault in hardware interface when reading robot_description from topic #1442

Closed
firesurfer opened this issue Mar 14, 2024 · 21 comments · Fixed by #1451
Closed

Segfault in hardware interface when reading robot_description from topic #1442

firesurfer opened this issue Mar 14, 2024 · 21 comments · Fixed by #1451
Labels

Comments

@firesurfer
Copy link
Contributor

Describe the bug

I have a hardware interface for some piece of internal hardware. When I have the controller_manager obtain the robot_description from the robot_state_publisher instead of passing it as a parameter I get a segfault in the read method of the hardware interface. This behavior does not occur when starting it with gdb attached!

To Reproduce

Start the ros2_control_node configured to read the robot_description from the corresponding topic.

Expected behavior

Should not crash ;)

Environment (please complete the following information):

  • OS: Ubuntu 22.04
  • Version: Iron binary install - hardware_interface version: 3.24.0

Additional context

I already call the read method once in the activate method in order to obtain the current positions of the hardware. This call succeeds. But in the read method there is quite a bit of logic as I need to workaround some quirks of our hardware.

Stacktrace (I replaced the names):

[ros2_control_node-12] #7    Source "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __clone3 [0x75c70952684f]
[ros2_control_node-12] #6    Source "./nptl/pthread_create.c", line 442, in start_thread [0x75c709494ac2]
[ros2_control_node-12] #5    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x75c7098dc252, in 
[ros2_control_node-12] #4    Object "/opt/ros/iron/lib/controller_manager/ros2_control_node", at 0x58f626c40a08, in 
[ros2_control_node-12] #3    Object "/opt/ros/iron/lib/libcontroller_manager.so", at 0x75c70a0869c2, in controller_manager::ControllerManager::read(rclcpp::Time const&, rclcpp::Duration const&)
[ros2_control_node-12] #2    Object "/opt/ros/iron/lib/libhardware_interface.so", at 0x75c709eb3942, in hardware_interface::ResourceManager::read(rclcpp::Time const&, rclcpp::Duration const&)
[ros2_control_node-12] #1    Object "/opt/ros/iron/lib/libhardware_interface.so", at 0x75c709ed524f, in hardware_interface::System::read(rclcpp::Time const&, rclcpp::Duration const&)
[ros2_control_node-12] #0    Object "Path to our hw interface plugin lib", at 0x75c6fc4a5da1, in NameOftheComponent...::read(rclcpp::Time const&, rclcpp::Duration const&)

Starting the node with GDB attached delays startup a tiny bit. So it might be somehow timing related.
In the hardware interface I access the URDF via system_info.original_xml and read some parameters such as joint limits from it.

@firesurfer firesurfer added the bug label Mar 14, 2024
@saikishor
Copy link
Member

Hello!

Earlier to these logs do you see any info about resource successfully initialized the components from the URDF from the topic before the crash?. Can you try using mock components and check if this is the case?

Thank you

@firesurfer
Copy link
Contributor Author

Hi,
I spent a bit more time on figuring out what happens. I get really inconsistent results:

  1. The described segfault - I get some log messages of the kind: [resource_manager]: Successful activate of hardware <hardware name>
  2. Sometimes I run into the error handler - the previous state is then inactive. I do not think this is due to our code in the hardware interface. As every case where I would return Failure/Error as a return code, does also log it.
  3. A successful start

I get one of these behaviors at each startup. Adding /removing additional log statements changes which behavior I get most of the time. (Passing the robot_description as a parameter always results in a successful start).
For me this looks like some kind of race condition / threading issue.

Using mocked hardware instead this issue does not exist.

I have the same hardware interface loaded 4 times btw. The error occurs in a differnt one each time.

@saikishor
Copy link
Member

Ok, based on what you said, can you add the following condition if (resource_manager_->is_urdf_already_loaded()) at the beginning of the read and write cycles of the resource_manager? and see it it happens again. So, basically until the whole URDF and components are loaded you tend to skip the read and write cycle. Let's see if this fixes your issue

@firesurfer
Copy link
Contributor Author

I guess I have to compile the resource manager myself then ? I will try that.

@firesurfer
Copy link
Contributor Author

firesurfer commented Mar 14, 2024

Could you please point me out where I have to add this exactly ? I am not too familiar with the ros2 control code base

@saikishor
Copy link
Member

Hello @firesurfer !

You can tryout this branch: https://github.com/saikishor/ros2_control/tree/robot_description/crash/iron and let us know.

@firesurfer
Copy link
Contributor Author

@saikishor Thanks for your help. I probably won't be able to try it out this week but I will definitely find some time next week.

@saikishor
Copy link
Member

Sure. No worries

@firesurfer
Copy link
Contributor Author

firesurfer commented Mar 18, 2024

@saikishor Just tried your patch.
It seems to solve the segfault issue in the read method. But in like 1 of 10 starts one of the hardware interfaces goes into its on_error error handler with the previous state: inactive which is interesting as I log at the end of the on_activate an "Activate finished" message just before I return SUCCESS

@saikishor
Copy link
Member

@firesurfer thanks for confirming it. Looking at the code last night, I think I might know the root cause of the issue. In 1 hour, I can send you another branch to test, maybe it might solve the issue

@saikishor
Copy link
Member

Hello @firesurfer,

Could you test this branch: https://github.com/saikishor/ros2_control/tree/fix/iron/missing/resources_lock and let me know if this fixes your issue. Ideally, it should as the issue might be coming from some mutexes

@firesurfer
Copy link
Contributor Author

firesurfer commented Mar 18, 2024

@saikishor Just tried your branch and it introduces a new segfault in the read method:

[ros2_control_node-12] #7    Source "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __clone3 [0x796d3cd2684f]
[ros2_control_node-12] #6    Source "./nptl/pthread_create.c", line 442, in start_thread [0x796d3cc94ac2]
[ros2_control_node-12] #5    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x796d3d0dc252, in 
[ros2_control_node-12] #4    Object "/install/controller_manager/lib/controller_manager/ros2_control_node", at 0x57819bbb53a4, in main::{lambda()#1}::operator()() const
[ros2_control_node-12] #3    Object "/install/controller_manager/lib/libcontroller_manager.so", at 0x796d3d875d32, in controller_manager::ControllerManager::read(rclcpp::Time const&, rclcpp::Duration const&)
[ros2_control_node-12] #2    Object "/install/hardware_interface/lib/libhardware_interface.so", at 0x796d3d3a71f4, in hardware_interface::ResourceManager::read(rclcpp::Time const&, rclcpp::Duration const&)
[ros2_control_node-12] #1    Object "/install/hardware_interface/lib/libhardware_interface.so", at 0x796d3d3e2d0f, in hardware_interface::System::read(rclcpp::Time const&, rclcpp::Duration const&)
[ros2_control_node-12] #0    Object "<path>/my_component.so", at 0x796d3821dfe1, in <my component>::read(rclcpp::Time const&, rclcpp::Duration const&)
[ros2_control_node-12] Segmentation fault (Address not mapped to object [0x60])

Starting it again and it worked. So the race condition / threading issue still seems to be there

EDIT: Looks a bit like it works roughly every second time ;)

@saikishor
Copy link
Member

Lol, so now it works once in every 2 runs. Right before this segmentation fault, do you have more logs on activation of the hardware components?

@firesurfer
Copy link
Contributor Author

Soo this time I had in 8 startups:

3x on_error -> Log was before the on_activate finished but on activate did print Activate finished

1x on_error > Log was after on_activate -> did print Activate finished before.

1x successful

2x Segfault -> All hardware interface activated successfully

1x on_error-> Log was after all hardware interfaces reported successful activation

Looks like I was just lucky when I said it works like every 2 runs....

@saikishor
Copy link
Member

@firesurfer I've pushed another similar commit to the same branch, At least with this the read and write cycles shouldn't run unless the hardware components are loaded and initialized, I've added some logging as well for easier reference. Can you please pull the change and check?

@firesurfer
Copy link
Contributor Author

@saikishor I pulled this commit: d98c0ee

3x on_error I see before activation of the hw interface: [ros2_control_node-12] [INFO] [1710763241.386743179] [resource_manager]: Releasing the resources lock as hardware is initialized! +

[ros2_control_node-12] [INFO] [1710763241.387091189] [resource_manager]: Successful 'configure' of hardware 'my hardware'
[ros2_control_node-12] [INFO] [1710763241.387106778] [resource_manager]: 'activate' hardware 'my hardware' 

1x Segfault after everything was successfully activated

1x on_error but print out after [ERROR] [ros2_control_node-12]: process has died (Could because one of the output streams is buffered and the other is not?!)

Interestingly it tells me something about: [ros2_control_node-12] Stack trace (most recent call last) in thread 2751259: but does not print one afterwards

1x Segfault but with complaining about empty activate / deactivate list (we have node in our system which tries to disable all our motion controllers under certain conditions )

[ros2_control_node-12] [INFO] [1710763734.794714775] [controller_manager]: Empty activate and deactivate list, not requesting switch
[ros2_control_node-12] #7 Source "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __clone3 [0x768c97b2684f]
[ros2_control_node-12] #6 Source "./nptl/pthread_create.c", line 442, in start_thread [0x768c97a94ac2]
[ros2_control_node-12] #5 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x768c97edc252, in
[ros2_control_node-12] #4 Object "install/controller_manager/lib/controller_manager/ros2_control_node", at 0x56f2f2f553a4, in main::{lambda()#1}::operator()() const
[ros2_control_node-12] #3 Object "/install/controller_manager/lib/libcontroller_manager.so", at 0x768c9871cd32, in controller_manager::ControllerManager::read(rclcpp::Time const&, rclcpp::Duration const&)
[ros2_control_node-12] #2 Object "install/hardware_interface/lib/libhardware_interface.so", at 0x768c984e71f4, in hardware_interface::ResourceManager::read(rclcpp::Time const&, rclcpp::Duration const&)
[ros2_control_node-12] #1 Object "install/hardware_interface/lib/libhardware_interface.so", at 0x768c98522fef, in hardware_interface::System::read(rclcpp::Time const&, rclcpp::Duration const&)
[ros2_control_node-12] #0 Object "install/scicore_driver/lib/libscicore_driver_ros_plugin.so", at 0x768c940c3fe1, in mycomponent::read(rclcpp::Time const&, rclcpp::Duration const&)
[ros2_control_node-12] Segmentation fault (Address not mapped to object [0x60])

@saikishor
Copy link
Member

then the problem might be coming from here

// unconfigured (loaded only)
set_components_to_state(
"hardware_components_initial_state.unconfigured",
rclcpp_lifecycle::State(
State::PRIMARY_STATE_UNCONFIGURED, hardware_interface::lifecycle_state_names::UNCONFIGURED));
// inactive (configured)
set_components_to_state(
"hardware_components_initial_state.inactive",
rclcpp_lifecycle::State(
State::PRIMARY_STATE_INACTIVE, hardware_interface::lifecycle_state_names::INACTIVE));
// activate all other components
for (const auto & [component, state] : components_to_activate)
{
rclcpp_lifecycle::State active_state(
State::PRIMARY_STATE_ACTIVE, hardware_interface::lifecycle_state_names::ACTIVE);
resource_manager_->set_component_state(component, active_state);
}

bool set_component_state(HardwareT & hardware, const rclcpp_lifecycle::State & target_state)

This is executed right after the hardware components are loaded from the URDF method in the load_urdf method.

@saikishor
Copy link
Member

@firesurfer I've pushed another commit to the same branch. Can you test it please?

@firesurfer
Copy link
Contributor Author

firesurfer commented Mar 19, 2024

Good news @saikishor . I did like 10 starts and all of them were successful :)

@saikishor
Copy link
Member

Awesome. I'm glad that the changes worked. I would then try to open a PR to have the fix upstream

@saikishor
Copy link
Member

@firesurfer If you have time, please review the changes in the above linked PR.

Thank you

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
2 participants