
Segmentation faults and uninitialised wait sets #478

Closed
gbiggs opened this issue Nov 4, 2020 · 9 comments
Labels: bug (Something isn't working)

@gbiggs (Member) commented Nov 4, 2020

Bug report

Required Info:

  • Operating System:
    • Ubuntu 20.04
  • Installation type:
    • Binaries, from source
  • Version or commit hash:
    • Binaries: 1.2.1-1focal.20201007.210239
    • Source: f54c74b
  • DDS implementation:
    • rmw_fastrtps_cpp
  • Client library (if applicable):
    • rclcpp

Steps to reproduce issue

  1. Check out the rmf_core repository into a workspace: https://github.com/osrf/rmf_core
  2. Switch to the fastdds_segfaults branch
  3. Compile the rmf_fleet_adapter package:
    rosdep install --from-paths src --ignore-src -yr
    colcon build --packages-up-to rmf_fleet_adapter
    
  4. Execute the small program that reliably triggers the segmentation fault
    source install/setup.bash
    ./build/rmf_fleet_adapter/segfaulter
    

Expected behavior

The sample program completes successfully without any errors.

Actual behavior

In most iterations after the first couple, the sample program either fails to delete a wait set or triggers a segmentation fault in the rmw_fastrtps_cpp code.

Example output:

$ ./build/rmf_fleet_adapter/segfaulter   
0
[INFO] [1604462249.727237249] [test_node_0]: Added a robot named [test_robot] with participant ID [0]
1
[INFO] [1604462251.764308128] [test_node_1]: Added a robot named [test_robot] with participant ID [0]
2
[INFO] [1604462253.797161121] [test_node_2]: Added a robot named [test_robot] with participant ID [0]
"/home/geoff/src/workspaces/ros2_foxy_debug/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/listener_thread.cpp":__function__:150"failed to destroy wait set": ros discovery info listener thread will shutdown ...
"/home/geoff/src/workspaces/ros2_foxy_debug/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/listener_thread.cpp":__function__:150"failed to destroy wait set": ros discovery info listener thread will shutdown ...
3
[INFO] [1604462255.833644587] [test_node_3]: Added a robot named [test_robot] with participant ID [0]
"/home/geoff/src/workspaces/ros2_foxy_debug/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/listener_thread.cpp":__function__:150"failed to destroy wait set": ros discovery info listener thread will shutdown ...
4
[INFO] [1604462257.876494584] [test_node_4]: Added a robot named [test_robot] with participant ID [0]
"/home/geoff/src/workspaces/ros2_foxy_debug/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/listener_thread.cpp":__function__:150"failed to destroy wait set": ros discovery info listener thread will shutdown ...
5
[INFO] [1604462259.924830354] [test_node_5]: Added a robot named [test_robot] with participant ID [0]
"/home/geoff/src/workspaces/ros2_foxy_debug/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/listener_thread.cpp":__function__:150"failed to destroy wait set": ros discovery info listener thread will shutdown ...
6
[INFO] [1604462261.974901964] [test_node_6]: Added a robot named [test_robot] with participant ID [0]
7
[INFO] [1604462264.032900604] [test_node_7]: Added a robot named [test_robot] with participant ID [0]
8
[INFO] [1604462266.085180435] [test_node_8]: Added a robot named [test_robot] with participant ID [0]
"/home/geoff/src/workspaces/ros2_foxy_debug/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/listener_thread.cpp":__function__:150"failed to destroy wait set": ros discovery info listener thread will shutdown ...
"/home/geoff/src/workspaces/ros2_foxy_debug/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/listener_thread.cpp":__function__:150"failed to destroy wait set": ros discovery info listener thread will shutdown ...
"/home/geoff/src/workspaces/ros2_foxy_debug/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/listener_thread.cpp":__function__:150"failed to destroy wait set": ros discovery info listener thread will shutdown ...
zsh: segmentation fault (core dumped)  ./build/rmf_fleet_adapter/segfaulter

Additional information

We have traced both errors to the node_listener function in listener_thread.cpp.

For the wait set deletion failure, the error occurs when the context is deallocated and a new one is allocated in the same memory before node_listener returns. The function then tries to delete a wait set pointer that is null; the null-pointer check in rmw_fastrtps_shared_cpp::__rmw_destroy_wait_set catches this and returns an error, producing the error message above.

The segmentation fault has a similar cause. Again the context is deallocated and a new one is allocated in the same memory, but this time the function dereferences a member of the zero-initialised context, which is a null pointer, triggering the segmentation fault.

In both cases, we have not been able to trace where the context is being overwritten. Both errors appear to be race conditions, and as far as we can tell they are occurring inside the rmw_fastrtps_cpp code.

The sample program is a cut-down version of a test we have that used to work on the version of Fast RTPS that was in Eloquent, and started failing with the shift to Fast DDS in Foxy. It starts up several threads to handle messages in ROS at the rclcpp level, and the test itself hammers the ROS initialisation and finalisation machinery, creating and destroying contexts constantly and rapidly.
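
To make the suspected sequence concrete, here is a small standalone C++ sketch of the pattern as we understand it. It is not the rmw_fastrtps code and every name in it is made up; it only shows how a listener thread holding a pointer to a context that is recycled and zero-initialised underneath it ends up dereferencing a null wait set pointer.

// Standalone illustration of the suspected race (not the actual rmw_fastrtps
// code; every name below is made up). A listener thread keeps a raw pointer
// to a context whose storage is recycled into a zero-initialised context
// before the thread has finished with it.
#include <chrono>
#include <cstdio>
#include <thread>

struct FakeWaitSet
{
  int id = 42;
};

struct FakeContext
{
  FakeWaitSet * wait_set = nullptr;  // null in a freshly zero-initialised context
};

void fake_node_listener(FakeContext * context)
{
  // Simulate the listener still waiting while the owner recycles the context.
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  if (context->wait_set == nullptr) {
    // Matches the first symptom: the destroy call sees a null wait set and
    // reports "failed to destroy wait set". Dereferencing other null members
    // of the recycled context would instead produce the segmentation fault.
    std::fprintf(stderr, "failed to destroy wait set\n");
  }
}

int main()
{
  FakeWaitSet wait_set;
  FakeContext context{&wait_set};
  std::thread listener(fake_node_listener, &context);
  // The race: the context is finalised and its memory reused for a new,
  // zero-initialised context before the listener thread returns.
  context = FakeContext{};
  listener.join();
  return 0;
}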

@EduPonz commented Nov 4, 2020

Hi @gbiggs ,

@rsanchez15 and I have been looking at the reported issue and have arrived at the following conclusions:

  1. Indeed, the context is being reset while the node_listener function is waiting on the wait set. The crash happens because the context identifier differs from the wait set identifier; when the error is printed, the context identifier is not properly initialized, which causes the invalid access.

  2. The error is not rmw_fastrtps related per se. We have run RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ./build/rmf_fleet_adapter/segfaulter with similar results. It fails in ~rmw_context_impl_t() with the following output:

    Not all nodes were finished before finishing the context
    .Ensure `rcl_node_fini` is called for all nodes before `rcl_context_fini`,to avoid leaking.
    

    This behaviour is consistent with the rmw_fastrtps_cpp one, i.e. a context is being reset while some of its nodes are still running.

  3. The call that reaches the RMW implementations seems to originate in rclcpp, more precisely in __delete_context.

I would argue that the error manifesting in rmw_fastrtps_cpp (and rmw_cyclonedds_cpp) originates from resources not being correctly managed further up the stack (either in rcl or at the application level). I'd like @ivanpauno and @wjwwood to take a look at this issue, since they are the experts in all the context interfaces. Also, it probably makes sense to move this issue to a more suitable repository once the true origin is identified.

Finally, it makes sense that the error did not manifest before Foxy, since it was Foxy that introduced the possibility of having several nodes under the same context sharing a DDS participant.
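
For reference, here is a minimal sketch of the ordering that error message asks for, at the rclcpp level, using a dedicated context rather than the global default. The node name is made up, and this is only an illustration of the expected teardown order, not the segfaulter code:

#include <memory>

#include <rclcpp/rclcpp.hpp>

int main(int argc, char ** argv)
{
  // Use an explicit context so its lifetime is visible in the code.
  auto context = std::make_shared<rclcpp::Context>();
  context->init(argc, argv);

  rclcpp::NodeOptions options;
  options.context(context);
  auto node = std::make_shared<rclcpp::Node>("lifetime_demo", options);

  // ... use the node ...

  // Teardown order matters: release every node created from this context
  // before shutting the context down. Shutting down (or finalising) the
  // context while nodes are still alive is the pattern that produces the
  // errors above.
  node.reset();
  context->shutdown("clean teardown");
  return 0;
}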

ivanpauno added the bug label Nov 4, 2020
@ivanpauno (Member) commented:
@gbiggs can you share a traceback of the failure?

@wjwwood (Member) commented Nov 4, 2020

I agree that it sounds like a use-after-destroy issue, where the context is getting destroyed while a node or wait set is still being used. A traceback would be nice, but a multi-threaded traceback (the traceback from all the threads, à la thread apply all bt) might be needed too.

@EduPonz commented Nov 5, 2020

Thanks everyone for your quick response!

I'm attaching a stack trace (GitHub won't let me attach the core file itself). I'm also attaching rmf_fleet_adapter.txt. This file is the .repos file of my workspace, obtained with vcs export --exact src > rmf_fleet_adapter.repos; I just renamed it to .txt so I could attach it. The steps I took to reproduce the issue were:

Compile and run with Fast DDS

mkdir -p ros2_foxy/src
cd ros2_foxy
wget https://github.com/ros2/rmw_fastrtps/files/5492603/rmf_fleet_adapter.txt
mv rmf_fleet_adapter.txt rmf_fleet_adapter.repos
vcs import src < rmf_fleet_adapter.repos
rosdep install --from-paths src --ignore-src -yr
colcon build --packages-up-to rmf_fleet_adapter --cmake-args "-DCMAKE_BUILD_TYPE=Debug"
source install/setup.bash
./build/rmf_fleet_adapter/segfaulter

Run with CycloneDDS

RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ./build/rmf_fleet_adapter/segfaulter

core.txt
rmf_fleet_adapter.txt

@gbiggs (Member, Author) commented Nov 10, 2020

Thanks for the information, @EduPonz. It's possible that the error is an application-level race condition, because the test in question creates a lot of threads (I get 35 in my tracebacks, and only a few are from the ROS infrastructure). I will have another go at creating a cut-down example that's easier to trace through. I'd appreciate it if @ivanpauno or @wjwwood have time to either look for the place where the context is being used before the node, or give me some pointers on how to find it to speed up my search.

Even if this is caused by the application, I think we need to have a better check to make sure a context is not being shut down too early. It shouldn't get to the point of segfaulting.

@ivanpauno (Member) commented:
I will have another go at creating a cut-down example that's easier to trace through

Yeah, that will help a lot.
I took a look at the multi-threaded traceback that @EduPonz shared, and I'm not quite sure what's going on.

I'd appreciate it if @ivanpauno or @wjwwood have time to either look for the place where the context is being used before the node, or give me some pointers on how to find it to speed up my search.

Looking at the code, you can get a segfault if the context is finalized while a node created from the same context is still alive (you have to make sure that all nodes are destroyed before finalizing the context).
So maybe there's something wrong in how we're managing the lifetimes of contexts and nodes in rclcpp.

It would be easy to modify the rmw_fastrtps implementation of rmw_context_fini to leak and fail in the case where not all nodes were destroyed (that's better than generating a segfault).
You would only need to check that this count is zero before finishing the context.
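
Something along these lines (a rough, self-contained sketch; the names are hypothetical and this is not the actual rmw_fastrtps implementation):

#include <cstddef>
#include <cstdio>

// Stand-in for the real context implementation; "node_count" plays the role
// of the count mentioned above (incremented on node creation, decremented on
// node destruction).
struct ContextImplSketch
{
  std::size_t node_count = 0;
  // ... participant, graph cache, listener thread handle, etc. ...
};

enum class FiniResult { Ok, Error };

FiniResult context_fini_sketch(ContextImplSketch * impl)
{
  if (impl->node_count != 0) {
    // Fail (and leak the remaining resources) instead of tearing down state
    // that live nodes still reference, which is what leads to the segfault.
    std::fprintf(
      stderr, "cannot finalise context: %zu node(s) still alive\n",
      impl->node_count);
    return FiniResult::Error;
  }
  // ... normal cleanup would go here ...
  return FiniResult::Ok;
}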

@EduPonz commented Nov 10, 2020

It would be easy to modify the rmw_fastrtps implementation of rmw_context_fini to leak and fail in the case where not all nodes were destroyed (that's better than generating a segfault).
You would only need to check that this count is zero before finishing the context.

@gbiggs and I had a chat earlier today and we talked about exactly this. We at eProsima are already working on the implementation. We actually see 3 different errors to account for:

  1. Calling rmw_context_fini without having shut down the context first
  2. Calling context shutdown with active nodes (this is where we'll look at the count)
  3. Creating a node on a context that has already been shut down

I think that having these checks in place may help with debugging the issue, since we would have a better idea of which of the three cases is causing it.
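
As a rough illustration of the checks we have in mind (a hypothetical, self-contained sketch, not the actual implementation):

#include <cstdio>

// Minimal stand-in for a context, tracking just enough state for the checks.
struct ContextSketch
{
  bool shut_down = false;
  int active_nodes = 0;
};

// Case 1: finalising a context that was never shut down.
bool check_fini(const ContextSketch & ctx)
{
  if (!ctx.shut_down) {
    std::fprintf(stderr, "error: context finalised without being shut down first\n");
    return false;
  }
  return true;
}

// Case 2: shutting down a context that still has active nodes (the count check).
bool check_shutdown(const ContextSketch & ctx)
{
  if (ctx.active_nodes != 0) {
    std::fprintf(stderr, "error: context shut down with %d active node(s)\n", ctx.active_nodes);
    return false;
  }
  return true;
}

// Case 3: creating a node on a context that has already been shut down.
bool check_create_node(const ContextSketch & ctx)
{
  if (ctx.shut_down) {
    std::fprintf(stderr, "error: node created on a shut-down context\n");
    return false;
  }
  return true;
}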

@imontesino will give an update on the progress here later this week.

@EduPonz commented Nov 18, 2020

Hi @gbiggs ,

We have replicated the issue using #486 and we now get a clean shutdown. The problem seems to be that contexts with active nodes are being shut down. I think this gives enough clues about where to hunt down the race condition.

@gbiggs (Member, Author) commented Dec 3, 2020

I'm going to close this now, as we're fairly certain that the Fast DDS-related part of the problem has been fixed by #486. The remainder is on our side, with nodes hanging around longer than their contexts.

gbiggs closed this as completed Dec 3, 2020