
Segmentation faults and uninitialised wait sets #478

Closed
gbiggs opened this issue Nov 4, 2020 · 9 comments
Labels: bug (Something isn't working)

@gbiggs (Member) commented Nov 4, 2020

Bug report

Required Info:

  • Operating System:
    • Ubuntu 20.04
  • Installation type:
    • Binaries, from source
  • Version or commit hash:
    • Binaries: 1.2.1-1focal.20201007.210239
    • Source: f54c74b
  • DDS implementation:
    • rmw_fastrtps_cpp
  • Client library (if applicable):
    • rclcpp

Steps to reproduce issue

  1. Check out the rmf_core repository into a workspace: https://github.com/osrf/rmf_core
  2. Switch to the fastdds_segfaults branch
  3. Compile the rmf_fleet_adapter package:
    rosdep install --from-paths src --ignore-src -yr
    colcon build --packages-up-to rmf_fleet_adapter
    
  4. Execute the small program that reliably triggers the segmentation fault
    source install/setup.bash
    ./build/rmf_fleet_adapter/segfaulter
    

Expected behavior

The sample program completes successfully without any errors.

Actual behavior

In most iterations after the first couple, the sample program either fails to delete a wait set or triggers a segmentation fault in the rmw_fastrtps_cpp code.

Example output:

$ ./build/rmf_fleet_adapter/segfaulter   
0
[INFO] [1604462249.727237249] [test_node_0]: Added a robot named [test_robot] with participant ID [0]
1
[INFO] [1604462251.764308128] [test_node_1]: Added a robot named [test_robot] with participant ID [0]
2
[INFO] [1604462253.797161121] [test_node_2]: Added a robot named [test_robot] with participant ID [0]
"/home/geoff/src/workspaces/ros2_foxy_debug/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/listener_thread.cpp":__function__:150"failed to destroy wait set": ros discovery info listener thread will shutdown ...
"/home/geoff/src/workspaces/ros2_foxy_debug/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/listener_thread.cpp":__function__:150"failed to destroy wait set": ros discovery info listener thread will shutdown ...
3
[INFO] [1604462255.833644587] [test_node_3]: Added a robot named [test_robot] with participant ID [0]
"/home/geoff/src/workspaces/ros2_foxy_debug/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/listener_thread.cpp":__function__:150"failed to destroy wait set": ros discovery info listener thread will shutdown ...
4
[INFO] [1604462257.876494584] [test_node_4]: Added a robot named [test_robot] with participant ID [0]
"/home/geoff/src/workspaces/ros2_foxy_debug/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/listener_thread.cpp":__function__:150"failed to destroy wait set": ros discovery info listener thread will shutdown ...
5
[INFO] [1604462259.924830354] [test_node_5]: Added a robot named [test_robot] with participant ID [0]
"/home/geoff/src/workspaces/ros2_foxy_debug/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/listener_thread.cpp":__function__:150"failed to destroy wait set": ros discovery info listener thread will shutdown ...
6
[INFO] [1604462261.974901964] [test_node_6]: Added a robot named [test_robot] with participant ID [0]
7
[INFO] [1604462264.032900604] [test_node_7]: Added a robot named [test_robot] with participant ID [0]
8
[INFO] [1604462266.085180435] [test_node_8]: Added a robot named [test_robot] with participant ID [0]
"/home/geoff/src/workspaces/ros2_foxy_debug/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/listener_thread.cpp":__function__:150"failed to destroy wait set": ros discovery info listener thread will shutdown ...
"/home/geoff/src/workspaces/ros2_foxy_debug/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/listener_thread.cpp":__function__:150"failed to destroy wait set": ros discovery info listener thread will shutdown ...
"/home/geoff/src/workspaces/ros2_foxy_debug/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/listener_thread.cpp":__function__:150"failed to destroy wait set": ros discovery info listener thread will shutdown ...
zsh: segmentation fault (core dumped)  ./build/rmf_fleet_adapter/segfaulter

Additional information

We have traced both errors to the node_listener function in listener_thread.cpp.

For the wait set deletion failure, the error occurs when the context is deallocated and a new one is allocated in the same memory before node_listener returns. The function then tries to delete a wait set pointer that is null; the null-pointer check in rmw_fastrtps_shared_cpp::__rmw_destroy_wait_set catches this and returns an error, producing the error message above.

The segmentation fault has a similar cause. Again the context is deallocated and a new one is allocated in the same memory, but this time the function dereferences a member of the zero-initialised context, which is a null pointer, triggering the segmentation fault.

In both cases, we have not been able to trace where the context is being overwritten. Both errors appear to be race conditions, and as far as we can tell they are occurring inside the rmw_fastrtps_cpp code.

The sample program is a cut-down version of a test we have that used to work on the version of Fast RTPS that was in Eloquent, and started failing with the shift to Fast DDS in Foxy. It starts up several threads to handle messages in ROS at the rclcpp level, and the test itself hammers the ROS initialisation and finalisation machinery, creating and destroying contexts constantly and rapidly.
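
To make the suspected sequence concrete, here is a small standalone C++ sketch of the pattern as we understand it. It is not the rmw_fastrtps code and every name in it is made up; it only shows how a listener thread holding a pointer to a context that is recycled and zero-initialised underneath it ends up dereferencing a null wait set pointer.

// Standalone illustration of the suspected race (not the actual rmw_fastrtps
// code; every name below is made up). A listener thread keeps a raw pointer
// to a context whose storage is recycled into a zero-initialised context
// before the thread has finished with it.
#include <chrono>
#include <cstdio>
#include <thread>

struct FakeWaitSet
{
  int id = 42;
};

struct FakeContext
{
  FakeWaitSet * wait_set = nullptr;  // null in a freshly zero-initialised context
};

void fake_node_listener(FakeContext * context)
{
  // Simulate the listener still waiting while the owner recycles the context.
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  if (context->wait_set == nullptr) {
    // Matches the first symptom: the destroy call sees a null wait set and
    // reports "failed to destroy wait set". Dereferencing other null members
    // of the recycled context would instead produce the segmentation fault.
    std::fprintf(stderr, "failed to destroy wait set\n");
  }
}

int main()
{
  FakeWaitSet wait_set;
  FakeContext context{&wait_set};
  std::thread listener(fake_node_listener, &context);
  // The race: the context is finalised and its memory reused for a new,
  // zero-initialised context before the listener thread returns.
  context = FakeContext{};
  listener.join();
  return 0;
}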

@EduPonz commented Nov 4, 2020

Hi @gbiggs ,

@rsanchez15 and I have been looking at the reported issue and have arrived at the following conclusions:

  1. Indeed, the context is being reset while the node_listener function is waiting on the wait set. The crash happens because the context identifier differs from the wait set identifier; when the error is printed, the context identifier is not properly initialized, which causes the invalid access.

  2. The error is not rmw_fastrtps related per se. We have run RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ./build/rmf_fleet_adapter/segfaulter with similar results. It fails in ~rmw_context_impl_t() with the following output:

    Not all nodes were finished before finishing the context
    .Ensure `rcl_node_fini` is called for all nodes before `rcl_context_fini`,to avoid leaking.
    

    This behaviour is consistent with the rmw_fastrtps_cpp one, i.e. a context is being reset while some of its nodes are still running.

  3. The call that reaches the RMW implementations seems to originate in rclcpp, more precisely in __delete_context.

I would argue that the error manifesting in rmw_fastrtps_cpp (and rmw_cyclonedds_cpp) originates from resources not being correctly managed further up the stack (either in rcl or at the application level). I'd like @ivanpauno and @wjwwood to take a look at this issue, since they are the experts in all the context interfaces. Also, it probably makes sense to move this issue to a more suitable repository once the true origin is identified.

Finally, it makes sense that the error did not manifest before Foxy, since it was Foxy that introduced the possibility of having several nodes under the same context sharing a DDS participant.
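
For reference, here is a minimal sketch of the ordering that error message asks for, at the rclcpp level, using a dedicated context rather than the global default. The node name is made up, and this is only an illustration of the expected teardown order, not the segfaulter code:

#include <memory>

#include <rclcpp/rclcpp.hpp>

int main(int argc, char ** argv)
{
  // Use an explicit context so its lifetime is visible in the code.
  auto context = std::make_shared<rclcpp::Context>();
  context->init(argc, argv);

  rclcpp::NodeOptions options;
  options.context(context);
  auto node = std::make_shared<rclcpp::Node>("lifetime_demo", options);

  // ... use the node ...

  // Teardown order matters: release every node created from this context
  // before shutting the context down. Shutting down (or finalising) the
  // context while nodes are still alive is the pattern that produces the
  // errors above.
  node.reset();
  context->shutdown("clean teardown");
  return 0;
}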

ivanpauno added the bug label Nov 4, 2020
@ivanpauno (Member) commented:
@gbiggs can you share a traceback of the failure?

@wjwwood (Member) commented Nov 4, 2020

I agree that it sounds like a use-after-destroy issue, where the context is getting destroyed while a node or wait set is still being used. A traceback would be nice, but a multi-threaded traceback (the traceback from all the threads, à la thread apply all bt) might be needed too.

@EduPonz commented Nov 5, 2020

Thanks everyone for your quick response!

I'm attaching a stack trace (GitHub won't let me attach the core file itself). I'm also attaching rmf_fleet_adapter.txt. This file is the .repos file of my workspace, obtained with vcs export --exact src > rmf_fleet_adapter.repos; I just renamed it to .txt so I could attach it. The steps I took to reproduce the issue were:

Compile and run with Fast DDS

mkdir -p ros2_foxy/src
cd ros2_foxy
wget https://github.com/ros2/rmw_fastrtps/files/5492603/rmf_fleet_adapter.txt
mv rmf_fleet_adapter.txt rmf_fleet_adapter.repos
vcs import src < rmf_fleet_adapter.repos
rosdep install --from-paths src --ignore-src -yr
colcon build --packages-up-to rmf_fleet_adapter --cmake-args "-DCMAKE_BUILD_TYPE=Debug"
source install/setup.bash
./build/rmf_fleet_adapter/segfaulter

Run with CycloneDDS

RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ./build/rmf_fleet_adapter/segfaulter

core.txt
rmf_fleet_adapter.txt

@gbiggs (Member, Author) commented Nov 10, 2020

Thanks for the information, @EduPonz. It's possible that the error is an application-level race condition, because the test in question creates a lot of threads (I get 35 in my tracebacks, and only a few are from the ROS infrastructure). I will have another go at creating a cut-down example that's easier to trace through. I'd appreciate it if @ivanpauno or @wjwwood have time to either look for the place where the context is being used before the node, or give me some pointers on how to find it to speed up my search.

Even if this is caused by the application, I think we need to have a better check to make sure a context is not being shut down too early. It shouldn't get to the point of segfaulting.

@ivanpauno (Member) commented:
I will have another go at creating a cut-down example that's easier to trace through

Yeah, that will help a lot.
I took a look at the multi-threaded traceback that @EduPonz shared, and I'm not quite sure what's going on.

I'd appreciate it if @ivanpauno or @wjwwood have time to either look for the place where the context is being used before the node, or give me some pointers on how to find it to speed up my search.

Looking at the code, you can get a segfault if the context is finalized while a node created from the same context is still alive (you have to make sure that all nodes are destroyed before finalizing the context).
So maybe there's something wrong in how we're managing the lifetimes of contexts and nodes in rclcpp.

It would be easy to modify the rmw_fastrtps implementation of rmw_context_fini to leak and fail in the case where not all nodes were destroyed (that's better than generating a segfault).
You would only need to check that this count is zero before finishing the context.
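
Something along these lines (a rough, self-contained sketch; the names are hypothetical and this is not the actual rmw_fastrtps implementation):

#include <cstddef>
#include <cstdio>

// Stand-in for the real context implementation; "node_count" plays the role
// of the count mentioned above (incremented on node creation, decremented on
// node destruction).
struct ContextImplSketch
{
  std::size_t node_count = 0;
  // ... participant, graph cache, listener thread handle, etc. ...
};

enum class FiniResult { Ok, Error };

FiniResult context_fini_sketch(ContextImplSketch * impl)
{
  if (impl->node_count != 0) {
    // Fail (and leak the remaining resources) instead of tearing down state
    // that live nodes still reference, which is what leads to the segfault.
    std::fprintf(
      stderr, "cannot finalise context: %zu node(s) still alive\n",
      impl->node_count);
    return FiniResult::Error;
  }
  // ... normal cleanup would go here ...
  return FiniResult::Ok;
}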

@EduPonz commented Nov 10, 2020

It would be easy to modify the rmw_fastrtps implementation of rmw_context_fini to leak and fail in the case where not all nodes were destroyed (that's better than generating a segfault).
You would only need to check that this count is zero before finishing the context.

@gbiggs and I had a chat earlier today and we talked about exactly this. We at eProsima are already working on the implementation. We actually see 3 different errors to account for:

  1. Calling rmw_context_fini without having shut down the context first
  2. Calling context shutdown with active nodes (this is where we'll look at the count)
  3. Creating a node on a context that has already been shut down

I think that having these checks in place may help with debugging the issue, since we would have a better idea of which of the three cases is causing it.
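
As a rough illustration of the checks we have in mind (a hypothetical, self-contained sketch, not the actual implementation):

#include <cstdio>

// Minimal stand-in for a context, tracking just enough state for the checks.
struct ContextSketch
{
  bool shut_down = false;
  int active_nodes = 0;
};

// Case 1: finalising a context that was never shut down.
bool check_fini(const ContextSketch & ctx)
{
  if (!ctx.shut_down) {
    std::fprintf(stderr, "error: context finalised without being shut down first\n");
    return false;
  }
  return true;
}

// Case 2: shutting down a context that still has active nodes (the count check).
bool check_shutdown(const ContextSketch & ctx)
{
  if (ctx.active_nodes != 0) {
    std::fprintf(stderr, "error: context shut down with %d active node(s)\n", ctx.active_nodes);
    return false;
  }
  return true;
}

// Case 3: creating a node on a context that has already been shut down.
bool check_create_node(const ContextSketch & ctx)
{
  if (ctx.shut_down) {
    std::fprintf(stderr, "error: node created on a shut-down context\n");
    return false;
  }
  return true;
}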

@imontesino will give an update on the progress here later this week.

@EduPonz commented Nov 18, 2020

Hi @gbiggs ,

We have replicated the issue using #486 and we now get a clean shutdown. The problem seems to be that contexts with active nodes are being shut down. I think this gives enough clues about where to hunt down the race condition.

@gbiggs (Member, Author) commented Dec 3, 2020

I'm going to close this now, as we're fairly certain that the Fast DDS-related part of the problem has been fixed by #486. The remainder is on our side, with nodes hanging around longer than their contexts.

gbiggs closed this as completed Dec 3, 2020