Segmentation faults and uninitialised wait sets #478
Hi @gbiggs, @rsanchez15 and I have been looking at the reported issue and have arrived at the following conclusions:

I would argue that the error manifested in […]

Finally, it makes sense that the error did not manifest before Foxy, since it was Foxy that introduced the possibility of having several nodes under the same context sharing a DDS participant.
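To illustrate that last point, here is a minimal sketch, assuming Foxy-era `rclcpp`; the node names and shutdown reason are illustrative, not code from this issue:

```cpp
// Two nodes created from the same rclcpp::Context. Since Foxy they share a
// single DDS participant underneath, so the participant's lifetime is tied
// to the context, not to any one node.
#include <rclcpp/rclcpp.hpp>

int main(int argc, char ** argv)
{
  auto context = std::make_shared<rclcpp::Context>();
  context->init(argc, argv);

  rclcpp::NodeOptions options;
  options.context(context);

  // Both nodes live in the same context and share one participant.
  auto node_a = std::make_shared<rclcpp::Node>("node_a", options);
  auto node_b = std::make_shared<rclcpp::Node>("node_b", options);

  // Destroying the nodes before shutting down their context is the safe
  // order; the reverse ordering is what this issue is about.
  node_a.reset();
  node_b.reset();
  context->shutdown("example finished");
  return 0;
}
```

Because the participant belongs to the context rather than to any single node, tearing the context down while a node is still alive pulls the participant out from under that node, which is the ordering hazard discussed below.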
@gbiggs can you share a traceback of the failure?
I agree that it sounds like a use-after-destroy issue where the context is getting destroyed while a node or wait set is still being used. A traceback would be nice, but a multi-threaded traceback (the traceback from all the threads, à la gdb's `thread apply all bt`) would be even more helpful.
Thanks everyone for your quick response! I'm attaching a stack trace (GitHub won't let me attach the core file itself), along with output from two runs:

- Compile and run with Fast DDS (collapsed log)
- Run with CycloneDDS (collapsed log)
Thanks for the information, @EduPonz. It's possible that the error is an application-level race condition, because the test in question creates a lot of threads (I get 35 in my tracebacks, and only a few are from the ROS infrastructure). I will have another go at creating a cut-down example that's easier to trace through. I'd appreciate it if @ivanpauno or @wjwwood could find time to either look for the place where the context is being destroyed before the node, or give me some pointers on how to find it, to speed up my search. Even if this is caused by the application, I think we need a better check to make sure a context is not being shut down too early. It shouldn't get to the point of segfaulting.
Yeah, that will help a lot.
Looking at the code, you can get a segfault if the context is finalized while a node created from the same context is still alive (you have to make sure that all nodes are destroyed before finalizing the context). It would be easy to modify the Fast RTPS implementation of `rmw_context_fini` to leak and return an error when not all nodes have been destroyed (that's better than generating a segfault).
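A rough, self-contained sketch of the kind of guard being proposed; the types and the `live_node_count` field are stand-ins, not the actual rmw_fastrtps internals, which track nodes differently:

```cpp
// Shape of the check: report an error (and deliberately leak) instead of
// freeing a context that still has live nodes.
#include <atomic>
#include <cstddef>
#include <cstdio>

enum RetCode { RET_OK = 0, RET_ERROR = 1 };

struct ContextImpl {
  std::atomic<std::size_t> live_node_count{0};  // hypothetical node counter
};

struct Context {
  ContextImpl * impl{nullptr};
};

RetCode context_fini(Context * context)
{
  const std::size_t live = context->impl->live_node_count.load();
  if (live > 0) {
    // Leak on purpose: a reported error is recoverable, a segfault is not.
    std::fprintf(stderr,
      "context_fini called while %zu node(s) are still alive; "
      "leaking the context instead of freeing it\n", live);
    return RET_ERROR;
  }
  delete context->impl;
  context->impl = nullptr;
  return RET_OK;
}
```

Deliberately leaking trades a bounded resource loss for a diagnosable error, which is the trade-off suggested above.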
@gbiggs and I had a chat earlier today and we talked about exactly this. We at eProsima are already working on the implementation. We actually see three different error cases to account for.

I think that having these checks in place may help with debugging the issue, since we could get a better idea of which of the three cases is causing it. @imontesino will give an update on the progress here later this week.
I'm going to close this now, as we're fairly certain that the Fast DDS-related part of the problem has been fixed by #486. The remainder is on our side, with nodes hanging around longer than their contexts.
Bug report
Required Info:
Steps to reproduce issue
Build and run the sample program from the `fastdds_segfaults` branch of the `rmf_fleet_adapter` package.

Expected behavior
The sample program completes successfully without any errors.
Actual behavior
The sample program, in most iterations after the first couple, either fails to delete a wait set or causes segmentation faults in `rmw_fastrtps_cpp` code.

Example output:
Additional information
We have traced both errors to the `node_listener` function in `listen_thread.cpp`.

For the wait set deletion failure, the error occurs when the `context` is deallocated and a new one is allocated in the same memory before the `node_listener` function returns. The function then tries to delete a wait set pointer that is null; the null pointer check in `rmw_fastrtps_shared_cpp::__rmw_destroy_wait_set` catches the null pointer and returns an error, triggering the error message.

The segmentation fault has a similar cause. The `context` is deallocated and a new one is allocated in the same memory. This time the code dereferences a member of the zero-initialised context, which is a null pointer, triggering a segmentation fault.

In both cases, we have not been able to trace where the context is being overwritten. Both errors appear to be race conditions, and as far as we can tell they occur inside the `rmw_fastrtps_cpp` code.

The sample program is a cut-down version of a test of ours that used to work with the version of Fast RTPS in Eloquent, and started failing with the shift to Fast DDS in Foxy. It starts up several threads to handle messages in ROS at the `rclcpp` level, and the test itself hammers the ROS initialisation and finalisation machinery, creating and destroying contexts constantly and rapidly.
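To make the suspected race concrete, here is a minimal sketch, assuming nothing beyond the standard library; `Context`, `wait_set`, and the listener lambda are hypothetical stand-ins for the rmw structures named above, not the actual code:

```cpp
// A listener thread keeps a raw pointer to a context that the main thread
// destroys; a replacement context may then be allocated at the same address,
// zero-initialised, and the listener dereferences a now-null member.
#include <chrono>
#include <iostream>
#include <thread>

struct Context {
  int * wait_set;  // stands in for the wait set the listener uses
};

int main()
{
  auto * ctx = new Context{new int(42)};

  // Stand-in for node_listener(): keeps using `ctx` with no way of knowing
  // whether the context is still alive.
  std::thread listener([ctx]() {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    // By now `ctx` may point at a recycled, zero-initialised context, so
    // `ctx->wait_set` is null and this dereference segfaults.
    std::cout << *ctx->wait_set << '\n';  // undefined behaviour
  });

  delete ctx->wait_set;
  delete ctx;                    // context finalized while the listener runs
  auto * fresh = new Context{};  // often reuses the freed block; wait_set is null

  listener.join();
  delete fresh;
  return 0;
}
```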