malloc() : corrupted top size
error in CI on build.osrfoundation.org and build.ros2.org
#1296
Comments
I was able to reproduce the error locally and got the following backtrace:
|
I can also reproduce the failure using
|
here is a backtrace:
|
I have tried building |
I've been playing around with this as well. Here are some observations:
which reproduces about 4% of the time. The other is:
which only happened once.
|
I can confirm that it's a flaky failure, though I haven't quantified it to this extent
I see this error frequently in osrf/gazebo test failures. It is mentioned in gazebosim/gazebo-classic#2019, though I'm not sure that issue completely captures the problem. At any rate, I don't think it is a ros or gazebo_ros problem.
This sounds like an osrf/gazebo issue. I've also seen one failure with the following message:
|
I also haven't been able to reproduce the error with valgrind |
I was finally able to get a failure with asan (after about an hour of running the test). For posterity, here's what I'm seeing:
It's not in exactly the same place as the original issue, but of course ASAN is messing with all kinds of malloc-related stuff to do its work. So this may be the crash. Now I need to try to figure out what is going on with it. |
I was also able to get this to fail by building my galactic workspace from source, but remembering to build with:
So that should give me slightly better information when it does crash. |
All right, after some additional poking this morning, I think this is a problem in Gazebo, not in the ROS 2 libraries. First of all, my reproduction consists of building a Galactic workspace with With all of that in place, I was able to directly run one of the failing tests,
Letting that run usually reproduces the error very quickly. I then went into the code for test_gazebo_ros_init, and modified it to remove almost everything except for the initialization of gazebo. That is, the code I ended up with looks like:
When I rebuilt and ran that in the same loop, it took longer to reproduce the problem, but I could still reproduce it. I then went even further and removed the linking of the ROS 2 libraries while building (i.e. I removed https://github.com/ros-simulation/gazebo_ros_pkgs/blob/aditya_metrics/gazebo_ros/test/CMakeLists.txt#L62-L70), just to make sure there were no C++ initialization problems. When I rebuilt and ran with that, it still takes a while, but it does eventually fail with the So I think it is some combination of the |
OK, with some further poking, I figured out that just calling But I also realized that what this is doing is loading a shared library that does link against ROS 2. So I still don't know whether the issue is in Gazebo or ROS 2, though I've reduced the search space. I'm continuing to debug. |
yeah, I was just about to point out that I'll also point out that I've never seen this failure in the |
@scpeters just to support, I've run the test more than 1500 times with foxy, and it never failed with the malloc() error. Also, not sure if this is relevant, but Debian foxy (which never failed) is running 3.5.3 for gazebo_(dev, msgs, plugins, ros, ros_pkgs) whereas Debian galactic is running 3.5.2. Is galactic running an older version? Following is the strace output when it fails:
|
Commenting out https://github.com/ros2/rclcpp/blob/galactic/rclcpp/src/rclcpp/time_source.cpp#L129-L133 seems to fix the issue for me. However, I don't really have any idea why, so I still need to do further debugging here. |
@clalancette supporting your point that |
Going further, the problem seems to stem from this call: https://github.com/ros2/rclcpp/blob/galactic/rclcpp/src/rclcpp/time_source.cpp#L254-L257 . If I comment out just that line (and the other couple of lines that reference the |
All right, I'm fairly sure that this is unprotected access to the callback_groups_ variable from multiple threads. When I add in locks around that, the problem seems to go away (though I'm still testing). While we are in here, we should probably deprecate or remove get_callback_groups; there isn't really a thread-safe way to access it. |
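To illustrate the kind of locking described above, here is a minimal standalone sketch. It does not use the rclcpp types; `CallbackGroup`, `GroupRegistry`, `create_group`, `for_each_group` and `count_groups_after_concurrent_inserts` are hypothetical names made up for this example. The idea is that every access to the shared container takes the same mutex, and that callers get a visitor function rather than a reference to the container, which is why a plain getter is hard to make thread-safe.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a callback group; this is NOT the rclcpp type.
struct CallbackGroup {};

// Hypothetical registry that guards its container with a mutex.
class GroupRegistry
{
public:
  // Create a group and record a weak reference to it while holding the lock.
  std::shared_ptr<CallbackGroup> create_group()
  {
    auto group = std::make_shared<CallbackGroup>();
    std::lock_guard<std::mutex> lock(mutex_);
    groups_.push_back(group);
    return group;
  }

  // Visit every live group under the lock instead of returning a reference
  // to the vector, which a caller could read while another thread writes.
  void for_each_group(
    const std::function<void(const std::shared_ptr<CallbackGroup> &)> & fn) const
  {
    std::lock_guard<std::mutex> lock(mutex_);
    for (const auto & weak : groups_) {
      if (auto group = weak.lock()) {
        fn(group);
      }
    }
  }

private:
  mutable std::mutex mutex_;
  std::vector<std::weak_ptr<CallbackGroup>> groups_;
};

// Demo: several threads register groups concurrently, then the live groups
// are counted through the locked visitor.
std::size_t count_groups_after_concurrent_inserts(int num_threads, int groups_per_thread)
{
  assert(num_threads >= 0 && groups_per_thread >= 0);
  GroupRegistry registry;
  std::vector<std::shared_ptr<CallbackGroup>> keep_alive;
  std::mutex keep_alive_mutex;
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([&]() {
        for (int i = 0; i < groups_per_thread; ++i) {
          auto group = registry.create_group();
          std::lock_guard<std::mutex> lock(keep_alive_mutex);
          keep_alive.push_back(group);
        }
      });
  }
  for (auto & worker : workers) {
    worker.join();
  }
  std::size_t count = 0;
  registry.for_each_group(
    [&count](const std::shared_ptr<CallbackGroup> &) {++count;});
  return count;
}
```

Without the `lock_guard` in `create_group`, concurrent `push_back` calls on the same vector are a data race and can corrupt the heap, which would be consistent with the intermittent failure seen here. One caveat of this sketch: the visitor must not call back into `create_group`, since `std::mutex` is not recursive.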
See ros2/rclcpp#1723, which should fix this issue in Rolling. This problem doesn't directly exist in Foxy, since we don't have threaded clock handling there. It could exist in theory if some other code manipulated callback groups from multiple threads, but it is probably a fairly rare use case. Galactic does have the problem. The issue is that it is not going to be easy to backport that fix to Galactic in an ABI-preserving way. We'll have to consider it more deeply. |
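For context on why the ABI constraint makes this awkward: adding a `std::mutex` member to a class changes its size and layout, which breaks binaries compiled against the released headers. One generic workaround is to keep the mutex outside the object, in a side table keyed by the object's address. The sketch below shows that pattern only; it is an assumption for illustration, not a description of what any backport does, and `MutexTable`, `mutex_table`, `FrozenNode` and `frozen_node_demo` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical side table mapping object addresses to their mutexes.
class MutexTable
{
public:
  void create(const void * key)
  {
    std::lock_guard<std::mutex> lock(table_mutex_);
    table_.emplace(key, std::make_shared<std::mutex>());
  }

  // Returns a shared_ptr so the mutex stays valid while a caller holds it.
  std::shared_ptr<std::mutex> get(const void * key)
  {
    std::lock_guard<std::mutex> lock(table_mutex_);
    return table_.at(key);
  }

  void erase(const void * key)
  {
    std::lock_guard<std::mutex> lock(table_mutex_);
    table_.erase(key);
  }

  std::size_t size()
  {
    std::lock_guard<std::mutex> lock(table_mutex_);
    return table_.size();
  }

private:
  std::mutex table_mutex_;
  std::map<const void *, std::shared_ptr<std::mutex>> table_;
};

// Single process-wide table; in a real library this would live in a .cpp file.
inline MutexTable & mutex_table()
{
  static MutexTable table;
  return table;
}

// Hypothetical class whose data members are frozen by a released ABI.
class FrozenNode
{
public:
  FrozenNode() {mutex_table().create(this);}
  ~FrozenNode() {mutex_table().erase(this);}
  FrozenNode(const FrozenNode &) = delete;
  FrozenNode & operator=(const FrozenNode &) = delete;

  void add_item(int value)
  {
    auto mutex = mutex_table().get(this);
    std::lock_guard<std::mutex> lock(*mutex);
    items_.push_back(value);
  }

  std::size_t item_count() const
  {
    auto mutex = mutex_table().get(this);
    std::lock_guard<std::mutex> lock(*mutex);
    return items_.size();
  }

private:
  std::vector<int> items_;  // the only data member; no mutex is added here
};

// Demo: the object is protected by a mutex it does not contain.
std::size_t frozen_node_demo()
{
  FrozenNode node;
  assert(mutex_table().size() == 1);
  node.add_item(1);
  node.add_item(2);
  node.add_item(3);
  return node.item_count();
}
```

The cost of this design is an extra table lookup, and a second lock, on every guarded access, which is usually acceptable for a maintenance release but not something one would choose for new code.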
OK, ros2/rclcpp#1723 was merged for Rolling. I'm going to leave this one open until we figure out how to backport to Galactic. |
ros2/rclcpp#1741 implements the backport to galactic, and has been merged into |
The issue affects ROS Foxy and Galactic, and has been present at least since June 2, 2021. Latest occurrence: July 9, 2021.
The following test cases fail:
with the error:
malloc() : corrupted top size