Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

malloc() : corrupted top size error in CI on build.osrfoundation.org and build.ros2.org #1296

Closed
adityapande-1995 opened this issue Jul 13, 2021 · 20 comments

Comments

@adityapande-1995
Copy link

adityapande-1995 commented Jul 13, 2021

The issue affects ROS Foxy and Galactic, and has been present at least since June 2, 2021 . Latest occurrence : Jul 9th, 2021

The following test cases fail:

    gazebo_plugins.test_gazebo_ros_ackermann_drive.gtest.missing_result
    gazebo_plugins.test_gazebo_ros_diff_drive.gtest.missing_result
    gazebo_plugins.test_gazebo_ros_ft_sensor.gtest.missing_result
    gazebo_plugins.test_gazebo_ros_hand_of_god.gtest.missing_result
    gazebo_plugins.test_gazebo_ros_harness.gtest.missing_result
    gazebo_plugins.test_gazebo_ros_imu_sensor.gtest.missing_result
    gazebo_plugins.test_gazebo_ros_tricycle_drive.gtest.missing_result
    gazebo_plugins.test_gazebo_ros_vacuum_gripper.gtest.missing_result
    gazebo_ros.test_gazebo_ros_joint_effort.gtest.missing_result
    gazebo_ros.TestSimTime.TestClock

with the error : malloc() : corrupted top size

@scpeters
Copy link
Member

I was able to reproduce the error locally and got the following backtrace:

build/gazebo_ros$ gdb test/test_gazebo_ros_joint_effort test/core 
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.                                                                                                         
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> 
This is free software: you are free to change and redistribute it. 
There is NO WARRANTY, to the extent permitted by law. 
Type "show copying" and "show warranty" for details. 
This GDB was configured as "x86_64-linux-gnu". 
Type "show configuration" for configuration details. 
For bug reporting instructions, please see: 
<http://www.gnu.org/software/gdb/bugs/>. 
Find the GDB manual and other documentation resources online at: 
   <http://www.gnu.org/software/gdb/documentation/>. 

For help, type "help". 
Type "apropos word" to search for commands related to "word"... 
Reading symbols from test/test_gazebo_ros_joint_effort... 
(No debugging symbols found in test/test_gazebo_ros_joint_effort) 

warning: core file may not match specified executable file. 
[New LWP 107956] 
...
[Thread debugging using libthread_db enabled] 
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". 
--Type <RET> for more, q to quit, c to continue without paging-- 
Core was generated by `/usr/local/home/scpeter3/ws/gazebo_ros_pkgs/build/gazebo_ros/test/test_gazebo_r'. 
Program terminated with signal SIGABRT, Aborted. 
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory. 
[Current thread is 1 (Thread 0x7f8d73fff700 (LWP 107956))] 
(gdb) bt 
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 
#1  0x00007f8e6c9a8859 in __GI_abort () at abort.c:79 
#2  0x00007f8e6ca133ee in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f8e6cb3d285 "%s\n") at ../sysdeps/posix/libc_fatal.c:155 
#3  0x00007f8e6ca1b47c in malloc_printerr (str=str@entry=0x7f8e6cb3b556 "malloc(): corrupted top size") at malloc.c:5347 
#4  0x00007f8e6ca1e83a in _int_malloc (av=av@entry=0x7f8d58000020, bytes=bytes@entry=32) at malloc.c:4107 
#5  0x00007f8e6ca20419 in __GI___libc_malloc (bytes=32) at malloc.c:3066 
#6  0x00007f8e6cd89b39 in operator new(unsigned long) () at /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007f8e6e48ba7f in __gnu_cxx::new_allocator<std::weak_ptr<rclcpp::CallbackGroup> >::allocate(unsigned long, void const*)
   (this=0x7f8d73fde420, __n=2) at /usr/include/c++/9/ext/new_allocator.h:102 
#8  std::allocator_traits<std::allocator<std::weak_ptr<rclcpp::CallbackGroup> > >::allocate(std::allocator<std::weak_ptr<rclcpp::CallbackGroup> >&, unsig
ned long) (__a=..., __n=2) at /usr/include/c++/9/bits/alloc_traits.h:444                                                                                  
#9  std::_Vector_base<std::weak_ptr<rclcpp::CallbackGroup>, std::allocator<std::weak_ptr<rclcpp::CallbackGroup> > >::_M_allocate(unsigned long)
   (this=0x7f8d73fde420, __n=2) at /usr/include/c++/9/bits/stl_vector.h:343 
#10 std::_Vector_base<std::weak_ptr<rclcpp::CallbackGroup>, std::allocator<std::weak_ptr<rclcpp::CallbackGroup> > >::_M_create_storage(unsigned long)
   (__n=2, this=0x7f8d73fde420) at /usr/include/c++/9/bits/stl_vector.h:358 
#11 std::_Vector_base<std::weak_ptr<rclcpp::CallbackGroup>, std::allocator<std::weak_ptr<rclcpp::CallbackGroup> > >::_Vector_base(unsigned long, std::all
ocator<std::weak_ptr<rclcpp::CallbackGroup> > const&) (__a=..., __n=2, this=0x7f8d73fde420) at /usr/include/c++/9/bits/stl_vector.h:302                   
#12 std::vector<std::weak_ptr<rclcpp::CallbackGroup>, std::allocator<std::weak_ptr<rclcpp::CallbackGroup> > >::vector(std::vector<std::weak_ptr<rclcpp::C
allbackGroup>, std::allocator<std::weak_ptr<rclcpp::CallbackGroup> > > const&) (__x=std::vector of length 2, capacity 2 = {...}, this=0x7f8d73fde420)     
   at /usr/include/c++/9/bits/stl_vector.h:552 
#13 rclcpp::Executor::add_callback_groups_from_nodes_associated_to_executor() (this=0x7f8e2f3570d0) 
   at /home/jmperron/ws/ci/ws/src/ros2/rclcpp/rclcpp/src/rclcpp/executor.cpp:193 
#14 0x00007f8e6e48e457 in rclcpp::Executor::wait_for_work(std::chrono::duration<long, std::ratio<1l, 1000000000l> >) (this=0x7f8e2f3570d0, timeout=...)
   at /home/jmperron/ws/ci/ws/src/ros2/rclcpp/rclcpp/src/rclcpp/executor.cpp:688 
#15 0x00007f8e6e48ee03 in rclcpp::Executor::get_next_executable(rclcpp::AnyExecutable&, std::chrono::duration<long, std::ratio<1l, 1000000000l> >)
   (this=0x7f8e2f3570d0, any_executable=..., timeout=...) at /home/jmperron/ws/ci/ws/src/ros2/rclcpp/rclcpp/src/rclcpp/executor.cpp:904 
#16 0x00007f8e6e4919a8 in rclcpp::executors::MultiThreadedExecutor::run(unsigned long) (this=0x7f8e2f3570d0) 
   at /home/jmperron/ws/ci/ws/src/ros2/rclcpp/rclcpp/src/rclcpp/executors/multi_threaded_executor.cpp:86 
#17 0x00007f8e6cdb5de4 in  () at /lib/x86_64-linux-gnu/libstdc++.so.6
#18 0x00007f8e6d1a0609 in start_thread (arg=<optimized out>) at pthread_create.c:477 
#19 0x00007f8e6caa5293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 
(gdb)

@scpeters
Copy link
Member

I can also reproduce the failure using galactic debs:

ii  ros-galactic-rclcpp                                 9.1.0-1focal.20210521.003406          amd64        The ROS client library in C++.

@scpeters
Copy link
Member

I can also reproduce the failure using galactic debs:

ii  ros-galactic-rclcpp                                 9.1.0-1focal.20210521.003406          amd64        The ROS client library in C++.

here is a backtrace:

malloc(): corrupted top size
--Type <RET> for more, q to quit, c to continue without paging--c

Thread 39 "test_gazebo_ros" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fff4bfff700 (LWP 1815)]
0x00007ffff644818b in raise () from /usr/lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007ffff644818b in raise () at /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ffff6427859 in abort () at /usr/lib/x86_64-linux-gnu/libc.so.6
#2  0x00007ffff64923ee in  () at /usr/lib/x86_64-linux-gnu/libc.so.6
#3  0x00007ffff649a47c in  () at /usr/lib/x86_64-linux-gnu/libc.so.6
#4  0x00007ffff649d83a in  () at /usr/lib/x86_64-linux-gnu/libc.so.6
#5  0x00007ffff649f419 in malloc () at /usr/lib/x86_64-linux-gnu/libc.so.6
#6  0x00007ffff6808b39 in operator new(unsigned long) () at /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007ffff7eb2953 in __gnu_cxx::new_allocator<std::shared_ptr<rcl_service_t const> >::allocate(unsigned long, void const*) (this=<optimized out>, __n=<optimized out>)
    at /usr/include/c++/9/ext/atomicity.h:69
#8  std::allocator_traits<std::allocator<std::shared_ptr<rcl_service_t const> > >::allocate(std::allocator<std::shared_ptr<rcl_service_t const> >&, unsigned long) (__a=..., __n=<optimized out>)
    at /usr/include/c++/9/bits/alloc_traits.h:444
#9  std::_Vector_base<std::shared_ptr<rcl_service_t const>, std::allocator<std::shared_ptr<rcl_service_t const> > >::_M_allocate(unsigned long) (this=<optimized out>, __n=<optimized out>)
    at /usr/include/c++/9/bits/stl_vector.h:343
#10 std::vector<std::shared_ptr<rcl_service_t const>, std::allocator<std::shared_ptr<rcl_service_t const> > >::_M_realloc_insert<std::shared_ptr<rcl_service_t const> >(__gnu_cxx::__normal_iterator<std::shared_ptr<rcl_service_t const>*, std::vector<std::shared_ptr<rcl_service_t const>, std::allocator<std::shared_ptr<rcl_service_t const> > > >, std::shared_ptr<rcl_service_t const>&&)
    (__position=<error reading variable: Cannot access memory at address 0x5d>, this=0x7fffd97973d8) at /usr/include/c++/9/bits/vector.tcc:440
#11 std::vector<std::shared_ptr<rcl_service_t const>, std::allocator<std::shared_ptr<rcl_service_t const> > >::emplace_back<std::shared_ptr<rcl_service_t const> >(std::shared_ptr<rcl_service_t const>&&)
    (this=0x7fffd97973d8) at /usr/include/c++/9/bits/vector.tcc:121
#12 std::vector<std::shared_ptr<rcl_service_t const>, std::allocator<std::shared_ptr<rcl_service_t const> > >::push_back(std::shared_ptr<rcl_service_t const>&&) (__x=..., this=0x7fffd97973d8)
    at /usr/include/c++/9/bits/stl_vector.h:1201
#13 rclcpp::memory_strategies::allocator_memory_strategy::AllocatorMemoryStrategy<std::allocator<void> >::collect_entities(std::map<std::weak_ptr<rclcpp::CallbackGroup>, std::weak_ptr<rclcpp::node_interfaces::NodeBaseInterface>, std::owner_less<std::weak_ptr<rclcpp::CallbackGroup> >, std::allocator<std::pair<std::weak_ptr<rclcpp::CallbackGroup> const, std::weak_ptr<rclcpp::node_interfaces::NodeBaseInterface> > > > const&)::{lambda(std::shared_ptr<rclcpp::ServiceBase> const&)#2}::operator()(std::shared_ptr<rclcpp::ServiceBase> const&) constPython Exception <class 'gdb.error'> value has been optimized out: 
 (service=<synthetic pointer>, this=<synthetic pointer>)
    at ./include/rclcpp/strategies/allocator_memory_strategy.hpp:173
#14 rclcpp::CallbackGroup::_find_ptrs_if_impl<rclcpp::ServiceBase, rclcpp::memory_strategies::allocator_memory_strategy::AllocatorMemoryStrategy<std::allocator<void> >::collect_entities(std::map<std::weak_ptr<rclcpp::CallbackGroup>, std::weak_ptr<rclcpp::node_interfaces::NodeBaseInterface>, std::owner_less<std::weak_ptr<rclcpp::CallbackGroup> >, std::allocator<std::pair<std::weak_ptr<rclcpp::CallbackGroup> const, std::weak_ptr<rclcpp::node_interfaces::NodeBaseInterface> > > > const&)::{lambda(std::shared_ptr<rclcpp::ServiceBase> const&)#2}>(rclcpp::memory_strategies::allocator_memory_strategy::AllocatorMemoryStrategy<std::allocator<void> >::collect_entities(std::map<std::weak_ptr<rclcpp::CallbackGroup>, std::weak_ptr<rclcpp::node_interfaces::NodeBaseInterface>, std::owner_less<std::weak_ptr<rclcpp::CallbackGroup> >, std::allocator<std::pair<std::weak_ptr<rclcpp::CallbackGroup> const, std::weak_ptr<rclcpp::node_interfaces::NodeBaseInterface> > > > const&)::{lambda(std::shared_ptr<rclcpp::ServiceBase> const&)#2}, std::vector<rclcpp::memory_strategies::allocator_memory_strategy::AllocatorMemoryStrategy<std::allocator<void> >::collect_entities(std::map<std::weak_ptr<rclcpp::CallbackGroup>, std::weak_ptr<rclcpp::node_interfaces::NodeBaseInterface>, std::owner_less<std::weak_ptr<rclcpp::CallbackGroup> >, std::allocator<std::pair<std::weak_ptr<rclcpp::CallbackGroup> const, std::weak_ptr<rclcpp::node_interfaces::NodeBaseInterface> > > > const&)::{lambda(std::shared_ptr<rclcpp::ServiceBase> const&)#2}::WeakPtr, std::allocator<std::vector> > const&) const (vect_ptrs=..., func=..., this=<optimized out>) at ./include/rclcpp/callback_group.hpp:217
#15 rclcpp::CallbackGroup::find_service_ptrs_if<rclcpp::memory_strategies::allocator_memory_strategy::AllocatorMemoryStrategy<std::allocator<void> >::collect_entities(std::map<std::weak_ptr<rclcpp::CallbackGroup>, std::weak_ptr<rclcpp::node_interfaces::NodeBaseInterface>, std::owner_less<std::weak_ptr<rclcpp::CallbackGroup> >, std::allocator<std::pair<std::weak_ptr<rclcpp::CallbackGroup> const, std::weak_ptr<rclcpp::node_interfaces::NodeBaseInterface> > > > const&)::{lambda(std::shared_ptr<rclcpp::ServiceBase> const&)#2}>(rclcpp::memory_strategies::allocator_memory_strategy::AllocatorMemoryStrategy<std::allocator<void> >::collect_entities(std::map<std::weak_ptr<rclcpp::CallbackGroup>, std::weak_ptr<rclcpp::node_interfaces::NodeBaseInterface>, std::owner_less<std::weak_ptr<rclcpp::CallbackGroup> >, std::allocator<std::pair<std::weak_ptr<rclcpp::CallbackGroup> const, std::weak_ptr<rclcpp::node_interfaces::NodeBaseInterface> > > > const&)::{lambda(std::shared_ptr<rclcpp::ServiceBase> const&)#2}) const
--Type <RET> for more, q to quit, c to continue without paging--c
    (func=..., this=<optimized out>) at ./include/rclcpp/callback_group.hpp:116
#16 rclcpp::memory_strategies::allocator_memory_strategy::AllocatorMemoryStrategy<std::allocator<void> >::collect_entities(std::map<std::weak_ptr<rclcpp::CallbackGroup>, std::weak_ptr<rclcpp::node_interfaces::NodeBaseInterface>, std::owner_less<std::weak_ptr<rclcpp::CallbackGroup> >, std::allocator<std::pair<std::weak_ptr<rclcpp::CallbackGroup> const, std::weak_ptr<rclcpp::node_interfaces::NodeBaseInterface> > > > const&) (this=0x7fffd97973a0, weak_groups_to_nodes=Python Exception <class 'AttributeError'> 'NoneType' object has no attribute 'pointer': 
std::map with 1 element) at ./include/rclcpp/strategies/allocator_memory_strategy.hpp:171
#17 0x00007ffff7e9ac2b in rclcpp::Executor::wait_for_work(std::chrono::duration<long, std::ratio<1l, 1000000000l> >) (this=0x7fffd9bbef40, timeout=...) at /usr/include/c++/9/bits/shared_ptr_base.h:1020
#18 0x00007ffff7e9b5b3 in rclcpp::Executor::get_next_executable(rclcpp::AnyExecutable&, std::chrono::duration<long, std::ratio<1l, 1000000000l> >) (this=0x7fffd9bbef40, any_executable=..., timeout=...) at ./src/rclcpp/executor.cpp:903
#19 0x00007ffff7e9e0b8 in rclcpp::executors::MultiThreadedExecutor::run(unsigned long) (this=0x7fffd9bbef40) at ./src/rclcpp/executors/multi_threaded_executor.cpp:86
#20 0x00007ffff6834de4 in  () at /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#21 0x00007ffff6c25609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#22 0x00007ffff6524293 in clone () at /usr/lib/x86_64-linux-gnu/libc.so.6
(gdb) 

@scpeters
Copy link
Member

I have tried building rclcpp from source, using the 9.1.0 tag, which should be the same version as the Debian packages, but I have not been able to observe the failure in this case

@clalancette
Copy link
Contributor

I've been playing around with this as well. Here are some observations:

  1. Like @scpeters, I can reproduce this failure by building and running tests of gazebo_ros against debian-installed Galactic.
  2. Also like @scpeters, I could not reproduce while building and running from Galactic source, not even if I put the entire workspace to the exact versions that are released into the debians.
  3. Even when building against the debians, I can only reproduce the failure sometimes. My limited tests so far show that I can reproduce about 40% of the time. Adding CPU load to the system seems to actually decrease the failure rate in my case, down to about 2%.
  4. Besides the malloc(): corrupted top size issue, I've seen two other issues while trying to reproduce the test. One is:
boost::wrapexcept<boost::thread_resource_error>'
  what():  boost thread: trying joining itself: Resource deadlock avoided

which reproduces about 4% of the time. The other is:

[Err] [Connection.hh:211] AsyncRead on a closed socket

which only happened once.

  1. I have not yet been successful in reproducing the error with valgrind or asan yet. In this particular case, I think reproducing under one of these is preferable to a stack trace or core dump, since the problem likely stems from an earlier double-free (or something like that), and hence the place where it fails is the victim, not the cause. I'm going to continue to try with asan here for a while.

@scpeters
Copy link
Member

  1. Even when building against the debians, I can only reproduce the failure sometimes. My limited tests so far show that I can reproduce about 40% of the time. Adding CPU load to the system seems to actually decrease the failure rate in my case, down to about 2%.

I can confirm that it's a flaky failure, though I haven't quantified it to this extent

  1. Besides the malloc(): corrupted top size issue, I've seen two other issues while trying to reproduce the test. One is:
boost::wrapexcept<boost::thread_resource_error>'
  what():  boost thread: trying joining itself: Resource deadlock avoided

which reproduces about 4% of the time.

I see this error frequently in osrf/gazebo test failures. It is mentioned in gazebosim/gazebo-classic#2019, though I'm not sure that issue complete captures the issue. At any rate, I don't think it is a ros or gazebo_ros problem

The other is:

[Err] [Connection.hh:211] AsyncRead on a closed socket

which only happened once.

this sounds like an osrf/gazebo issue

I've also seen one failure with the following message:

free(): invalid next size (fast)

@scpeters
Copy link
Member

I have not yet been successful in reproducing the error with valgrind

I also haven't been able to reproduce the error with valgrind

@clalancette
Copy link
Contributor

I was finally able to get a failure with asan (after about an hour of running the test). For posterity, here's what I'm seeing:

Gazebo multi-robot simulator, version 11.5.1
Copyright (C) 2012 Open Source Robotics Foundation.
Released under the Apache 2 License.
http://gazebosim.org

[Msg] Waiting for master.
[Msg] Connected to gazebo master @ http://127.0.0.1:11345
[Msg] Publicized address: 192.168.XX.XX
AddressSanitizer:DEADLYSIGNAL
=================================================================
==952759==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x7fdfbc95f270 bp 0x60e000638dc0 sp 0x7fdf9529f370 T21)
==952759==The signal is caused by a READ memory access.
==952759==Hint: address points to the zero page.
    #0 0x7fdfbc95f26f in std::vector<std::weak_ptr<rclcpp::CallbackGroup>, std::allocator<std::weak_ptr<rclcpp::CallbackGroup> > >::vector(std::vector<std::weak_ptr<rclcpp::CallbackGroup>, std::allocator<std::weak_ptr<rclcpp::CallbackGroup> > > const&) /usr/include/c++/9/bits/stl_uninitialized.h:79
    #1 0x7fdfbc95f26f in rclcpp::Executor::add_callback_groups_from_nodes_associated_to_executor() src/rclcpp/executor.cpp:193
    #2 0x7fdfbc961c06 in rclcpp::Executor::wait_for_work(std::chrono::duration<long, std::ratio<1l, 1000000000l> >) src/rclcpp/executor.cpp:687
    #3 0x7fdfbc9625b2 in rclcpp::Executor::get_next_executable(rclcpp::AnyExecutable&, std::chrono::duration<long, std::ratio<1l, 1000000000l> >) src/rclcpp/executor.cpp:903
    #4 0x7fdfbc9650b7 in rclcpp::executors::MultiThreadedExecutor::run(unsigned long) src/rclcpp/executors/multi_threaded_executor.cpp:86
    #5 0x7fdfbb2f5de3  (/lib/x86_64-linux-gnu/libstdc++.so.6+0xd6de3)
    #6 0x7fdfbb6e6608 in start_thread /build/glibc-YbNSs7/glibc-2.31/nptl/pthread_create.c:477
    #7 0x7fdfbafe5292 in __clone (/lib/x86_64-linux-gnu/libc.so.6+0x122292)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /usr/include/c++/9/bits/stl_uninitialized.h:79 in std::vector<std::weak_ptr<rclcpp::CallbackGroup>, std::allocator<std::weak_ptr<rclcpp::CallbackGroup> > >::vector(std::vector<std::weak_ptr<rclcpp::CallbackGroup>, std::allocator<std::weak_ptr<rclcpp::CallbackGroup> > > const&)
Thread T21 created by T14 here:
    #0 0x7fdfbcab7805 in pthread_create (/lib/x86_64-linux-gnu/libasan.so.5+0x3a805)
    #1 0x7fdfbb2f60a8 in std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) (/lib/x86_64-linux-gnu/libstdc++.so.6+0xd70a8)

Thread T14 created by T2 here:
    #0 0x7fdfbcab7805 in pthread_create (/lib/x86_64-linux-gnu/libasan.so.5+0x3a805)
    #1 0x7fdfbb2f60a8 in std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) (/lib/x86_64-linux-gnu/libstdc++.so.6+0xd70a8)
    #2 0x7fdf9edb9acd in gazebo_ros::Executor::Executor() (/home/ubuntu/gazebo_ros_ws/install/gazebo_ros/lib/libgazebo_ros_node.so+0x100acd)
    #3 0x7fdf9ee485b2 in void __gnu_cxx::new_allocator<gazebo_ros::Executor>::construct<gazebo_ros::Executor>(gazebo_ros::Executor*) (/home/ubuntu/gazebo_ros_ws/install/gazebo_ros/lib/libgazebo_ros_node.so+0x18f5b2)
    #4 0x7fdf9ee480c2 in void std::allocator_traits<std::allocator<gazebo_ros::Executor> >::construct<gazebo_ros::Executor>(std::allocator<gazebo_ros::Executor>&, gazebo_ros::Executor*) (/home/ubuntu/gazebo_ros_ws/install/gazebo_ros/lib/libgazebo_ros_node.so+0x18f0c2)
    #5 0x7fdf9ee476c9 in std::_Sp_counted_ptr_inplace<gazebo_ros::Executor, std::allocator<gazebo_ros::Executor>, (__gnu_cxx::_Lock_policy)2>::_Sp_counted_ptr_inplace<>(std::allocator<gazebo_ros::Executor>) (/home/ubuntu/gazebo_ros_ws/install/gazebo_ros/lib/libgazebo_ros_node.so+0x18e6c9)
    #6 0x7fdf9ee45639 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<gazebo_ros::Executor, std::allocator<gazebo_ros::Executor>>(gazebo_ros::Executor*&, std::_Sp_alloc_shared_tag<std::allocator<gazebo_ros::Executor> >) (/home/ubuntu/gazebo_ros_ws/install/gazebo_ros/lib/libgazebo_ros_node.so+0x18c639)
    #7 0x7fdf9ee43213 in std::__shared_ptr<gazebo_ros::Executor, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<gazebo_ros::Executor>>(std::_Sp_alloc_shared_tag<std::allocator<gazebo_ros::Executor> >) (/home/ubuntu/gazebo_ros_ws/install/gazebo_ros/lib/libgazebo_ros_node.so+0x18a213)
    #8 0x7fdf9ee3f63a in std::shared_ptr<gazebo_ros::Executor>::shared_ptr<std::allocator<gazebo_ros::Executor>>(std::_Sp_alloc_shared_tag<std::allocator<gazebo_ros::Executor> >) (/home/ubuntu/gazebo_ros_ws/install/gazebo_ros/lib/libgazebo_ros_node.so+0x18663a)
    #9 0x7fdf9ee3768b in std::shared_ptr<gazebo_ros::Executor> std::allocate_shared<gazebo_ros::Executor, std::allocator<gazebo_ros::Executor>>(std::allocator<gazebo_ros::Executor> const&) (/home/ubuntu/gazebo_ros_ws/install/gazebo_ros/lib/libgazebo_ros_node.so+0x17e68b)
    #10 0x7fdf9ee32629 in std::shared_ptr<gazebo_ros::Executor> std::make_shared<gazebo_ros::Executor>() (/home/ubuntu/gazebo_ros_ws/install/gazebo_ros/lib/libgazebo_ros_node.so+0x179629)
    #11 0x7fdf9ee2daf3 in std::shared_ptr<gazebo_ros::Node> gazebo_ros::Node::CreateWithArgs<char const (&) [7], char const (&) [1], rclcpp::NodeOptions&>(char const (&) [7], char const (&) [1], rclcpp::NodeOptions&) (/home/ubuntu/gazebo_ros_ws/install/gazebo_ros/lib/libgazebo_ros_node.so+0x174af3)
    #12 0x7fdf9ee243ea in gazebo_ros::Node::Get() (/home/ubuntu/gazebo_ros_ws/install/gazebo_ros/lib/libgazebo_ros_node.so+0x16b3ea)
    #13 0x7fdf9f06cc5a in gazebo_ros::GazeboRosInit::Load(int, char**) (/home/ubuntu/gazebo_ros_ws/install/gazebo_ros/lib/libgazebo_ros_init.so+0x17bc5a)
    #14 0x7fdfbc83306b  (/lib/x86_64-linux-gnu/libgazebo.so.11+0x6b06b)

Thread T2 created by T0 here:
    #0 0x7fdfbcab7805 in pthread_create (/lib/x86_64-linux-gnu/libasan.so.5+0x3a805)
    #1 0x7fdfbb4ce441 in boost::thread::start_thread_noexcept() (/lib/x86_64-linux-gnu/libboost_thread.so.1.71.0+0x10441)
    #2 0x560ebe52bbaa in GazeboRosInitTest_Commands_Test::TestBody() (/home/ubuntu/gazebo_ros_ws/build/gazebo_ros/test/test_gazebo_ros_init+0x4dbaa)
    #3 0x560ebe638086 in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (/home/ubuntu/gazebo_ros_ws/build/gazebo_ros/test/test_gazebo_ros_init+0x15a086)
    #4 0x560ebe628e41 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (/home/ubuntu/gazebo_ros_ws/build/gazebo_ros/test/test_gazebo_ros_init+0x14ae41)
    #5 0x560ebe5d2549 in testing::Test::Run() (/home/ubuntu/gazebo_ros_ws/build/gazebo_ros/test/test_gazebo_ros_init+0xf4549)
    #6 0x560ebe5d39d0 in testing::TestInfo::Run() (/home/ubuntu/gazebo_ros_ws/build/gazebo_ros/test/test_gazebo_ros_init+0xf59d0)
    #7 0x560ebe5d4705 in testing::TestSuite::Run() (/home/ubuntu/gazebo_ros_ws/build/gazebo_ros/test/test_gazebo_ros_init+0xf6705)
    #8 0x560ebe5f11e1 in testing::internal::UnitTestImpl::RunAllTests() (/home/ubuntu/gazebo_ros_ws/build/gazebo_ros/test/test_gazebo_ros_init+0x1131e1)
    #9 0x560ebe63b731 in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (/home/ubuntu/gazebo_ros_ws/build/gazebo_ros/test/test_gazebo_ros_init+0x15d731)
    #10 0x560ebe62b97c in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (/home/ubuntu/gazebo_ros_ws/build/gazebo_ros/test/test_gazebo_ros_init+0x14d97c)
    #11 0x560ebe5edc75 in testing::UnitTest::Run() (/home/ubuntu/gazebo_ros_ws/build/gazebo_ros/test/test_gazebo_ros_init+0x10fc75)
    #12 0x560ebe53b87f in RUN_ALL_TESTS() (/home/ubuntu/gazebo_ros_ws/build/gazebo_ros/test/test_gazebo_ros_init+0x5d87f)
    #13 0x560ebe5342c9 in main (/home/ubuntu/gazebo_ros_ws/build/gazebo_ros/test/test_gazebo_ros_init+0x562c9)
    #14 0x7fdfbaeea0b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x270b2)

==952759==ABORTING

It's not in exactly the same place as the original issue, but of course ASAN is messing with all kinds of malloc-related stuff to do its work. So this may be the crash. Now I need to try to figure out what is going on with it.

@clalancette
Copy link
Contributor

I was also able to get this to fail by building my galactic workspace from source, but remembering to build with:

colcon build --cmake-args -DCMAKE_BUILD_TYPE=RelWithDebInfo

So that should give me a slightly better information when it does crash.

@clalancette
Copy link
Contributor

All right, after some additional poking this morning, I think this is a problem in Gazebo, not in the ROS 2 libraries.

First of all, my reproduction consists of building a Galactic workspace with colcon build --cmake-args -DCMAKE_BUILD_TYPE=RelWithDebInfo as above. With that in place, I then have an overlay workspace where I checked out https://github.com/ros-simulation/gazebo_ros_pkgs to the "aditya_metrics" branch (this is to reproduce the exact failure from CI). I then built this overlay workspace with colcon build --packages-up-to gazebo_ros.

With all of that in place, I was able to directly run one of the failing tests, test_gazebo_ros_init. I did that with the following:

cd gazebo_ros_ws/build/gazebo_ros/test
while true ; do /home/ubuntu/gazebo_ros_ws/build/gazebo_ros/test/test_gazebo_ros_init ; if [ $? -ne 0 ] ; then break ; fi ; done

Letting that run usually reproduces the error very quickly.

I then went into the code for test_gazebo_ros_init, and modified it to remove almost everything except for the initialization of gazebo. That is, the code I ended up with looks like:

#include <gazebo/test/ServerFixture.hh>

class GazeboRosInitTest : public gazebo::ServerFixture
{
};

TEST_F(GazeboRosInitTest, Commands)
{
  this->LoadArgs("-u --verbose -s libgazebo_ros_init.so worlds/free_fall.world");

  auto world = gazebo::physics::get_world();
  ASSERT_NE(nullptr, world);

  auto model = world->ModelByName("box");
  ASSERT_NE(nullptr, model);
}

int main(int argc, char ** argv)
{
  ::testing::InitGoogleTest(&argc, argv);
  return RUN_ALL_TESTS();
}

When I rebuilt and ran that in the same loop, it took longer to reproduce the problem, but I could still reproduce.

I then went even further and removed the linking of the ROS 2 libraries while building (i.e. I removed https://github.com/ros-simulation/gazebo_ros_pkgs/blob/aditya_metrics/gazebo_ros/test/CMakeLists.txt#L62-L70), just to make sure there was no C++ initialization problems. When I rebuilt and ran with that, it still takes a while, but it does eventually fail with the malloc() error.

So I think it is some combination of the gazebo::ServerFixture class, LoadArgs method, gazebo::physics::get_world() call, or world->ModelByName() which is causing the problem. I still have to poke at these more to try to figure out what is going on.

@clalancette
Copy link
Contributor

OK, with some further poking, I figured out that just calling this->LoadArgs is enough to cause the issue.

But I also realized that what this is doing is loading a shared library that does link against ROS 2. So I still don't know whether the issue is in Gazebo or ROS 2, though I've reduced the search space. I'm continuing to debug.

@scpeters
Copy link
Member

OK, with some further poking, I figured out that just calling this->LoadArgs is enough to cause the issue.

But I also realized that what this is doing is loading a shared library that does link against ROS 2. So I still don't know whether the issue is in Gazebo or ROS 2, though I've reduced the search space. I'm continuing to debug.

yeah, I was just about to point out that -s libgazebo_ros_init.so is what loads gazebo_ros

I'll also point out that I've never seen this failure in the foxy CI jobs, only galactic and rolling. If it's a gazebo problem, I would think we'd be able to reproduce it in foxy

@adityapande-1995
Copy link
Author

adityapande-1995 commented Jul 15, 2021

@scpeters just to support, I've run the test more than 1500 times with foxy, and it never failed with the malloc() error. Also, not sure if this is relevant, but debain foxy (which never failed) is running 3.5.3 for gazebo_(dev,msgs, plugins, ros, ros_pkgs) whereas debain galactic is running 3.5.2. Is galactic running an older version ?

Following is the strace output when it fails:

brk(0x5648be2b7000)                     = 0x5648be2b7000
brk(0x5648be2da000)                     = 0x5648be2da000
brk(0x5648be2fb000)                     = 0x5648be2fb000
brk(0x5648be31c000)                     = 0x5648be31c000
brk(0x5648be33d000)                     = 0x5648be33d000
getcwd("/root/dev_ws/build/gazebo_ros/test", 4097) = 35
ioctl(1, TCGETS, {B38400 opost isig icanon echo ...}) = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0), ...}) = 0
write(1, "\33[0;32m[==========] \33[mRunning 1"..., 57[==========] Running 1 test from 1 test suite.
) = 57
write(1, "\33[0;32m[----------] \33[mGlobal te"..., 55[----------] Global test environment set-up.
) = 55
write(1, "\33[0;32m[----------] \33[m1 test fr"..., 53[----------] 1 test from GazeboRosInitTest
) = 53
write(1, "\33[0;32m[ RUN      ] \33[mGazeboRos"..., 50[ RUN      ] GazeboRosInitTest.Commands
) = 50
stat("/root/.gazebo/test-11345", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/root/.gazebo/test-11345/test.log", {st_mode=S_IFREG|0644, st_size=157, ...}) = 0
openat(AT_FDCWD, "/root/.gazebo/test-11345/test.log", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
write(3, "Gazebo multi-robot simulator, ve"..., 157) = 157
stat("/root/.gazebo/test-11345/test.log", {st_mode=S_IFREG|0644, st_size=157, ...}) = 0
mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f7a6d7a1000
mprotect(0x7f7a6d7a2000, 8388608, PROT_READ|PROT_WRITE) = 0
clone(child_stack=0x7f7a6df9ef70, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[2387], tls=0x7f7a6dfa1700, child_tidptr=0x7f7a6dfa19d0) = 2387
clock_nanosleep(CLOCK_MONOTONIC, 0, {tv_sec=0, tv_nsec=100000000}, Gazebo multi-robot simulator, version 11.5.1
Copyright (C) 2012 Open Source Robotics Foundation.
Released under the Apache 2 License.
http://gazebosim.org

0x7ffe3379d090) = 0
clock_nanosleep(CLOCK_MONOTONIC, 0, {tv_sec=0, tv_nsec=100000000}, 0x7ffe3379d090) = 0
clock_nanosleep(CLOCK_MONOTONIC, 0, {tv_sec=0, tv_nsec=100000000}, malloc(): corrupted top size
 <unfinished ...>) = ?
+++ killed by SIGABRT (core dumped) +++
Aborted (core dumped)

@clalancette
Copy link
Contributor

Commenting out https://github.com/ros2/rclcpp/blob/galactic/rclcpp/src/rclcpp/time_source.cpp#L129-L133 seems to fix the issue for me. However, I don't really have any idea why, so I still need to do further debugging here.

@adityapande-1995
Copy link
Author

@clalancette supporting your point that this->LoadArgs() itself is enough to cause the error, I was able to reproduce the same error by just running gzserver -u --verbose -s libgazebo_ros_init.so worlds/free_fall.world. It occurred once in 62 tries. Removing the -u did not help. Could not get it to fail with just gzserver --verbose, so nothing wrong with gazebo probably.

@clalancette
Copy link
Contributor

Going further, the problem seems to stem from this call: https://github.com/ros2/rclcpp/blob/galactic/rclcpp/src/rclcpp/time_source.cpp#L254-L257 . If I comment out just that line (and the other couple of lines that reference the clock_callback_group_), I can no longer reproduce the error. That said, I've looked in some detail into create_callback_group, and nothing is sticking out at me as obviously wrong. More debugging needed.

@clalancette
Copy link
Contributor

All right, I'm fairly sure that this is unprotected access to the callback_groups_ variable from multiple threads. When I add in locks around that, the problem seems to go away (though I'm still testing).

While we are in here, we should probably deprecate or remove get_callback_groups; there isn't really a thread-safe way to access it.

@clalancette
Copy link
Contributor

See ros2/rclcpp#1723, which should fix this issue in Rolling.

This problem doesn't directly exist in Foxy, since we don't have threaded clock handling there. It could exist in theory if some other code manipulated callback groups from multiple threads, but it is probably a fairly rare use case.

Galactic does have the problem. The issue that it is not going to be easy to backport that fix to Galactic in an ABI-preserving way. We'll have to consider it more deeply.

@clalancette
Copy link
Contributor

OK, ros2/rclcpp#1723 was merged for Rolling.

I'm going to leave this one open until we figure out how to backport to Galactic.

@adityapande-1995
Copy link
Author

adityapande-1995 commented Aug 19, 2021

ros2/rclcpp#1741 implements the backport to galactic, and has been merged into main

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants