Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incomplete closed subscriber initial_pose_sub_ in nav2_amcl may cause use-after-free bug #4166

Closed
GoesM opened this issue Mar 6, 2024 · 7 comments

Comments

@GoesM
Copy link
Contributor

GoesM commented Mar 6, 2024

Bug report

Required Info:

  • Operating System:
    • Ubuntu22.04
  • ROS2 Version:
    • humble
  • Version or commit hash:
    • the latest
  • DDS implementation:
    • defaulted

Steps to reproduce issue

Launch the navigation2 normally, as following steps:

#!/bin/bash
export ASAN_OPTIONS=halt_on_error=0:new_delete_type_mismatch=0:detect_leaks=0:log_pah=asan
source install/setup.bash
export TURTLEBOT3_MODEL=waffle
export GAZEBO_MODEL_PATH=$GAZEBO_MODEL_PATH:/opt/ros/humble/share/turtlebot3_gazebo/models
ros2 launch nav2_bringup tb3_simulation_launch.py headless:=True use_rviz:=False use_composition:=False 

Curious about how navigation2's response to received interference-messages during work, I keep sending messages into topic /initial_pose at intervals.

Finally sent Ctrl+C to shutdown navigation2, which is before stop the msg-sending.

An ASAN report file was discovered in my execution environment.

Expected behavior

Actual behavior

The ASAN reporting a use-after-free bug to me, as following:

=================================================================
==233622==ERROR: AddressSanitizer: heap-use-after-free on address 0x7fe967f0b800 at pc 0x7fe98ff3f066 bp 0x7fffbd145470 sp 0x7fffbd145468
READ of size 4 at 0x7fe967f0b800 thread T0
    #0 0x7fe98ff3f065 in pf_kdtree_insert (/home/*****nav2_humble/install//nav2_amcl/lib/libpf_lib.so+0xb065) (BuildId: 73e413561c1948e30ca366fe29aeef8e601b70f9)
    #1 0x7fe98ff39bc5 in pf_init (/home/*****nav2_humble/install//nav2_amcl/lib/libpf_lib.so+0x5bc5) (BuildId: 73e413561c1948e30ca366fe29aeef8e601b70f9)
    #2 0x7fe99056aba7 in nav2_amcl::AmclNode::handleInitialPose(geometry_msgs::msg::PoseWithCovarianceStamped_<std::allocator<void> >&) (/home/*****nav2_humble/install//nav2_amcl/lib/libamcl_core.so+0x36aba7) (BuildId: fcf524e51612a7063856b255b0200fa30f1c8a67)
    #3 0x7fe9905685d5 in nav2_amcl::AmclNode::initialPoseReceived(std::shared_ptr<geometry_msgs::msg::PoseWithCovarianceStamped_<std::allocator<void> > >) (/home/*****nav2_humble/install//nav2_amcl/lib/libamcl_core.so+0x3685d5) (BuildId: fcf524e51612a7063856b255b0200fa30f1c8a67)
    #4 0x7fe9907701ee in void std::__invoke_impl<void, void (nav2_amcl::AmclNode::*&)(std::shared_ptr<geometry_msgs::msg::PoseWithCovarianceStamped_<std::allocator<void> > >), nav2_amcl::AmclNode*&, std::shared_ptr<geometry_msgs::msg::PoseWithCovarianceStamped_<std::allocator<void> > > >(std::__invoke_memfun_deref, void (nav2_amcl::AmclNode::*&)(std::shared_ptr<geometry_msgs::msg::PoseWithCovarianceStamped_<std::allocator<void> > >), nav2_amcl::AmclNode*&, std::shared_ptr<geometry_msgs::msg::PoseWithCovarianceStamped_<std::allocator<void> > >&&) (/home/*****nav2_humble/install//nav2_amcl/lib/libamcl_core.so+0x5701ee) (BuildId: fcf524e51612a7063856b255b0200fa30f1c8a67)
    #5 0x7fe99079e1f2 in auto rclcpp::AnySubscriptionCallback<geometry_msgs::msg::PoseWithCovarianceStamped_<std::allocator<void> >, std::allocator<void> >::dispatch(std::shared_ptr<geometry_msgs::msg::PoseWithCovarianceStamped_<std::allocator<void> > >, rclcpp::MessageInfo const&)::'lambda'(auto&&)::operator()<std::function<void (std::shared_ptr<geometry_msgs::msg::PoseWithCovarianceStamped_<std::allocator<void> > >)>&>(auto&&) const (/home/*****nav2_humble/install//nav2_amcl/lib/libamcl_core.so+0x59e1f2) (BuildId: fcf524e51612a7063856b255b0200fa30f1c8a67)
    #6 0x7fe99079b23b in rclcpp::AnySubscriptionCallback<geometry_msgs::msg::PoseWithCovarianceStamped_<std::allocator<void> >, std::allocator<void> >::dispatch(std::shared_ptr<geometry_msgs::msg::PoseWithCovarianceStamped_<std::allocator<void> > >, rclcpp::MessageInfo const&) (/home/*****nav2_humble/install//nav2_amcl/lib/libamcl_core.so+0x59b23b) (BuildId: fcf524e51612a7063856b255b0200fa30f1c8a67)
    #7 0x7fe9907799ec in rclcpp::Subscription<geometry_msgs::msg::PoseWithCovarianceStamped_<std::allocator<void> >, std::allocator<void>, geometry_msgs::msg::PoseWithCovarianceStamped_<std::allocator<void> >, geometry_msgs::msg::PoseWithCovarianceStamped_<std::allocator<void> >, rclcpp::message_memory_strategy::MessageMemoryStrategy<geometry_msgs::msg::PoseWithCovarianceStamped_<std::allocator<void> >, std::allocator<void> > >::handle_message(std::shared_ptr<void>&, rclcpp::MessageInfo const&) (/home/*****nav2_humble/install//nav2_amcl/lib/libamcl_core.so+0x5799ec) (BuildId: fcf524e51612a7063856b255b0200fa30f1c8a67)
    #8 0x7fe99159b7bb in rclcpp::Executor::execute_subscription(std::shared_ptr<rclcpp::SubscriptionBase>) (/opt/ros/humble/lib/librclcpp.so+0xe77bb) (BuildId: 4cca8a387f3c93d38a0567a8efc7cba9106f5d9a)
    #9 0x7fe99159bfbe in rclcpp::Executor::execute_any_executable(rclcpp::AnyExecutable&) (/opt/ros/humble/lib/librclcpp.so+0xe7fbe) (BuildId: 4cca8a387f3c93d38a0567a8efc7cba9106f5d9a)
    #10 0x7fe9915a38af in rclcpp::executors::SingleThreadedExecutor::spin() (/opt/ros/humble/lib/librclcpp.so+0xef8af) (BuildId: 4cca8a387f3c93d38a0567a8efc7cba9106f5d9a)
    #11 0x7fe9915a3ac4 in rclcpp::spin(std::shared_ptr<rclcpp::node_interfaces::NodeBaseInterface>) (/opt/ros/humble/lib/librclcpp.so+0xefac4) (BuildId: 4cca8a387f3c93d38a0567a8efc7cba9106f5d9a)
    #12 0x55d34ed16975 in main (/home/*****nav2_humble/install//nav2_amcl/lib/nav2_amcl/amcl+0xeb975) (BuildId: 9cab2dfb4fd0f7edff9ad0c4896458037daa0009)
    #13 0x7fe98f829d8f in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #14 0x7fe98f829e3f in __libc_start_main csu/../csu/libc-start.c:392:3
    #15 0x55d34ec56514 in _start (/home/*****nav2_humble/install//nav2_amcl/lib/nav2_amcl/amcl+0x2b514) (BuildId: 9cab2dfb4fd0f7edff9ad0c4896458037daa0009)

0x7fe967f0b800 is located 0 bytes inside of 453076200-byte region [0x7fe967f0b800,0x7fe982f21ce8)
freed by thread T1 here:
    #0 0x55d34ecd90b2 in free (/home/*****nav2_humble/install//nav2_amcl/lib/nav2_amcl/amcl+0xae0b2) (BuildId: 9cab2dfb4fd0f7edff9ad0c4896458037daa0009)
    #1 0x7fe98ff3d938 in pf_kdtree_free (/home/*****nav2_humble/install//nav2_amcl/lib/libpf_lib.so+0x9938) (BuildId: 73e413561c1948e30ca366fe29aeef8e601b70f9)
    #2 0x7fe991497b8c  (/opt/ros/humble/lib/librclcpp_lifecycle.so+0x28b8c) (BuildId: e9b8e454bf87aaab775667b79aefcab12c018de9)

previously allocated by thread T0 here:
    #0 0x55d34ecd9548 in __interceptor_calloc (/home/*****nav2_humble/install//nav2_amcl/lib/nav2_amcl/amcl+0xae548) (BuildId: 9cab2dfb4fd0f7edff9ad0c4896458037daa0009)
    #1 0x7fe98ff3d8c7 in pf_kdtree_alloc (/home/*****nav2_humble/install//nav2_amcl/lib/libpf_lib.so+0x98c7) (BuildId: 73e413561c1948e30ca366fe29aeef8e601b70f9)

Thread T1 created by T0 here:
    #0 0x55d34ecc27dc in __interceptor_pthread_create (/home/*****nav2_humble/install//nav2_amcl/lib/nav2_amcl/amcl+0x977dc) (BuildId: 9cab2dfb4fd0f7edff9ad0c4896458037daa0009)
    #1 0x7fe98fcdc328 in std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) (/lib/x86_64-linux-gnu/libstdc++.so.6+0xdc328) (BuildId: e37fe1a879783838de78cbc8c80621fa685d58a2)

SUMMARY: AddressSanitizer: heap-use-after-free (/home/*****nav2_humble/install//nav2_amcl/lib/libpf_lib.so+0xb065) (BuildId: 73e413561c1948e30ca366fe29aeef8e601b70f9) in pf_kdtree_insert
Shadow bytes around the buggy address:
  0x0ffdacfd96b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0ffdacfd96c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0ffdacfd96d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0ffdacfd96e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0ffdacfd96f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x0ffdacfd9700:[00]00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0ffdacfd9710: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0ffdacfd9720: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0ffdacfd9730: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0ffdacfd9740: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0ffdacfd9750: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==233622==ABORTING

Additional information


We found a pointer named initial_pose_sub created by nav2_amcl, which was the only entry to function initialPoseRecieved(), amcl_node.cpp #L1528-L1530
Also, focus on code lines amcl_node.cpp#L356-L357 , the pointer pf_ would be freed by pf_free(pf_).

  pf_free(pf_);
  pf_ = nullptr;

Thus, if the initial_pose_sub_ is processing a recieved msg util the pf_free(pf_) has been done, initial_pose_sub_ initiates the following function calls: initialPoseReceived() -> handleInitialPose() -> pf_init(), and finally use the freed pointer pf_ in this line amcl_node.cpp#L601

HOWEVER, it's noticed that initial_pose_sub_ has been freed before pf_free(pf_). amcl_node.cpp #L329-357

It's strange. Does it mean that releasing pointer like **_sub_.reset() would not cancel and exit the subscriber's ongoing work task?

a suggestion that may be helfpful:

  • we need a method to make sure all of ongoing work of a subscriber canceled or finished, especially after the pointer of this subscriber is freed by **_sub_.reset() ?
@GoesM
Copy link
Contributor Author

GoesM commented Mar 8, 2024

Additional information

To make sure whether there's any possibility that a subscriber's ongoing work would completely shutdown and exit after **_sub_.reset(), nowadays, I do something following:

Part 1. for experiments

To make sure if subscriber's callback-function like handlerInitialPose() would keep going after initial_pose_sub_.reset(),

code insert

I insert some code for log as following:

1> in function handleInitialPose(), I insert a sleep function to extend function execution period, and log the working state.

void
AmclNode::handleInitialPose(geometry_msgs::msg::PoseWithCovarianceStamped & msg)
{
    //log insert
      RCLCPP_INFO(get_logger(),"[debug]:in handleInitialPose()");
      std::this_thread::sleep_for(std::chrono::seconds(1));// sleep for 1s
      RCLCPP_INFO(get_logger(),"[debug]:still in handlerInitialPose()");
    //insert end
...
...

2> In function on_cleanup(), I insert log to show the reset() has been done.

nav2_util::CallbackReturn
AmclNode::on_cleanup(const rclcpp_lifecycle::State & /*state*/)
{
  RCLCPP_INFO(get_logger(), "Cleaning up");
...
...
  initial_pose_sub_.reset();
    //log insert
      RCLCPP_INFO(get_logger(),"[debug]:initial_pose_sub_.reset() has been done");
    //insert end
...

Thus, if there's a log "still in handlerInitialPose()" occurs after initial_pose_sub_.reset() has been done, it could prove that **sub_.reset() could not make sure all of the subscriber's ongoing work finished.

how I use nav2 within inserted code.

  • step 1. launch navigation2 normally
  • step 2. sending msg onto topic /initialpose within 20HZ
  • step 3. Ctrl+C to navigation2 randomly

experimental result

try for many times, I find some log of amcl as following:

[INFO] [1709868236.040625871] [amcl]: initialPoseReceived
[INFO] [1709868236.040660754] [amcl]: [debug]:in handleInitialPose()
[INFO] [1709868236.221511090] [rclcpp]: signal_handler(signum=2)
[INFO] [1709868236.221737977] [amcl]: Running Nav2 LifecycleNode rcl preshutdown (amcl)
[INFO] [1709868236.221855760] [amcl]: Deactivating
[INFO] [1709868236.221876835] [amcl]: Destroying bond (amcl) to lifecycle manager.
[INFO] [1709868236.323455736] [amcl]: Cleaning up
[INFO] [1709868236.324634485] [amcl]: [debug]:initial_pose_sub_.reset() has been done
[INFO] [1709868237.040899303] [amcl]: [debug]:still in handlerInitialPose()
[WARN] [1709868237.041050883] [amcl]: Failed to transform initial pose in time (Lookup would require extrapolation into the future.  Requested time 1709868122.134358 but the latest data is at time 7.821000, when looking up transform from frame [odom] to frame [base_footprint])
[INFO] [1709868237.041177392] [amcl]: Setting pose (7.791000): -2.094 -0.466 0.081
[INFO] [1709868237.044379407] [amcl]: Destroying bond (amcl) to lifecycle manager.

it's obvious that here's a log still in handlerInitialPose() after initial_pose_sub_.reset() has been done,

which could prove that reset the sub_ pointer could not completely shutdown subscriber's ongoing work , at least, no block here.

Part 2. for logic of the code

reset() seems like a function provided by the class shared_ptr, which is just used for free(ptr) and ptr=null,

while it has no design to make sure that all of the subscriber's ongoing work being finished after reset()

No, a reset will need to wait for a callback complete, but it should block until then - I believe - such that those later lines still cannot be executed

I guess you mean reset() will execute the destruct function of the class subscription, in which it would shutdown callback ? would it wait/block until the callback finished ? if only shutdown and no wait/block here, it's still possible to keep ongoing work after reset(). ( there may be no block here, I think, because of the lack of my reference about how the destruct function is designed )

@SteveMacenski
Copy link
Member

SteveMacenski commented Mar 8, 2024

I thought the subscription class in the destructor would block until complete, but it does not appear that it does after closer inspection of the source code and your logs. I think filing a ticket with rclcpp that the subscriber's destructor does not wait until currently processing callbacks are exited is a good thing to note.

I am, however and separately now that I'm thinking about this more, a little unclear how it is that the subscription callback is possibly still executing if we're executing on the on_cleanup callback. I don't think we construct any new Node or Executor in AMCL or in the nav2_util/LifecycleNode or rclcpp/LifecycleNode, so they should be sharing a single threaded callback group. It shouldn't be possible for those to both be executing at the same time unless we introduce a new thread somewhere that's processing the subscription vs the lifecycle transitions. @clalancette you have more familiarity with the internals of rclcpp, do you know if the lifecycle transitions are on another callback group or executor from the default provided by the node? Or is that almost certainly something in Nav2 that we're doing?

@GoesM
Copy link
Contributor Author

GoesM commented Mar 11, 2024

I think filing a ticket with rclcpp that the subscriber's destructor does not wait until currently processing callbacks are exited is a good thing to note.

I've opened ISSUE filing this ticket with rclcpp : ros2/rclcpp#2447.

By the way , I have had a try to understand how the class subscription designed, however it's really strange that I didn't find any destructor in subscription.cpp of rclcpp stack.... ( Perhaps because I'm too unfamiliar with rclcpp?..... : (

@mjcarroll
Copy link

I took a look based on the ros2/rclcpp#2447 issue, and I don't think this is related to destructors.

Rather, it looks to me that the AmclNode::on_cleanup is cleaning up a particle filter while the subscription for initial_pose is still receiving messages and firing callbacks.

Based on the ASAN trace there, the on_cleanup callback was fired on one thread while the subscription callback was fired on another thread.

It appears that there is a second single threaded executor being spawned in the on_configure method here: https://github.com/ros-planning/navigation2/blob/47374622dee01a27e5f9b8ae08f3d19a15de9b3a/nav2_amcl/src/amcl_node.cpp#L249-L251

@GoesM
Copy link
Contributor Author

GoesM commented Mar 11, 2024

Based on the ASAN trace there, the on_cleanup callback was fired on one thread while the subscription callback was fired on another thread.

yep, I believe so as well.

So I think it would be a solution if we could let the subscription callback thread joint into the on_cleanup thread after sub_.reset() ? so that we could make sure the subscription exit at all.

@mjcarroll
Copy link

Yes, there are several moving pieces here, but I think that once you have reset all of the subscriptions, you should wait to make sure that the executor thread has actually joined before proceeding with the rest of the cleanup.

@SteveMacenski
Copy link
Member

SteveMacenski commented Mar 11, 2024

That is my bad then! I missed that second executor, AMCL isn't something I've spent alot of time looking at since the original port back in 2018. That's something I should have caught before looking you guys in, sorry about that.

If we executor_thread_.reset() that'll cancel and join the thread. Do that before the subscription reset and we should be good to go.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants