Codecov metrics for Assisted Teleop / Drive on Heading behaviors not being picked up #3271
Just a small hunch, I think by default …
No, the system tests call the behavior directly.
I was playing around this afternoon in #3305. Forward declaring the template class did not help (I suspected something odd was happening with the template type patterns), nor did matching the public/private functions + destructor file locations (not that I thought that would help, but it was one of the few things different between the 2 failing plugins and the others). The files look to me to be exactly the same otherwise. As for the system tests, the only difference I can find is that drive on heading uses …
Thanks to Steve for pointing this issue out to me on the discussion forums. I'd love to help out, since I've touched code coverage mechanics before in other contexts. Can somebody point me to some materials I can use to get bootstrapped on this? E.g. what packages and what tests are involved, how do I install and run them, what does your workflow for gathering the code cov info look like? Meanwhile, I'll install nav2 and try to play with it some in the next couple of days.
Hi, Oh my god, what a savior! Let me start with some context and resources, then I'll answer your specifics.

So this all started when we moved to 22.04/Humble (and Rolling's cycle to 22.04 too) and noticed that our coverage dropped significantly, due to missing coverage from what seemed to be most (if not all) of the Nav2 System Tests. The unit testing in each package seemed to be fine. We didn't really understand why. I thrashed around a bit and found that if I tested the stack with composed bringup (all of the servers in one process), most of the missing coverage came back.

After that, I noticed that basically the only thing I could easily identify as still missing was in the behavior server, namely the assisted teleop and drive on heading plugins, which oddly enough are also the 2 newest plugins, by years. We have system tests which cover them, but for some reason some or many of the functions being called from them aren't getting picked up. I tried a bunch of stuff, but after 3 days I ran out of time and had to move on to complete my 2022 goals; my plan of action for continuing is above: try a few desperate last-ditch things to see if I can identify where the issue lies (in the behavior package or the system tests).

Resources I've been using:
In addition to the resources above, just the usual build / install instructions for Nav2/ROS2: install Rolling binaries on 22.04 + Nav2 from source in a workspace. These are some instructions for executing individual tests and generating a report using lcov: ros-navigation/docs.nav2.org#88 (comment). Though as I said, I never actually tried using that for this particular issue (yet), so I don't actually know whether that report would be any different.
Ah I see. I've seen a bunch of things which can look like this. Metrics not flushing atexit seemed like the most common one, which indeed could indicate something strange with the process unwind/atexit/exit. Another had to do with weird dynamic linking details. Another is the code coverage tool not adapting to certain optimizations, macros, or code gen, but this code on its face looks too vanilla to be running afoul of those.

I'm working on getting the tests running on my local machine. My intuition is to grep the symbols in the binary and the codecov microdata to bisect the problem. I might end up tinkering with the shutdown/killing process since that does seem like a good starting point. When somebody doesn't have control over a strange shutdown sequence, they may sometimes conditionally compile in manual flushes in some places. That's a big hammer, but in our case it could give us a way to diagnose the problem. StackOverflow example.

Who can help me get to the bottom of why the tests just spin forever around setting the initial pose? Any hints? I was able to invoke the unit tests using the tutorial, and to get code cov data from those (great tutorial!). It seems that the system tests are invoked using some Python. I stripped the command out so I can run the test directly, but it just spins on the initial pose. I double checked all the paths in the arguments and they seem to point to real things. Before this I built the production and test code using the code cov flags from the tutorial. Help?
I'll be on a train all day tomorrow; will try to pick this up later.
Something to note too, looking at the code, is that the DriveOnHeading and BackUp behaviors share a ton of code (BackUp is derived from DriveOnHeading). Of the duplicated code that exists, it's mostly there to enforce protections in the backup version, so an apples-to-apples comparison of them is possible, and as far as I can tell they are exactly the same. Yet one reports its coverage results and the other is missing some.
That sounds like Gazebo might not be running. It might be worth running those tests via CMake instead of invoking them manually, but that's mostly my "I don't really know test/coverage tools that well, so don't break what works" speaking.
Yes, many are based on launch files, with a test script and a setup script (tester and the nav stack bringup). The launch API has a test API so that once the tester exits, it takes down the setup as well (e.g. nav stack). We use that for the system tests to launch the nav stack, or a part of it we're interested in, alongside a tester.py script.
I was able to recreate the issue today, finally. Took some setup time :) I can reproduce the same basic cov numbers you linked me to, using:
Adding sleeps to TearDownTestCase in the relevant system tests didn't seem to give gcov a chance to flush. Not sure if it flushes periodically or at process exit only. Next I'll try a manual __gcov_flush if it's straightforward, which will bisect the problem nicely (a rough sketch of what I mean is below).
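For reference, a conditionally compiled manual flush might look roughly like the sketch below. This is a generic example, not Nav2 code: the guard macro `NAV2_MANUAL_GCOV_FLUSH` is made up for illustration, and the gcov symbol name varies by compiler version (`__gcov_flush()` before GCC 11, `__gcov_dump()` from GCC 11 onward, which is what 22.04 ships).

```cpp
// Sketch only: manually dump gcov counters without waiting for atexit().
// NAV2_MANUAL_GCOV_FLUSH is a hypothetical macro name for this example.
#ifdef NAV2_MANUAL_GCOV_FLUSH
extern "C" void __gcov_dump(void);  // provided by the gcov runtime (GCC 11+)
#endif

inline void maybeFlushCoverage()
{
#ifdef NAV2_MANUAL_GCOV_FLUSH
  // Writes the .gcda counter files for everything covered so far,
  // even if the process later dies without running its exit handlers.
  __gcov_dump();
#endif
}
```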
Great! I can't say how much I appreciate your work here. This is much, much deeper into gcov/lcov than I would have been able to go.
Returning to this today... I think I've effectively proven that the gcov dump process, triggered atexit by __gcov_exit(), is not running as it should. So Steve's theory seems correct. Using onRun() as a reference, I can add __gcov_exit there, which effectively guarantees we see the onRun call in progress (see screenshot below). Using log lines trivially proves the code is still being covered.

I don't really understand the ros2/launch or nav2 launch stuff yet; I just glanced at it briefly. Likewise with ROS node signal handlers etc. That's clearly where my focus should go next.

@SteveMacenski has anybody tried to bisect code changes on the ROS or Nav2 sides to see if this is due to a code change, or is the behavior simply different on 22.04? (I don't want to set up a 20.04 VM to find out.) If there is a chance it's a code change, can you give me the hash of the last-known-good build?
Uncaught exception in the teardown of the behavior server causing us to drop the process on the floor?
It seems that there is a single global "default context". Attached to it is an rcl_context_t which we invalidate with a call to rcl_shutdown towards the end of the test. Then in the LifecycleNode destructor path somewhere we apparently try to use that same context again while creating a Timer, which is what is tripping the exception. Note the backtrace shows the Nav2 LifecycleNode destructor, not the ROS parent class destructor, but I'm not clearly seeing this code path yet. I'll look for it when I'm back online.
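If that reading is right, the failure mode would look something like this hand-rolled sketch (not Nav2 code, just an approximation of the pattern described above): entities created against the global default context after it has been invalidated by rclcpp::shutdown() will typically throw from the rcl layer.

```cpp
#include <chrono>
#include <iostream>
#include <memory>

#include "rclcpp/rclcpp.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("context_demo");

  // Invalidate the global default context, as the test teardown does.
  rclcpp::shutdown();

  try {
    // Anything that still holds the node and tries to create a timer now
    // (e.g. a destructor running after shutdown) trips an exception,
    // because the underlying rcl_context_t is no longer valid.
    auto timer = node->create_wall_timer(std::chrono::seconds(1), []() {});
  } catch (const std::exception & e) {
    std::cerr << "caught during late construction: " << e.what() << std::endl;
  }
  return 0;
}
```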
In summary: if the log lines are telling the truth, we are hitting an exception during shutdown, which could perhaps explain why we don't get a gcov dump with all the goods. Need to spelunk this Bond code. Additionally: we should consider whether our system tests ought to check for a 0 exit code here, since if this is a real bug we shouldn't be relying on declining code cov numbers to catch it.
Also, could somebody link me to the test logs for one of the bunk code cov runs, and also for the last-known-good run? Specifically I want the events.log that I see spit out in my sandbox runs, i.e. …
For the assisted teleop file you show, there has never been a 'good build' where the coverage was displayed properly, to my knowledge. I believe the first time I noticed, I added some prints to show that the functions were being called so I didn't hold it against the PR contributor since it wasn't something they did.
Semi-related question for my personal education: I've used GDB to grab backtraces on crashes when the cause is non-obvious in my development (my tutorial: https://navigation.ros.org/tutorials/docs/get_backtrace.html), but I haven't used it to try to debug shutdown failures before. Is there anything semantically different going on there, or is it the same old process? It honestly did not occur to me to apply the tool that way.

On topic: it is interesting to note, as I did in the longer comment, that when I switched from non-composed bringup (e.g. 1 server per process) to composed bringup (e.g. all servers in 1 process), we got a ton of our coverage results back. So I'm curious to know if the exception that's being raised on shutdown is unique to the Behavior Server, where we see that issue. If it's unique to this server, then that means there's something we can do in the code to fix it. Also, why would some of the behavior server tests give us results when others don't? Might be another clue.
Bond basically creates a set of publishers and subscriptions between 2 servers to make sure the other is still alive by passing heartbeats (a rough sketch of what that looks like is below). I only use it single-directionally, so that the Lifecycle Manager is checking if the server is still alive…

In a given CI run in Circle you can go to the … @ruffsl why would that file be missing? We run the artifacts collection stage before we run the Codecov collection, so is that the single reason why, or should we also move that directory to be something that the Artifacts are picking up?
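For anyone unfamiliar with the mechanism, here is a rough sketch of what creating such a bond looks like with bondcpp, loosely modeled on what the lifecycle nodes do. It assumes the ROS 2 bondcpp API of `Bond(topic, id, node)`, `setHeartbeatPeriod()`/`setHeartbeatTimeout()`, and `start()`; the timing values are illustrative, not Nav2's exact configuration.

```cpp
#include <memory>
#include <string>

#include "bondcpp/bond.hpp"
#include "rclcpp/rclcpp.hpp"

// Each side of the connection creates a Bond on the same topic with a
// matching id, publishes heartbeats, and watches for the peer's heartbeats.
std::unique_ptr<bond::Bond> createServerBond(rclcpp::Node::SharedPtr node)
{
  auto bond = std::make_unique<bond::Bond>("bond", node->get_name(), node);
  bond->setHeartbeatPeriod(0.1);   // seconds between heartbeats (illustrative)
  bond->setHeartbeatTimeout(4.0);  // declare the peer dead after this long
  bond->start();
  return bond;
}
```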
Typing on my phone, so apologies for the brevity. I'll look back at this in a few days.
You can probably use 'break exit' or 'break _exit' or some such, then step forward from there. You can breakpoint signal handlers too in that way, which is what you'd want here.
Yeah that is interesting. If my hypothesis is correct, then either a) we need to not be killing the global context before users of it have a chance to clean up, which would be a ROS bug I think; or b) we need to ensure the bond code isn't creating things during destruction, which may or may not be a Nav2-side fix. Whether or not this is our code cov issue, it certainly doesn't seem intended or correct? But I'm new here so I'm not sure!
Yeah, so this isn't the fix, but it gets us nearly full coverage on the assisted_teleop test. This just drops the Bond on the ground and never breaks it properly, which presumably shouldn't be the real bug fix. What this demonstrates is that Bond destruction is the source of the problem. It looks like on the ROS side there is a "pre-shutdown callback" which can be registered on the global Context, which gives us an opportunity to break bonds properly before things start going sideways. That could be the real fix. Thank you to the authors for that mechanism; they added it in 2021. ros2/rclcpp#1706
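For reference, registering such a pre-shutdown callback on the global default context might look roughly like the sketch below. The cleanup body is a placeholder; in Nav2 it would be whatever actually breaks the bond.

```cpp
#include <functional>

#include "rclcpp/rclcpp.hpp"
#include "rclcpp/contexts/default_context.hpp"

// Register a callback that runs before the global default context is torn
// down, so ROS entities (publishers, timers, bonds) can still be used safely.
void registerPreShutdownCleanup(std::function<void()> cleanup)
{
  auto context = rclcpp::contexts::get_global_default_context();
  context->add_pre_shutdown_callback(
    [cleanup]() {
      cleanup();  // e.g. break/reset the bond before rcl_shutdown runs
    });
}
```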
Ensure that we break bonds in lifecycle node/manager prior to rcl_shutdown. Partial fix for Issue ros-navigation#3271
There we go. I haven't fixed all of the exceptions, which means there is still some code cov we aren't seeing. It looks like it's going to be a case of whack-a-mole, unfortunately. You can see the commit for this first portion of the fix here. I'll put a PR up after I whack the other instances I've seen, and after writing some unit tests, but before doing anything else with the tests to catch these.
The collection and upload of code coverage are steps invoked after the workspace caching and artifact upload (of only logs and test result files) steps. This was done so that code coverage could be handled separately, depending upon whether the job was a release build or a debug build. The coverage info file can still be downloaded from the job's codecov report page, as that is where it's uploaded (and tagged by respective rmw). I didn't see the need to archive duplicate report files across separate hosting platforms.
Interestingly, this may be related to ros2/rclcpp#2083, so it's worth bringing up on the composition-results-disappear part of things. The nasty kill from composition may be due to the container (or more likely, that we're holding onto shared pointer instances and they're not being reset on destruction properly).
That seemingly makes sense 😆
We should be calling that in all of the task server nodes on deactivation, but maybe these experiments aren't bringing the lifecycle back down to a safe state? If so, then that destruction isn't called, so adding it to the …
The release vs reset on the unique_ptr should have the same result for us either way. Thanks so much for your time on this! I'm still super unclear why some work and some don't in this case, given that it's a shared issue, but I won't argue with progress 🥇
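(As an aside, here is the general release() vs reset() distinction being referred to, in plain standard C++ rather than anything Nav2-specific: reset() destroys the managed object immediately, while release() only relinquishes ownership and leaves destruction to whoever takes the raw pointer.)

```cpp
#include <iostream>
#include <memory>

struct Noisy
{
  ~Noisy() { std::cout << "Noisy destroyed\n"; }
};

int main()
{
  auto a = std::make_unique<Noisy>();
  a.reset();                   // destructor runs here

  auto b = std::make_unique<Noisy>();
  Noisy * raw = b.release();   // no destructor call; we now own the raw pointer
  delete raw;                  // destructor runs here instead
  return 0;
}
```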
Great points @SteveMacenski. I will say: some of these are stochastic bugs for whatever reason. I see some of them every other run. It would also not shock me if gcov flushed on events other than just exit, s.t. we see some coverage sometimes.
I should perhaps relax the composition change soon and see what happens! Hopefully we don't have more shutdown bugs lurking than I've seen so far... Now I've whacked four of these shutdown bugs, which is making me want to check back in with you all to make sure I'm not just chasing ghosts in my local clone here... I won't be shocked if you tell me that ROS and Nav2 never really prioritized getting clean shutdowns, and so we get these code cov issues as a result of that. I also won't be shocked if you tell me "Matthew, you aren't going to be able to fix all the shutdown issues" or "you're setting it up wrong" :)
When I ran this and then used Ctrl+C, I got a segfault (stack below). Applying this change, the segfault goes away:
Could somebody else confirm that behavior, i.e. that they see the segfault on shutdown, and don't with my change? I see at least one more of these issues to whack. If you all think I'm not chasing ghosts here, I can keep tracking down at least some of the others.
Ensure planner_server shuts down cleanly on SIGINT. Needed to ensure code coverage flushes. See Issue ros-navigation#3271.
Ensure costmap_2d_cloud does not seg fault on SIGINT. One of several clean shutdown issues being addressed via Issue ros-navigation#3271.
That's exactly the idea, yeah, except that the destructor fires too late (after rclcpp::shutdown()) in the case where we are letting the dtor fire at scope end. So: we need to leverage pre-shutdown callbacks to do the cleanup where onDeactivate isn't being called (which is presumably some of these SIGINT cases). Feel free to add me as some sort of Collaborator with restricted permissions if you want to assign this to me!
With composition=True I seem to hit ros2/rclcpp#2083. So it seems we have at least two issues, and the composition change worked around one of them.
Pre-composition, we did 😄. Post-composition, I've been a little preoccupied with other package developments, so I haven't had the time to surface and look into them, since no one has reported an issue with it (yet).
I've literally never used some of those standalone nodes in Costmap2D before; they're legacy from ROS 1, ported over here. They're inarguably useful, so they stay around, but I rarely find a reason to visualize the costmaps in that way. So for any issues with those standalone nodes, it's likely that very few people interact with them enough to be aware. The "main servers" in the launch files are what's typically used.
I know that in classes, the default destructor will destroy members in the reverse order they're stored in the class. I would assume (?) that the same is true for globals, though I program with globals nearly never, so I've never asked myself that question. You may get a similar result just by reversing the ordering they're declared in when the …
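(For what it's worth, a small plain-C++ illustration of that point: class members are destroyed in reverse order of declaration, and objects with static storage duration within one translation unit are destroyed in reverse order of their construction.)

```cpp
#include <iostream>

struct Member
{
  explicit Member(const char * name) : name_(name) {}
  ~Member() { std::cout << name_ << " destroyed\n"; }
  const char * name_;
};

struct Holder
{
  Member first{"first"};    // declared first, destroyed last
  Member second{"second"};  // declared second, destroyed first
};

int main()
{
  Holder h;
  return 0;
  // Output at scope exit:
  //   second destroyed
  //   first destroyed
}
```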
I'm recompiling a clean workspace now due to some API breaks from updates; I'll let you know in like an hour.
That seems.... very, very problematic, considering not all ROS nodes are lifecycle nodes with the deactivation/cleanup stages. I don't think you should see this behavior when you use …
OK! Will do. Hopefully I'll get a few minutes today to go through and fix the Theta* issue; that popped up about 10-14 days ago. I'll fix that + the annoying mergify popup that incorrectly flags messed-up formatting of the PR header. There are some massive coverage misses that are new and unusual in that PR.
* Ensure that we break bonds in lifecycle node/manager prior to rcl_shutdown. Partial fix for Issue #3271
* Add unit tests for lifecycle_node rcl shutdown cb. For Issue #3271.
* Ensure planner_server shuts down cleanly on SIGINT. Needed to ensure code coverage flushes. See Issue #3271.
* Ensure costmap_2d_cloud does not seg fault on SIGINT. One of several clean shutdown issues being addressed via Issue #3271.
* Fix flake8 issues in some test files

Signed-off-by: mbryan <matthew.bryan@dell.com>
Co-authored-by: mbryan <matthew.bryan@dell.com>
Thanks to @mmattb for some really awesome debugging and fixes that took us all the way down to rclcpp with some of these issues!
It was a pleasure Steve! Thanks for letting me jump in. |
CC @jwallace42
This is especially odd since the other plugins (wait, spin, backup) are getting their metrics reported correctly, and backup uses a lot of drive on heading as part of its code. I can't seem to find any reason for this after a couple of hours of initial debugging.