
Address code cov loss due to unclean shutdown (issue #3271) #3377

Merged
merged 5 commits into from
Feb 1, 2023

Conversation

mmattb
Contributor

@mmattb mmattb commented Jan 20, 2023


Basic Info

Info Please fill out this column
Ticket(s) this addresses 3271
Primary OS tested on Ubuntu 22.04
Robotic platform tested on nav2_system_tests (i.e. turtlebot3_waffle?)

Description of contribution in a few bullet points

In brief: unclean shutdown of several processes resulted in code coverage data being dropped on the floor, rather than flushed to the .gcda files. So to get the code coverage data, we fix the unclean shutdowns. There are a few instances fixed in this PR, but likely not all (see Future work below).

The gist of all these fixes is that rclcpp::shutdown() invalidates a bunch of state, and if something attempts to touch that state again after the shutdown(), exceptions and seg faults can occur, resulting in unclean shutdown. rclcpp provides a preshutdown callback mechanism for use cases like these. The fixes are:

  • Add a preshutdown callback to lifecycle_node which tosses bond_. If a lifecycle_node is destructed after shutdown() while the bond_ still exists, the Bond destructor will fire, which touches invalidated state. The result is an RCLError in the destructor code path. The fix is to drop the bond_ prior to shutdown, i.e. in the preshutdown callback.
  • Subclass the preshutdown callback in planner_server in order to drop additional state, for the same reason.
  • Address a segfault in nav2_costmap_2d_cloud main() by manually destructing some global state prior to shutdown(). This one could not be reproduced on @SteveMacenski's machine, so it may be version dependent? Not sure.

The first two of these appeared in the CircleCI output during code cov runs, as well as in my local sandbox, and fixing them drastically improved code coverage on the Nav2 plugins. I have not spotted the third in the CircleCI logs (though I haven't tried very hard), but it's low enough risk to justify including, on the reasoning that it's probably version dependent somehow.

Description of documentation updates required from your changes

None, unless we want to give an example of a preshutdown callback?


Future work that may be required in bullet points

  • Additional shutdown bugs are likely lurking, but they seem hard to pin down, and it's unclear what is due to this developer's local setup.
  • We should probably change our nav2_system_tests to fail if unexpected exit codes are encountered, in order to catch the issues in the prior bullet point.
  • Because of the shutdown issues or other reasons, we may still be missing code coverage somewhere. Additional work could be done to find where. This present fix addresses Assisted Teleop and Drive-on-Heading coverage issues, and likely many others, but no effort was made to be systematic about finding all such issues.

For Maintainers:

  • Check that any new parameters added are updated in navigation.ros.org
  • Check that any significant change is added to the migration guide
  • Check that any new features OR changes to existing behaviors are reflected in the tuning guide
  • Check that any new functions have Doxygen added
  • Check that any new features have test coverage
  • Check that any new plugins are added to the plugins page
  • If BT Node, Additionally: add to BT's XML index of nodes for groot, BT package's readme table, and BT library lists

@mmattb mmattb changed the title Address code cov loss due to inclean shutdown (issue #3271) Address code cov loss due to unclean shutdown (issue #3271) Jan 20, 2023
@mergify
Contributor

mergify bot commented Jan 20, 2023

@mmattb, please properly fill in PR template in the future. @SteveMacenski, use this instead.

  • Check that any new parameters added are updated in navigation.ros.org
  • Check that any significant change is added to the migration guide
  • Check that any new features OR changes to existing behaviors are reflected in the tuning guide
  • Check that any new functions have Doxygen added
  • Check that any new features have test coverage
  • Check that any new plugins is added to the plugins page
  • If BT Node, Additionally: add to BT's XML index of nodes for groot, BT package's readme table, and BT library lists

@mmattb
Contributor Author

mmattb commented Jan 20, 2023

I don't have permissions to set labels on this PR, but anyway please consider this one a WiP. I'm working out the uncrustification, and I'll take a stab at one more shutdown issue that I see once in a blue moon.

@mmattb
Contributor Author

mmattb commented Jan 20, 2023

Mixed results. Some tests failed. Are these the intermittent failures you noted in the Issue @SteveMacenski? If not, I have some chasing to do...

@SteveMacenski
Member

This one could not be reproduced on @SteveMacenski's machine, so it may be version dependent? Not sure.

What version are you on? I'm using Rolling on 22.04.

Are these the intermittent failures you noted in the Issue @SteveMacenski? If not, I have some chasing to do...

The Theta* test failure is something that popped up very recently this year. I'll get to that today, hopefully, to fix. The SingleTriggerTestFixture failure is caused by a regression in BT.CPP whose fix is already propagating through releases.

Looking at the codecov results from this, there is a ton of code now no longer covered - which is unusual. It's not unusual for our coverage to vary a few tenths of a percent from run to run, but this is pretty significant, with files losing ~60%+ of their coverage metrics.

@mmattb
Contributor Author

mmattb commented Jan 21, 2023

Good to know. I think that waypoint follower blip stands out. It comes and goes; two runs ago it was well covered. I'll investigate.

What version are you on? I'm using Rolling on 22.04

Also Rolling/22.04, but I have to guess that one of my deps is versioned differently? I don't think I touched anything in that costmap code path. I'll do more due diligence though. Will get back to this in the next day or two.

@mmattb
Contributor Author

mmattb commented Jan 21, 2023

It looks like the waypoint follower miss does happen sometimes outside of my branch. It's not consistent, even on my branch:

https://app.codecov.io/gh/ros-planning/navigation2/pull/3374

Likewise with oscillation.cpp:

https://app.codecov.io/gh/ros-planning/navigation2/pull/3372

Clearly the fixes I have here help, but are not complete. Shall I open issues for these other ones?

@mmattb
Contributor Author

mmattb commented Jan 21, 2023

re: costmap2d: with a fresh clone of both ros2_rolling and navigation2:main today (i.e. without my fix applied) I still hit the segfault. My best guess is that my apt-get-installed libbondcpp is different? And maybe other external deps? But I'm grasping at straws, since I didn't try alternate versions of anything else. I'm okay to back that one out if it seems strange to you, but luckily I think it's low risk.

@SteveMacenski
Member

It looks like the waypoint follower miss does happen sometimes outside of my branch. It's not consistent, even on my branch

If you're saying that run is a fluke, I'll retrigger CI and see if it goes away. I haven't noted that blip before, but if it's irregular (e.g. not common, but happens on rare occasions) then that's something I can live with, at least for now. I just retriggered CI to see.

I'm not sure about the costmap part, I use bare metal (since Nav2 development is just my day-to-day) with binaries provided by the Rolling distribution that have been updated. Maybe as part of an apt update there was something that didn't quite sit right with your workspace cache? I know the issue reported in #3364 is caused by a workspace cache becoming invalidated after updating binaries with an ABI breaking change.

I'm okay to back that one out if it seems strange to you, but luckily I think it's low risk.

I don't think it hurts anyone, I'd just remove the comment

@mmattb
Contributor Author

mmattb commented Jan 24, 2023

If you're saying that run is a fluke, I'll retrigger CI and see if it goes away. I haven't noted that blip before, but if its irregular (e.g. not common, but happens on rare occasions) then that's something I can live with, at least for now. I just retriggered CI too see to.

I think the waypoint follower coverage regression is not a fluke, but it's not related to my change AFAICT. Do we always run the same set of tests, and always the same way?

The new run on my PR doesn't show the issue: https://app.codecov.io/gh/ros-planning/navigation2/pull/3377

The runs on some recent unrelated PRs do:
https://app.codecov.io/gh/ros-planning/navigation2/pull/3374
https://app.codecov.io/gh/ros-planning/navigation2/pull/3371

Looks like with this merge the waypoint follower coverage drastically increased, from a level similar to the one seen in my PR. Seems like a moving target for some reason.
https://app.codecov.io/gh/ros-planning/navigation2/pull/3372

@SteveMacenski
Member

SteveMacenski commented Jan 24, 2023

For Theta*, I think those PRs need to be rebased. I fixed a bug in the Theta* tests and merged at the end of last week.

Looking at the other parts of that run, we're talking about changes of +/- 2 lines of code, which is within the margin of error for coverage results derived from the system tests. DWB and Costmap2D are legacy packages that did not have very good unit test coverage when they were adopted, so their coverage relies largely on system-level testing rather than unit-level function testing (unlike the rest of the packages). Scanning through the changes, they appear to be almost exclusively in those packages, and also +/- 2 lines per file that is marked as a loss.

So those diffs you see aren't jitter; those are PRs that need to be rebased for the Theta* coverage to be returned. If you rebase your PR, you should see that go away too. Whatever that jitter is, it's another problem if it was touching Waypoint Follower (which hasn't substantively changed in some time now).

@mmattb
Contributor Author

mmattb commented Jan 24, 2023

If something like this is truly necessary, I think we need to escalate this to rclcpp that we cannot get clean exits from lifecycle nodes at all (which guarantees normal nodes can't either) - this seems like a serious issue. Something seems fishy to me though that this would be necessary
Something doesn't seem quite right to me that this is a ROS 2-wide issue.

I think it's actually specific to Nav2's use of Bonds in its LifecycleNode child class. I don't think it affects all of ROS2. The bond I'm cleaning up in my PR is a Nav2 thing. But I agree the shutdown sequence seems needlessly difficult to get right. It would be better if ROS2 made it easier to drop things on the floor instead of throwing exceptions after shutdown() has been called. Maybe just e.g. logging at the WARN level. Know any ROS2 people we can tag or email to ask about this? I can present the stacks/etc.

I'll rebase and see if the issue repros both with/without composition.

@mmattb
Contributor Author

mmattb commented Jan 24, 2023

I know the issue reported in #3364 is caused by a workspace cache becoming invalidated after updating binaries with an ABI breaking change.

Could be! I wish libbondcpp was brought into my ros2_rolling clone when I did a vcs import! Maybe I need to manually build that guy and update it. That's the only stuff from the apt install I see in my stacks.

Anyway...comment removed!

@mmattb
Contributor Author

mmattb commented Jan 24, 2023

Running with composition and without my fix we see the same RCLError that the preshutdown callback addresses. At least the tests don't time out anymore! Running with composition and with my fix, the errors are indeed gone, and I don't see any more.

At any rate, I don't know why ROS2 raises exceptions for these various cases, so @SteveMacenski if you want to pursue a conversation with the ROS2 devs, I'd love a pointer to the right people! Maybe you can tag them here?

@SteveMacenski
Member

SteveMacenski commented Jan 24, 2023

I wish libbondcpp was brought into my ros2_rolling clone when I did a vcs import!

Sorry about that, our vcs file for Nav2 has all of the dependencies commented out (if you uncomment them, all should work). We have it that way since our CI uses that file to check for dependencies to build in an underlay workspace when we have temporarily un-released dependencies. That's a legacy of before ROS 2 had a build farm and binaries were available for everything (and Rolling did not exist). I left the file intact with the contents commented out in case of a "break glass" moment, and to keep tracking what we depend on above a base install.

You should be able to use binaries for all nav2 dependencies. It would probably be good to use the binaries too just in case there are outdated issues (unless you're actively trying to modify the libraries, which it sounds like you may have with Bond!). Are there changes we can add to bond instead potentially? It seems odd that our destroying bond would be the cause of an issue - just resetting a shared pointer - unless bond's destruction sequence was really the problem.

Maybe you can tag them here?

I do, but first: did you test using the new rclcpp container updates after ros2/rclcpp#2085? I don't want to poke them about getting involved if the issue was actually resolved ahead of us. If the issue is isolated to composition, then this could be it (unless you showed it happens with other non-isolated container types).


I know in this PR you went looking for the composition testing issues, but I'm curious whether the behavior server fixes are in here too? I don't see anything particular about the behavior server and the missing coverage we had there - this seems more general to composition - unless it's all the same cause?

@mmattb
Contributor Author

mmattb commented Jan 25, 2023

Are there changes we can add to bond instead potentially? It seems odd that our destroying bond would be the cause of an issue - just resetting a shared pointer - unless bond's destruction sequence was really the problem.

I haven't spelunked the Bond code to find out. It certainly seems odd that one would be constructing something in a destructor path (a ROS2 TimerBase)! Yet that is clearly what is happening. You can see the stack I pasted in the Issue.

Having said that - the planner_server issue is only partially Bond related AFAIK, but is similar in form: invalidated ROS state is still being used after ROS shutdown. That is due to a 'planner_' thread still running. That issue looks like:

[1384.354782] (nav2_system_tests) StdoutLine: {'line': b"31: [planner_server-12] terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'\n"}
[1384.355242] (nav2_system_tests) StdoutLine: {'line': b'31: [planner_server-12]   what():  failed to send response: client w    ill not receive response, at /home/mbryan/Projects/ros/ros2_rolling/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_response.cpp:149, at /home/mbryan/Projects/ros/ros2_rolling/src/ros2/rcl/rcl/src/rcl/service.c:314\n'}

Instead of taking my fix, another approach is to always wrap these node types in code blocks which will exit prior to shutdown(). Example:

int run_planner_server()
{
  auto ps = std::make_shared<nav2_planner::PlannerServer>();

  // ... configure and spin ...

  return 0;
}  // ps's dtor will fire here, before the return to main(), and therefore before shutdown

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);

  int exitcode = run_planner_server();

  rclcpp::shutdown();
  return exitcode;
}

The downside to that approach is that it reduces the flexibility in how these nodes are used. Example: the user will need to copy any information out of the instance which they want to keep prior to shutdown() being called, since the dtor will have fired by then. That's also a bit of a landmine waiting to be stepped on.

I know in this PR you went looking for the composition testing issues, but curious if there are the behavior server fixes here too? I don't see anything particular about the behavior server and the missing overage we had there - this seems more general for composition -- unless its all the same cause?

Indeed it's the same cause: this issue can in theory affect everything which inherits from nav2_util::LifecycleNode, which BehaviorServer does; that's the connection. Having said that: if an instance of the inherited class is destructed prior to shutdown() you won't see the exception (see alternate solution above).

vcs file for nav2 has all of the dependencies commented out

Good to know!

@SteveMacenski
Member

SteveMacenski commented Jan 25, 2023

It certainly seems odd that one would be constructing something in a destructor path (a ROS2 TimerBase)! Yet that is clearly what is happening.

But ideally, if we cycle the lifecycle down to deactivated before shutdown, the bond object is reset, so there shouldn't be any pointer to anything bond related when we shutdown? I understand that's an issue if we don't properly transition down the lifecycle first, but I think you said it happens either way, and I'm not clear how bond could be causing an issue in that case (except in the lifecycle manager server, which doesn't have it reset if you do that lifecycle shutdown manually - but should if you do it through the lifecycle manager via the destroyBondTimers method).

If you're saying that things work properly when transitioned down, then I'm on board. If you're saying that even in that state bond still poses an issue, I'm a little suspicious. But perhaps this is cleanup that the PR in rclcpp would resolve for us? There was some context manipulation going on there, so it could be related.

planner_ thread? Edit: Oh, that's in the system tests, not the planner server itself, got it.

Instead of taking my fix, another approach is to always wrap these node types in code blocks which will exit prior to shutdown().

We don't get that choice in components ☹️ Unless we changed the procedure in rclcpp_components, which is totally on the table.

Indeed it's the same cause

Got it!

@mmattb
Contributor Author

mmattb commented Jan 26, 2023

But ideally if we cycle the lifecycle down to deactivated before shutdown, the bond object is reset so there shouldn't be any pointer to anything bond related when we shutdown?

Maybe there is a better way to shut the node down than I'm doing, but we still need something to trigger that shutdown. Right now it's being triggered by the signal handler. ROS controls the signal handler, so we need to figure out how to trigger our shutdown sequence in the path of that signal handler. ROS provides the preshutdown callback for that. That is, the order of events, all controlled by ROS, is: signal received -> preshutdown callbacks -> rclcpp::shutdown() -> shutdown callbacks -> exit.

So: we can ask "What should we do in the preshutdown callback?" The PR takes the quickest exit to the door, which is just to tear down the aspects of the instance's state which we trip on in shutdown.

But! I'm not sure what would be cleanest here, due to lack of experience. Should we be going through the lifecycle transition instead, e.g. calling on_deactivate() and on_cleanup()? That's possible to do instead. That solution could maybe look like:

  • Keep the preshutdown callback
  • Change the preshutdown callback to call on_deactivate() and on_cleanup() (or whatever the right way to transition is)
  • Preshutdown callback continues to call destroyBond() just in case the child class didn't; that's to avoid this becoming a land mine for users.
  • PlannerServer's on_cleanup already resets planner_, so we can drop the preshutdown callback from PlannerServer if we call on_cleanup from the super class.

Would that be preferable? If so, what is the right way to trigger those transitions? Just calling the callbacks? It would maybe give a cleaner way for the child classes to clean up nicely, if they need to?

planner_ thread?

Sorry, I misremembered/typo'd. I mean the planners_ and costmap_thread_ here, which I'm now destroying in the callback: https://github.com/ros-planning/navigation2/blob/main/nav2_planner/src/planner_server.cpp#L72

@mmattb
Contributor Author

mmattb commented Jan 26, 2023

Let me know if a quick video chat helps :-D I'm in Indochina / Thailand time right now :-D

@SteveMacenski
Member

SteveMacenski commented Jan 27, 2023

Maybe there is a better way to shut the node down than I'm doing, but we still need something to trigger that shutdown. Right now it's being triggered by the signal handler

Ah, I see. Some explanation of lifecycle might be good. Lifecycle nodes have state transitions for various things (e.g. configuring, activating, deactivating, cleaning up). When you use the autostart feature of Nav2, it will automatically transition the nodes up from the unconfigured state to the active state, to process data in deterministic order. There's a complementary shutdown you can do as well, which will bring the system from active -> deactivate -> inactive -> cleanup -> unconfigured. The deactivate functions you can see in the lifecycle node implementations have the destroy-bond call which resets the pointer. If you use the lifecycle manager to do this, then the manager also has its bond timers destroyed in the complementary function in reset/shutdown/etc.

So when you're using the signal handler to kill the program, you're not using any of that; you're just killing it in the active state, and no attempt is made to cycle the system down. That's a suboptimal way to kill the servers, but honestly super common, so I'm all on board with making that exit cleanly as well. In this case, on_deactivate is not called, thus the bond stuff is still around, thus the non-clean problems we've noted. If you shut things down through the lifecycle transitions before hitting control+C, then on_deactivate should be called and the bond stuff should all be reset to null pointers. This is where I would expect you to find no issue at all - similar to if you were to hit control+C on a server that you executed on the command line without cycling the lifecycle up for the bond stuff to be set up in on_activate in the first place.

So... that leads us to the question of making sure you do indeed see, on your side, that when you transition down the lifecycle nodes, or don't try to activate the server(s) at all, you don't see those bond issues (since in both situations, bond_ should be a nullptr). If you see that all working out well, and it's only on the un-lifecycle-shutdown program exits that we see the problem, that is what I would expect given the rest of the conversation we've been having.


So on top of that understanding: if we find that what I say above is true, and our goal is to have clean shutdowns from the active state of the lifecycle nodes (for, you know... the 80% of users that I suspect do that because it's easy), that's where this pre-shutdown stuff can be helpful - especially if we can essentially hide it from the users in our nav2_util::LifecycleNode class without impacting the servers' host code. You mention we can just call on_deactivate and on_cleanup there, which are both already virtual methods of the lifecycle node. That sounds like a perfect solution, but maybe also check the state of the lifecycle node to see if that's even necessary (e.g. if the user did properly shut down the lifecycle nodes, then this wouldn't need to do anything). We can do that via get_current_state to check whether we're in the finalized or unconfigured states.

If we did that, we also wouldn't need to inject any code in the planner server since the planners_.clear() is in the cleanup stage as well https://github.com/ros-planning/navigation2/blob/562d9ac7c25256731a937adb17dffe278b3682db/nav2_planner/src/planner_server.cpp#L240. Though, it looks like we missed the costmap thread one, so we'll have to add 1 line to on_cleanup to do that as well 😅

The Intel folks who wrote Nav2 initially with me and I apparently did not know that the destructors wouldn't properly clean up resources when control+C was pressed, as you have found. With this understanding and use of the pre-shutdown callbacks, we can just make all the destructors default, since no matter what, we'll end up lifecycling the nodes down before the context is deleted.


Lots of words:

  • I think we have a misunderstanding on the nature of lifecycle, try that test out and make sure you see what I think you should see
  • If it works, then we can use your proposed pre-shutdown callbacks to properly handle the lifecycle transitions for the user if they call control+C without bringing the lifecycle state down. If we check the current state, we can also make sure we only call the deactivation / cleanup as properly needed in case the server isn't active yet anyway.
  • That should (?) resolve the issues

It doesn't resolve the fact that bond doesn't destruct cleanly. If we can make a basic unit test for that to figure out why, and there's a solution, it would be good to actually just fix bond too. The destructor looks like it's doing a lot of lifting, which doesn't seem appropriate or, apparently, functional. Maybe it makes sense to use that pre-shutdown callback API there as well.

@mmattb
Contributor Author

mmattb commented Jan 27, 2023

That's great info @SteveMacenski ! Helps a bunch. So I tried a deactivate then cleanup via the preshutdown callbacks today, and hit some deadlock deep in the publisher on_deactivate. It's been pretty tough to debug. Specifically we hang here:

https://github.com/ros-planning/navigation2/blob/main/nav2_costmap_2d/src/costmap_2d_ros.cpp#L302

That is, in PlannerServer (also ControllerServer), we call on_deactivate on a costmap handle, which then calls on_deactivate on the publisher at the line above. At that point we hang.

And if you comment out that line, we hang on the one after it. So it smells like there is a deadlock involving a pub on_deactivate while in the preshutdown context. It's possible that it's just slow, not a deadlock, and we eventually hit a timeout. Anyway, so it's smelling like this approach won't work without some more work... The ROS shutdown sequence has turned out to have quite a few land mines to step on. I'll spend more time to see if I can find the specific deadlock / slow thing, but it smells like it will be a poor use of time.

@ros-navigation ros-navigation deleted a comment from mergify bot Jan 27, 2023
@SteveMacenski
Member

SteveMacenski commented Jan 27, 2023

I'm going to make an educated guess that it's because the footprint publisher is used in this thread https://github.com/ros-planning/navigation2/blob/main/nav2_costmap_2d/src/costmap_2d_ros.cpp#L450 which isn't shut down before the publisher is deactivated - though you didn't show me the trace, so I'm just guessing. I don't know why that line would be special otherwise.

Seems like https://github.com/ros-planning/navigation2/blob/main/nav2_costmap_2d/src/costmap_2d_ros.cpp#L309-L313 should come before the publisher shutdown, since that's a resource shared with the thread, and the thread needs to be shut down first. A reorder might do the trick. I'm really embarrassed by these issues, I can't believe how badly I've mucked up the shutdown procedures 🤦

The next line you mention is also used in that thread, so it would make sense to see the same problem there.

@mmattb
Contributor Author

mmattb commented Jan 28, 2023

Experimenting with a fix along those lines. There are some additional checks I need to do in a few places so that the ROS2 lifecycle_node code isn't so noisy about deactivations that we shouldn't be taking. This solution leveraging deactivate and cleanup is likely to be a lot more complex, and will require some special casing around state that we aren't doing today. You can see some examples I've just added: extra 'if' statements in some places. There will be more before I'm done.

@mmattb
Contributor Author

mmattb commented Jan 29, 2023

Seems okay now? So this is my second attempt that leverages the lifecycle transitions. It seems to work. There is still an intermittent failure involving publishers doing something during the shutdown process. I haven't chased after that one because it seems we are getting good coverage reliably. How does it look?

Member

@SteveMacenski SteveMacenski left a comment

Generally no concerns, beyond duplication of some elements that I'd like to have only a single copy of, for clarity, if it's not required.

This all looks like a great solution to me!

nav2_controller/src/controller_server.cpp
this->cleanup();
}

if (rcl_preshutdown_cb_handle_) {
Member

I was curious about this line - why is it required to remove the callback?

Contributor Author

@mmattb mmattb Jan 31, 2023

See prior response. This is to prevent a use after free in the case that the callback is called after the destructor fires. At that point the instance has evaporated, so there is no 'this' to bind the call to -> seg fault. I found this out the hard way :)

Btw any races b/w the unregister and a callback firing are handled by a lock in the preshutdown callback code.

Member

This is to prevent a use after free in the case that the callback is called after the destructor fires.

... the node doesn't remove callbacks after the node that created them is out of scope?! At the very least, there should be a check that the object is even still valid in the program lifecycle. Seems like a low-hanging contribution to rclcpp - but I understand now why you did this here, and all good.

Contributor Author

Right; somebody would need to add callback handle ptr members to the base Node class, and unregister the callbacks in the Node dtor. That would be a ROS2 request :)

Member

@SteveMacenski SteveMacenski Feb 1, 2023

Can you open that ticket on rclcpp and the bond not shutting down on context shutdown cleanly and CC me?

Regardless, all this is good and can stay, but this should be made public so that someone can fix it.

Member

(that someone probably ultimately to be me in like 6 months)

Contributor Author

Sure! To me that smells like a libbondcpp bug as a first pass; i.e. why the heck are we constructing things in the destructor path? I wasn't able to CC you for some reason, but here it is: ros/bond_core#92

I additionally filed a bug for us to catch process deaths in the system tests, so we don't need to rely on declining code coverage to find future issues: #3391

Member

Can you file the "there is no 'this' to bind the call to -> seg fault" ticket as well, for the shutdown callbacks not checking whether the node is even still alive? I can escalate the Bond + 2 rclcpp issues in the weekly ROS 2 core libs meeting (though it looks like Michael's already seen the 2 you filed).

nav2_util/src/lifecycle_node.cpp (outdated review thread, resolved)
mbryan added 5 commits February 1, 2023 13:04
Ensure that we break bonds in lifecycle node/manager prior to rcl_shutdown.

Partial fix for Issue ros-navigation#3271

Signed-off-by: mbryan <matthew.bryan@dell.com>
For Issue ros-navigation#3271.

Signed-off-by: mbryan <matthew.bryan@dell.com>
Needed to ensure code coverage flushes. See Issue ros-navigation#3271.

Signed-off-by: mbryan <matthew.bryan@dell.com>
One of several clean shutdown issues being addressed via Issue ros-navigation#3271.

Signed-off-by: mbryan <matthew.bryan@dell.com>
Signed-off-by: mbryan <matthew.bryan@dell.com>
@SteveMacenski SteveMacenski merged commit 4aaefad into ros-navigation:main Feb 1, 2023
@SteveMacenski
Member

Awesome work, I really appreciate your detailed look and attention here! If you can file that last ticket on the shutdown callbacks not checking the node is still in existence that would be appreciated to button up any additional issues to expose back to the core libraries.

If you're open to working on another project or fishing for some other issues you ran into during this excursion, let me know! I'm happy for any assistance you're open to and would enjoy the puzzle to contribute.

@mmattb
Contributor Author

mmattb commented Feb 4, 2023

Awesome work, I really appreciate your detailed look and attention here! If you can file that last ticket on the shutdown callbacks not checking the node is still in existence that would be appreciated to button up any additional issues to expose back to the core libraries.

If you're open to working on another project or fishing for some other issues you ran into during this excursion, let me know! I'm happy for any assistance you're open to and would enjoy the puzzle to contribute.

Thanks a bunch Steve. It was great to jump in on it. I have a paper to finish editing in the next month, then maybe I can jump back in (and start my job search again). Any leads on Dev positions that don't require robotics expertise, let me know:-)

@SteveMacenski
Member

What kind of dev positions? I can see if I know anyone. They're all going to be robotics companies but most jobs at a robotics company aren't straight up robotics algorithm development.

@mmattb
Contributor Author

mmattb commented Feb 12, 2023

What kind of dev positions? I can see if I know anyone. They're all going to be robotics companies but most jobs at a robotics company aren't straight up robotics algorithm development.

Shot you an email @SteveMacenski !

andrewlycas pushed a commit to StratomInc/navigation2 that referenced this pull request Feb 23, 2023
Address code cov loss due to unclean shutdown (issue ros-navigation#3271) (ros-navigation#3377)

* Ensure that we break bonds in lifecycle node/manager prior to rcl_shutdown.

Partial fix for Issue ros-navigation#3271

Signed-off-by: mbryan <matthew.bryan@dell.com>

* Add unit tests for lifecycle_node rcl shutdown cb.

For Issue ros-navigation#3271.

Signed-off-by: mbryan <matthew.bryan@dell.com>

* Ensure planner_server shuts down cleanly on SIGINT.

Needed to ensure code coverage flushes. See Issue ros-navigation#3271.

Signed-off-by: mbryan <matthew.bryan@dell.com>

* Ensure costmap_2d_cloud does not seg fault on SIGINT.

One of several clean shutdown issues being addressed via Issue ros-navigation#3271.

Signed-off-by: mbryan <matthew.bryan@dell.com>

* Fix flake8 issues in some test files

Signed-off-by: mbryan <matthew.bryan@dell.com>

---------

Signed-off-by: mbryan <matthew.bryan@dell.com>
Co-authored-by: mbryan <matthew.bryan@dell.com>
@SteveMacenski SteveMacenski mentioned this pull request Nov 20, 2023
7 tasks
savalena pushed a commit to savalena/navigation2 that referenced this pull request Jul 5, 2024