Address code cov loss due to unclean shutdown (issue #3271) #3377
Conversation
@mmattb, please properly fill in the PR template in the future. @SteveMacenski, use this instead.
I don't have permissions to set labels on this PR, but anyway please consider this one a WIP. I'm working out the uncrustification, and I'll take a stab at one more shutdown issue that I see once in a blue moon.
Mixed results. Some tests failed. Are these the intermittent failures you noted in the Issue, @SteveMacenski? If not, I have some chasing to do...
What version are you on? I'm using Rolling on 22.04.
The Theta* test failure is something that popped up very recently this year. I'll hopefully get to fixing that today. Looking at the codecov results from this, there is a ton of code now no longer covered - which is unusual. It's not unusual for our coverage to drift a few tenths of a percent from run to run, but this is pretty significant, with files losing ~60%+ of their coverage metrics.
Good to know. I think that waypoint follower blip stands out. It comes and goes; two runs ago it was well covered. I'll investigate.
Also Rolling/22.04, but I have to guess that one of my deps is versioned differently? Because I didn't touch anything in that costmap code path, I don't think. I'll do more due diligence though. Will get back to this in the next day or two.
It looks like the waypoint follower miss does happen sometimes outside of my branch. It's not consistent, even on my branch: https://app.codecov.io/gh/ros-planning/navigation2/pull/3374 Likewise with oscillation.cpp: https://app.codecov.io/gh/ros-planning/navigation2/pull/3372 Clearly the fixes I have here help, but are not complete. Shall I open issues for these other ones?
re: costmap2d: with a fresh clone of both ros2_rolling and navigation2:main today (i.e. without my fix applied) I still hit the segfault. My best guess is that my apt-get installed libbondcpp is different? And maybe other external deps? But I'm grasping at straws, since I didn't try alternate versions of anything else. I'm okay to back that one out if it seems strange to you, but luckily I think it's low risk.
If you're saying that run is a fluke, I'll retrigger CI and see if it goes away. I haven't noted that blip before, but if it's irregular (e.g. not common, but happens on rare occasions) then that's something I can live with, at least for now. I just retriggered CI to see. I'm not sure about the costmap part; I use bare metal (since Nav2 development is just my day-to-day) with binaries provided by the Rolling distribution that have been updated. Maybe as part of an apt update there was something that didn't quite sit right with your workspace cache? I know the issue reported in #3364 is caused by a workspace cache becoming invalidated after updating binaries with an ABI breaking change.
I don't think it hurts anyone; I'd just remove the comment.
I think the waypoint follower coverage regression is not a fluke, but it's not related to my change AFAICT. Do we always run the same set of tests, and always the same way? The new run on my PR doesn't show the issue: https://app.codecov.io/gh/ros-planning/navigation2/pull/3377 The runs on some recent unrelated PRs do. It looks like with this merge the waypoint follower coverage drastically increased, from a level similar to the one seen in my PR. Seems like a moving target for some reason.
For Theta*, I think those PRs need to be rebased. I fixed a bug in the Theta* tests and merged at the end of last week. Looking at the other parts of that run, we're talking about changes of +/- 2 lines of code, which is within the margin of error for the coverage results due to the system tests. DWB and Costmap2D are legacy packages that did not have very good unit test coverage when they were adopted, and their coverage rests largely on system-level testing rather than unit-level function testing (unlike the rest of the packages). Scanning through the changes, they appear to be almost exclusively in those packages, and also +/- 2 lines per file that is marked as a loss. So those diffs you see aren't jitter; those are PRs that need to be rebased for the Theta* coverage to be returned. If you rebase your PR, you should see that go away too. Whatever that jitter is, it's another problem, if it was touching Waypoint Follower (which hasn't substantively changed in some time now).
I think it's actually specific to Nav2's use of Bonds in its LifecycleNode child class. I don't think it affects all of ROS 2. The bond I'm cleaning up in my PR is a Nav2 thing. But I agree the shutdown sequence seems needlessly difficult to get right. It would be better if ROS 2 made it easier to drop things on the floor instead of throwing exceptions after shutdown. I'll rebase and see if the issue repros both with/without composition.
Could be! I wish libbondcpp was brought into my ros2_rolling clone when I did a vcs import! Maybe I need to manually build that guy and update it. That's the only stuff from the apt install I see in my stacks. Anyway...comment removed! |
Running with composition and without my fix, we see the same RCLError that the preshutdown callback addresses. At least the tests don't time out anymore! Running with composition and with my fix, the errors are indeed gone, and I don't see any others. At any rate, I don't know why ROS 2 raises exceptions in these various cases, so @SteveMacenski, if you want to pursue a conversation with the ROS 2 devs, I'd love a pointer to the right people! Maybe you can tag them here?
Sorry about that; our vcs file for nav2 has all of the dependencies commented out (if you uncomment them, all should work). We have it that way since our CI uses that file to check for dependencies to build in an underlay workspace when we have temporarily un-released dependencies. That's a legacy of before ROS 2 had a build farm and binaries were available for everything (and Rolling did not exist). I left the file intact with the contents commented out in case of a "break glass" moment, and to keep tracking what we depend on above a base install. You should be able to use binaries for all Nav2 dependencies. It would probably be good to use the binaries too, just in case there are out-of-date issues (unless you're actively trying to modify the libraries, which it sounds like you may be with Bond!). Are there changes we can add to bond instead, potentially? It seems odd that our destroying bond would be the cause of an issue - just resetting a shared pointer - unless bond's destruction sequence was really the problem.
I do, but first: did you test using the new rclcpp container updates after ros2/rclcpp#2085? I don't want to poke them about getting involved if the issue was actually resolved ahead of us. If the issue is isolated to composition, then this could be it (unless you showed it happens with other non-isolated container types). I know in this PR you went looking for the composition testing issues, but I'm curious whether the behavior server fixes are in here too? I don't see anything particular about the behavior server and the missing coverage we had there - this seems more general to composition -- unless it's all the same cause?
I haven't spelunked the Bond code to find out. It certainly seems odd that one would be constructing something (a ROS 2 TimerBase) in a destructor path! Yet that is clearly what is happening; you can see the stack I pasted in the Issue. Having said that, the planner_server issue is only partially Bond related AFAIK, but is similar in form: invalidated ROS state is still being used after ROS shutdown. That is due to a 'planner_' thread still running. That issue looks like:
Instead of taking my fix, another approach is to always wrap these node types in code blocks which will exit prior to shutdown(). Example:
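A minimal sketch of that pattern (my illustration, not code from the repo; I'm assuming nav2_planner::PlannerServer as the node type). The inner scope forces the node's destructor to run before the explicit rclcpp::shutdown() call:

```cpp
#include <memory>

#include "nav2_planner/planner_server.hpp"
#include "rclcpp/rclcpp.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  {
    // The node lives only inside this block...
    auto node = std::make_shared<nav2_planner::PlannerServer>();
    rclcpp::spin(node->get_node_base_interface());
  }  // ...so its destructor (including Bond teardown) fires here.
  rclcpp::shutdown();  // The explicit shutdown happens after the dtor has run.
  return 0;
}
```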
The downside to that approach is that it reduces the flexibility in how these nodes are used. For example, the user will need to copy out of the instance any information they want to keep prior to shutdown() being called, since the dtor will have fired by then. That's also a bit of a landmine waiting to be stepped on.
Indeed it's the same cause: this issue can in theory affect everything which inherits from nav2_util::LifecycleNode.
Good to know!
But ideally, if we cycle the lifecycle down to deactivated before shutdown, the bond is already destroyed. If you said that when transitioning down things work properly, then I'm on board. If you're saying even in that state bond still poses an issue, I'm a little suspicious. But perhaps this is cleanup that the PR in rclcpp would resolve for us?
We don't get that choice in components.
Got it!
Maybe there is a better way to shut the node down than what I'm doing, but we still need something to trigger that shutdown. Right now it's being triggered by the signal handler. ROS controls the signal handler, and so we need to figure out how to trigger our shutdown sequence in the path of that signal handler. ROS provides the preshutdown callback for that. That is, the order of events, all controlled by ROS, is: signal received -> preshutdown callbacks -> rclcpp::shutdown() -> shutdown callbacks -> exit. So we can ask: "What should we do in the preshutdown callback?" The PR takes the quickest exit to the door, which is just to tear down the aspects of the instance's state which we trip on in shutdown. But! I'm not sure what would be cleanest here, due to lack of experience. Should we be going through the lifecycle transitions instead, e.g. calling deactivate() and cleanup()?
Would that be preferable? If so, what is the right way to trigger those transitions? Just calling the callbacks? It would maybe give a cleaner way for the child classes to clean up nicely, if they need to?
Sorry, I misremembered/typo'd. I mean the
Let me know if a quick video chat helps :-D I'm in Indochina / Thailand time right now :-D
Ah, I see. Some explanation of lifecycle might be good. Lifecycle nodes have state transitions for various things (e.g. configuring, activating, deactivating, cleaning up). When you use the autostart feature of Nav2, it will automatically transition the nodes up from the unconfigured state to the active state to process data in a deterministic order. There's a complementary shutdown you can do as well, which will bring the system from active -> deactivate -> inactive -> cleanup -> unconfigured. The deactivate function you can see in the lifecycle node implementations has the destroy-bond function which resets the pointer. If you use the lifecycle manager to do this, then the manager also has its bond timers destroyed in the complementary function in reset/shutdown/etc.

So when you're using the signal handler to kill the program, you're not using any of that; you're just killing it in the active state, and no attempt is made to cycle the system down. That's a suboptimal way to kill the servers, but honestly super common, so I'm all on board with making that exit cleanly as well. In this case, the destroy-bond function never runs, so the bond is still alive when the context goes away.

So... that leads us to the question of making sure that you do indeed see on your side that when you transition down the lifecycle nodes, or don't try to activate the server(s) at all, you don't see those bond issues (since in both situations, the bond has either already been destroyed or was never created).

On top of that understanding, if we find that what I say above is true and our goal is to have clean shutdowns from the active state of the lifecycle nodes (for, you know, the 80% of users that I suspect do that because it's easy), that's where this pre-shutdown stuff can be helpful -- especially if we can essentially hide it from the users in our LifecycleNode wrapper. If we did that, we also wouldn't need to inject any code in the planner server, since the pre-shutdown callback would lifecycle it down for us.

Myself and the Intel guys that wrote Nav2 initially with me apparently did not know that the destructors wouldn't properly clean up resources when Ctrl+C was called, as you have found. With this understanding and use of the pre-shutdown callbacks, we can just make all the destructors default since, no matter what, we'll end up lifecycling the nodes down before the context is deleted. Lots of words, I know!
It doesn't resolve the fact that bond doesn't destruct cleanly. If we can make a basic unit test for that to figure out why, and there's a solution, it would be good to actually just fix bond too. The destructor looks like it's doing a lot of lifting, which doesn't seem appropriate or apparently functional. Maybe it makes sense to use that pre-shutdown callback API there as well.
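For reference, the bond teardown described above amounts (per my reading of the Nav2 sources; treat this as a sketch, not a verbatim copy) to a guarded pointer reset, so the Bond destructor runs while the ROS context is still valid:

```cpp
// In nav2_util::LifecycleNode: resetting the shared pointer destroys the
// bond now, during the lifecycle transition, rather than during node
// destruction after rclcpp::shutdown() has invalidated the context.
void LifecycleNode::destroyBond()
{
  RCLCPP_INFO(get_logger(), "Destroying bond (%s) to lifecycle manager.", this->get_name());
  if (bond_) {
    bond_.reset();
  }
}
```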
That's great info, @SteveMacenski! Helps a bunch. So I tried a deactivate-then-cleanup via the preshutdown callbacks today, and hit some deadlock deep in the publisher on_deactivate. It's been pretty tough to debug. Specifically, we hang here: in PlannerServer (also ControllerServer), we call on_deactivate on a costmap handle, which then calls on_deactivate on the publisher at the line above. At that point we hang. And if you comment out that line, we hang on the one after it. So it smells like there is a deadlock involving a publisher on_deactivate while in the preshutdown context. It's possible that it's just slow, not a deadlock, and we eventually hit a timeout. Anyway, it's smelling like this approach won't work without some more work... The ROS shutdown sequence has turned out to have quite a few land mines to step on. I'll spend more time to see if I can find the specific deadlock / slowness, but it smells like it will be a poor use of time.
I'm going to make an educated guess that it's because the footprint publisher is used in this thread https://github.com/ros-planning/navigation2/blob/main/nav2_costmap_2d/src/costmap_2d_ros.cpp#L450 which isn't shut down before the publisher is deactivated - though you didn't show me the trace, so I'm just guessing. Though I don't know why that would be special otherwise. Seems like https://github.com/ros-planning/navigation2/blob/main/nav2_costmap_2d/src/costmap_2d_ros.cpp#L309-L313 should come before the publisher shutdowns, since that's a shared resource with the thread, and the thread needs to be shut down first. A reorder might do the trick. I'm really embarrassed by these issues; I can't believe how badly I've mucked up the shutdown procedures 🤦 The next line you mention is also used in that thread, so that would make sense to see the same problem.
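If that guess is right, the reorder might look roughly like this (member names taken from the linked costmap_2d_ros.cpp; the body is my sketch, not the merged diff):

```cpp
// Join the map update thread before deactivating the publishers it uses,
// so on_deactivate() can't block on a publisher another thread still holds.
nav2_util::CallbackReturn
Costmap2DROS::on_deactivate(const rclcpp_lifecycle::State & /*state*/)
{
  stop();

  // Shut down the shared-resource user first: the update thread.
  map_update_thread_shutdown_ = true;
  if (map_update_thread_->joinable()) {
    map_update_thread_->join();
  }

  // Only now deactivate the publishers that thread was using.
  footprint_pub_->on_deactivate();
  costmap_publisher_->on_deactivate();

  return nav2_util::CallbackReturn::SUCCESS;
}
```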
Experimenting with a fix along those lines. There are some additional checks I need to do in a few places, so that the ROS 2 lifecycle_node code isn't so noisy about deactivations that we shouldn't be taking. This solution leveraging deactivate and cleanup is likely to be a lot more complex, and will require some special casing around state that we aren't doing today. You can see some examples I've just added: extra 'if' statements in some places, as sketched below. There will be more before I'm done.
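For illustration, the guards might look something like this (a hypothetical free-function sketch; cycle_down_if_needed is my name, and the real checks live inside Nav2's node classes):

```cpp
#include "lifecycle_msgs/msg/state.hpp"
#include "rclcpp_lifecycle/lifecycle_node.hpp"

// Only request transitions that are legal from the current state, so the
// lifecycle machinery doesn't log errors about illegal transition requests.
void cycle_down_if_needed(rclcpp_lifecycle::LifecycleNode & node)
{
  using lifecycle_msgs::msg::State;

  if (node.get_current_state().id() == State::PRIMARY_STATE_ACTIVE) {
    node.deactivate();  // active -> inactive
  }
  if (node.get_current_state().id() == State::PRIMARY_STATE_INACTIVE) {
    node.cleanup();  // inactive -> unconfigured
  }
}
```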
Seems okay now? So this is my second attempt that leverages the lifecycle transitions. It seems to work. There is still an intermittent failure involving publishers doing something during the shutdown process. I haven't chased after that one because it seems we are getting good coverage reliably. How does it look?
Generally no concerns, beyond duplication of some elements that I'd like to have only a single copy of for clarity, if it's not required.
This all looks like a great solution to me!
this->cleanup();
}

if (rcl_preshutdown_cb_handle_) {
I was curious about this line: why is it required to remove the callback?
See prior response. This is to prevent a use after free in the case that the callback is called after the destructor fires. At that point the instance has evaporated, so there is no 'this' to bind the call to -> seg fault. I found this out the hard way :)
Btw any races b/w the unregister and a callback firing are handled by a lock in the preshutdown callback code.
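To make the pairing concrete, here's a minimal sketch of the pattern (MyNode and tear_down are illustrative names, not the PR's code), using rclcpp's pre-shutdown callback API:

```cpp
#include "rclcpp/rclcpp.hpp"

class MyNode : public rclcpp::Node
{
public:
  MyNode()
  : rclcpp::Node("my_node")
  {
    // Runs before rclcpp::shutdown() invalidates ROS state.
    preshutdown_handle_ =
      get_node_base_interface()->get_context()->add_pre_shutdown_callback(
      [this]() {tear_down();});
  }

  ~MyNode() override
  {
    // Unregister first: a shutdown arriving after this dtor would otherwise
    // invoke the callback on a freed `this`.
    get_node_base_interface()->get_context()->remove_pre_shutdown_callback(
      preshutdown_handle_);
  }

private:
  void tear_down()
  {
    // Drop bonds, join worker threads, etc., while the context is still valid.
  }

  rclcpp::PreShutdownCallbackHandle preshutdown_handle_;
};
```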
This is to prevent a use after free in the case that the callback is called after the destructor fires.
... the context doesn't remove callbacks after the node that created them goes out of scope?! At the very least, there should be a check whether that object is even still valid in the program lifecycle. Seems like a low-hanging contribution to rclcpp - but I understand why you did this here now, and all good.
Right; somebody would need to add callback handle ptr members to the base Node class, and unregister the callbacks in the Node dtor. That would be a ROS2 request :)
Can you open that ticket on rclcpp, and another for bond not shutting down cleanly on context shutdown, and CC me? Regardless, all this is good and can stay, but this should be made public so that someone can fix it.
(that someone will probably ultimately be me in like 6 months)
Sure! To me that smells like a libbondcpp bug as a first pass; i.e. why the heck are we constructing things in the destructor path? I wasn't able to CC you for some reason, but here it is: ros/bond_core#92
I additionally filed a bug for us to catch process deaths in the system tests, so we don't need to rely on declining code coverage to find future issues: #3391
Can you file the "there is no 'this' to bind the call to -> seg fault" ticket as well, for the shutdown callbacks not checking if the node is even still alive? I can escalate the Bond + 2 rclcpp issues in the weekly ROS 2 core libs meeting (though it looks like Michael's already seen the 2 you filed).
Awesome work, I really appreciate your detailed look and attention here! If you can file that last ticket, on the shutdown callbacks not checking that the node still exists, that would be appreciated - it buttons up the remaining issues to expose back to the core libraries. If you're open to working on another project, or fishing for some other issues you ran into during this excursion, let me know! I'm happy for any assistance you're open to giving, and I suspect you'd enjoy the puzzle.
Thanks a bunch, Steve. It was great to jump in on it. I have a paper to finish editing in the next month; then maybe I can jump back in (and start my job search again). Any leads on dev positions that don't require robotics expertise, let me know :-)
What kind of dev positions? I can see if I know anyone. They're all going to be robotics companies but most jobs at a robotics company aren't straight up robotics algorithm development. |
Shot you an email, @SteveMacenski!
(ros-navigation#3377)

* Ensure that we break bonds in lifecycle node/manager prior to rcl_shutdown. Partial fix for Issue ros-navigation#3271
* Add unit tests for lifecycle_node rcl shutdown cb. For Issue ros-navigation#3271.
* Ensure planner_server shuts down cleanly on SIGINT. Needed to ensure code coverage flushes. See Issue ros-navigation#3271.
* Ensure costmap_2d_cloud does not seg fault on SIGINT. One of several clean shutdown issues being addressed via Issue ros-navigation#3271.
* Fix flake8 issues in some test files

Signed-off-by: mbryan <matthew.bryan@dell.com>
Co-authored-by: mbryan <matthew.bryan@dell.com>
Basic Info
Description of contribution in a few bullet points
In brief: unclean shutdown of several processes resulted in code coverage data being dropped on the floor, rather than flushed to the .gcda files. So to get the code coverage data, we fix the unclean shutdowns. There are a few instances fixed in this PR, but likely not all (see Future work below).
The gist of all these fixes is that rclcpp::shutdown() invalidates a bunch of state, and if something attempts to touch that state again after the shutdown(), exceptions and seg faults can occur, resulting in unclean shutdown. rclcpp provides a preshutdown callback mechanism for use cases like these. The fixes are:
- lifecycle_node gets a preshutdown callback which tosses bond_. If a lifecycle_node is destructed after shutdown() while the bond_ still exists, the Bond destructor will fire, which touches invalidated state. The result is an RCLError in the destructor code path. The fix is to drop the bond_ prior to shutdown, i.e. in the preshutdown callback.
- planner_server now shuts down cleanly on SIGINT, so its still-running 'planner_' thread stops touching invalidated ROS state after shutdown(); this is needed for the coverage data to flush.
- nav2_costmap_2d_cloud main() is fixed by manually destructing some global state prior to shutdown(). This one could not be reproduced on @SteveMacenski's machine, so it may be version dependent? Not sure.

The first two of these appeared in the circleCI output during code cov runs, as well as in my local sandbox, and fixing them drastically improved code coverage on the Nav2 plugins. I have not spotted the 3rd of these in the circleCI logs (though I haven't tried very hard), but it's low enough risk to justify, reasoning that this is probably version dependent somehow.
Description of documentation updates required from your changes
None, unless we want to give an example of a preshutdown callback?
Future work that may be required in bullet points
For Maintainers: