Codecov metrics for Assisted Teleop / Drive on Heading behaviors not being picked up #3271
Just a small hunch, I think by default …
No, the system tests call the behavior directly.
I was playing around this afternoon in #3305. Forward declaring the template class did not help (I suspected something odd was happening with the template type patterns), nor did matching the public/private functions + destructor file locations (not that I thought that would help, but it was one of the few things different between the 2 failing plugins and the others). The files look to me to be exactly the same otherwise. As for the system tests, the only difference I can find is that drive on heading uses …
Thanks to Steve for pointing this issue out to me on the discussion forums. I'd love to help out, since I've touched code coverage mechanics before in other contexts. Can somebody point me to some materials I can use to get bootstrapped on this? E.g. what packages and what tests are involved, how do I install and run them, what does your workflow for gathering the code cov info look like? Meanwhile, I'll install nav2 and try to play with it some in the next couple of days.
Hi, Oh my god, what a savior! Let me start with some context and resources, then I'll answer your specifics.

So this all started when we moved to 22.04/Humble (and Rolling's cycle to 22.04 too) and noticed that our coverage dropped significantly, due to missing coverage from what seemed to be most (if not all) of the Nav2 System Tests. The unit testing in each package seemed to be fine. We didn't really understand why. I thrashed around a bit and found that if I tested the stack with composed bringup (all of the servers in one process), most of the missing coverage came back.

After that, I noticed that basically the only thing I could easily identify as still missing was in the behavior server, namely the assisted teleop and drive on heading plugins, which oddly enough are also the 2 newest plugins, by years. We have system tests which cover them, but for some reason some or many of the functions being called from them aren't getting picked up. I tried a bunch of stuff, but after 3 days I ran out of time and had to move on to complete my 2022 goals; my plan of action for continuing is above: try a few desperate last-ditch things to see if I can identify where the issue lies (in the behavior package or the system tests).

Resources I've been using:
In addition to the resources above, just the usual build / install instructions for Nav2/ROS2: install Rolling binaries on 22.04 + Nav2 from source in a workspace. These are some instructions for executing individual tests and generating a report using lcov: ros-navigation/docs.nav2.org#88 (comment). Though as I said, I never actually tried using that for this particular issue (yet), so I don't actually know whether that report would be any different.
Ah I see. I've seen a bunch of things which can look like this. Metrics not flushing atexit seemed like the most common one, which indeed could indicate something strange with the process unwind/atexit/exit. Another had to do with weird dynamic linking details. Another is the code coverage tool not adapting to certain optimizations, macros, or code gen, but this code on its face looks too vanilla to be running afoul of those.

I'm working on getting the tests running on my local machine. My intuition is to grep the symbols in the binary and the codecov microdata to bisect the problem. I might end up tinkering with the shutdown/killing process since that does seem like a good starting point. When somebody doesn't have control over a strange shutdown sequence, they may sometimes conditionally compile in manual flushes in some places. That's a big hammer, but in our case it could give us a way to diagnose the problem. StackOverflow example.

Who can help me get to the bottom of why the tests just spin forever around setting the initial pose? Any hints? I was able to invoke the unit tests using the tutorial, and to get code cov data from those (great tutorial!). It seems that the system tests are invoked using some Python. I stripped the command out so I can run the test directly, but it just spins on the initial pose. I double checked all the paths in the arguments and they seem to point to real things. Before this I built the production and test code using the code cov flags from the tutorial. Help?
I'll be on a train all day tomorrow; will try to pick this up later.
Something to note too, looking at the code, is that the DriveOnHeading and BackUp behaviors share a ton of code (BackUp is derived from DriveOnHeading). Of the duplicated code that exists, it's mostly there to enforce protections in the backup version, so an apples-to-apples comparison of them is possible, and as far as I can tell they are exactly the same. Yet one reports its coverage results and the other is missing some.
That sounds like Gazebo might not be running. It might be worth running those tests via CMake instead of invoking them manually, but that's mostly my "I don't really know test/coverage tools that well, so don't break what works" speaking.
Yes, many are based on launch files, with a test script and a setup script (tester and the nav stack bringup). The launch API has a test API so that once the tester exits, it takes down the setup as well (e.g. nav stack). We use that for the system tests to launch the nav stack, or a part of it we're interested in, alongside a tester.py script.
I was able to recreate the issue today, finally. Took some setup time :) I can reproduce the same basic cov numbers you linked me to, using:
Adding sleeps to TearDownTestCase in the relevant system tests didn't seem to give gcov a chance to flush. Not sure if it flushes periodically or at process exit only. Next I'll try a manual __gcov_flush if it's straightforward, which will bisect the problem nicely (a rough sketch of what I mean is below).
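For reference, a conditionally compiled manual flush might look roughly like the sketch below. This is a generic example, not Nav2 code: the guard macro `NAV2_MANUAL_GCOV_FLUSH` is made up for illustration, and the gcov symbol name varies by compiler version (`__gcov_flush()` before GCC 11, `__gcov_dump()` from GCC 11 onward, which is what 22.04 ships).

```cpp
// Sketch only: manually dump gcov counters without waiting for atexit().
// NAV2_MANUAL_GCOV_FLUSH is a hypothetical macro name for this example.
#ifdef NAV2_MANUAL_GCOV_FLUSH
extern "C" void __gcov_dump(void);  // provided by the gcov runtime (GCC 11+)
#endif

inline void maybeFlushCoverage()
{
#ifdef NAV2_MANUAL_GCOV_FLUSH
  // Writes the .gcda counter files for everything covered so far,
  // even if the process later dies without running its exit handlers.
  __gcov_dump();
#endif
}
```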
Great! I can't say how much I appreciate your work here. This is much, much deeper into gcov/lcov than I would have been able to go.
Returning to this today... I think I've effectively proven that the gcov dump process, triggered atexit by __gcov_exit(), is not running as it should. So Steve's theory seems correct. Using onRun() as a reference, I can add __gcov_exit there, which effectively guarantees we see the onRun call in progress (see screenshot below). Using log lines trivially proves the code is still being covered.

I don't really understand the ros2/launch or nav2 launch stuff yet; I just glanced at it briefly. Likewise with ROS node signal handlers etc. That's clearly where my focus should go next.

@SteveMacenski has anybody tried to bisect code changes on the ROS or Nav2 sides to see if this is due to a code change, or is the behavior simply different on 22.04? (I don't want to set up a 20.04 VM to find out.) If there is a chance it's a code change, can you give me the hash of the last-known-good build?
Uncaught exception in the teardown of the behavior server causing us to drop the process on the floor?
It seems that there is a single global "default context". Attached to it is an rcl_context_t which we invalidate with a call to rcl_shutdown towards the end of the test. Then in the LifecycleNode destructor path somewhere we apparently try to use that same context again while creating a Timer, which is what is tripping the exception. Note the backtrace shows the Nav2 LifecycleNode destructor, not the ROS parent class destructor, but I'm not clearly seeing this code path yet. I'll look for it when I'm back online.
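If that reading is right, the failure mode would look something like this hand-rolled sketch (not Nav2 code, just an approximation of the pattern described above): entities created against the global default context after it has been invalidated by rclcpp::shutdown() will typically throw from the rcl layer.

```cpp
#include <chrono>
#include <iostream>
#include <memory>

#include "rclcpp/rclcpp.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("context_demo");

  // Invalidate the global default context, as the test teardown does.
  rclcpp::shutdown();

  try {
    // Anything that still holds the node and tries to create a timer now
    // (e.g. a destructor running after shutdown) trips an exception,
    // because the underlying rcl_context_t is no longer valid.
    auto timer = node->create_wall_timer(std::chrono::seconds(1), []() {});
  } catch (const std::exception & e) {
    std::cerr << "caught during late construction: " << e.what() << std::endl;
  }
  return 0;
}
```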
In summary: if the log lines are telling the truth, we are hitting an exception during shutdown, which could perhaps explain why we don't get a gcov dump with all the goods. Need to spelunk this Bond code. Additionally: we should consider whether our system tests ought to check for a 0 exit code here, since if this is a real bug we shouldn't be relying on declining code cov numbers to catch it.
Also, could somebody link me to the test logs for one of the bunk code cov runs, and also for the last-known-good run? Specifically I want the events.log that I see spit out in my sandbox runs, i.e. …
For the assisted teleop file you show, there has never been a 'good build' where the coverage was displayed properly, to my knowledge. I believe the first time I noticed, I added some prints to show that the functions were being called so I didn't hold it against the PR contributor since it wasn't something they did.
Semi-related question for my personal education: I've used GDB to grab backtraces on crashes when the cause is non-obvious in my development (my tutorial: https://navigation.ros.org/tutorials/docs/get_backtrace.html), but I haven't used it to try to debug shutdown failures before. Is there anything semantically different going on there, or is it the same old process? It honestly did not occur to me to apply the tool that way.

On topic: it is interesting to note, as I did in the longer comment, that when I switched from non-composed bringup (e.g. 1 server per process) to composed bringup (e.g. all servers in 1 process), we got a ton of our coverage results back. So I'm curious to know if the exception that's being raised on shutdown is unique to the Behavior Server, where we see that issue. If it's unique to this server, then that means there's something we can do in the code to fix it. Also, why would some of the behavior server tests give us results when others don't? Might be another clue.
Bond basically creates a set of publishers and subscriptions between 2 servers to make sure the other is still alive by passing heartbeats (a rough sketch of what that looks like is below). I only use it single-directionally, so that the Lifecycle Manager is checking if the server is still alive…

In a given CI run in Circle you can go to the … @ruffsl why would that file be missing? We run the artifacts collection stage before we run the Codecov collection, so is that the single reason why, or should we also move that directory to be something that the Artifacts are picking up?
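For anyone unfamiliar with the mechanism, here is a rough sketch of what creating such a bond looks like with bondcpp, loosely modeled on what the lifecycle nodes do. It assumes the ROS 2 bondcpp API of `Bond(topic, id, node)`, `setHeartbeatPeriod()`/`setHeartbeatTimeout()`, and `start()`; the timing values are illustrative, not Nav2's exact configuration.

```cpp
#include <memory>
#include <string>

#include "bondcpp/bond.hpp"
#include "rclcpp/rclcpp.hpp"

// Each side of the connection creates a Bond on the same topic with a
// matching id, publishes heartbeats, and watches for the peer's heartbeats.
std::unique_ptr<bond::Bond> createServerBond(rclcpp::Node::SharedPtr node)
{
  auto bond = std::make_unique<bond::Bond>("bond", node->get_name(), node);
  bond->setHeartbeatPeriod(0.1);   // seconds between heartbeats (illustrative)
  bond->setHeartbeatTimeout(4.0);  // declare the peer dead after this long
  bond->start();
  return bond;
}
```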
Typing on my phone, so apologies for the brevity. I'll look back at this in a few days.
You can probably use 'break exit' or 'break _exit' or some such, then step forward from there. You can breakpoint signal handlers too in that way, which is what you'd want here.
Yeah that is interesting. If my hypothesis is correct, then either a) we need to not be killing the global context before users of it have a chance to clean up, which would be a ROS bug I think; or b) we need to ensure the bond code isn't creating things during destruction, which may or may not be a Nav2-side fix. Whether or not this is our code cov issue, it certainly doesn't seem intended or correct? But I'm new here so I'm not sure!
Yeah, so this isn't the fix, but it gets us nearly full coverage on the assisted_teleop test. This just drops the Bond on the ground and never breaks it properly, which presumably shouldn't be the real bug fix. What this demonstrates is that Bond destruction is the source of the problem. It looks like on the ROS side there is a "pre-shutdown callback" which can be registered on the global Context, which gives us an opportunity to break bonds properly before things start going sideways. That could be the real fix. Thank you to the authors for that mechanism; they added it in 2021. ros2/rclcpp#1706
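For reference, registering such a pre-shutdown callback on the global default context might look roughly like the sketch below. The cleanup body is a placeholder; in Nav2 it would be whatever actually breaks the bond.

```cpp
#include <functional>

#include "rclcpp/rclcpp.hpp"
#include "rclcpp/contexts/default_context.hpp"

// Register a callback that runs before the global default context is torn
// down, so ROS entities (publishers, timers, bonds) can still be used safely.
void registerPreShutdownCleanup(std::function<void()> cleanup)
{
  auto context = rclcpp::contexts::get_global_default_context();
  context->add_pre_shutdown_callback(
    [cleanup]() {
      cleanup();  // e.g. break/reset the bond before rcl_shutdown runs
    });
}
```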
Ensure that we break bonds in lifecycle node/manager prior to rcl_shutdown. Partial fix for Issue ros-navigation#3271
There we go. I haven't fixed all of the exceptions, which means there is still some code cov we aren't seeing. It looks like it's going to be a case of whack-a-mole, unfortunately. You can see the commit for this first portion of the fix here. I'll put a PR up after I whack the other instances I've seen, and after writing some unit tests, but before doing anything else with the tests to catch these.
The collection and upload of code coverage are steps invoked after the workspace caching and artifact upload (of only logs and test result files) steps. This was done so that code coverage could be handled separately, depending upon whether the job was a release build or a debug build. The coverage info file can still be downloaded from the job's codecov report page, as that is where it's uploaded (and tagged by respective rmw). I didn't see the need to archive duplicate report files across separate hosting platforms.
Interestingly, this may be related to ros2/rclcpp#2083, so it's worth bringing up on the composition-results-disappear part of things. The nasty kill from composition may be due to the container (or more likely, that we're holding onto shared pointer instances and they're not being reset on destruction properly).
That seemingly makes sense 😆
We should be calling that in all of the task server nodes on deactivation, but maybe these experiments aren't bringing the lifecycle back down to a safe state? If so, then that destruction isn't called, so adding it to the …
The release vs reset on the unique_ptr should have the same result for us either way. Thanks so much for your time on this! I'm still super unclear why some work and some don't in this case, given that it's a shared issue, but I won't argue with progress 🥇
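(As an aside, here is the general release() vs reset() distinction being referred to, in plain standard C++ rather than anything Nav2-specific: reset() destroys the managed object immediately, while release() only relinquishes ownership and leaves destruction to whoever takes the raw pointer.)

```cpp
#include <iostream>
#include <memory>

struct Noisy
{
  ~Noisy() { std::cout << "Noisy destroyed\n"; }
};

int main()
{
  auto a = std::make_unique<Noisy>();
  a.reset();                   // destructor runs here

  auto b = std::make_unique<Noisy>();
  Noisy * raw = b.release();   // no destructor call; we now own the raw pointer
  delete raw;                  // destructor runs here instead
  return 0;
}
```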
Great points @SteveMacenski. I will say: some of these are stochastic bugs for whatever reason. I see some of them every other run. It would also not shock me if gcov flushed on events other than just exit, s.t. we see some coverage sometimes.
I should perhaps relax the composition change soon and see what happens! Hopefully we don't have more shutdown bugs lurking than I've seen so far... Now I've whacked four of these shutdown bugs, which is making me want to check back in with you all to make sure I'm not just chasing ghosts in my local clone here... I won't be shocked if you tell me that ROS and Nav2 never really prioritized getting clean shutdowns, and so we get these code cov issues as a result of that. I also won't be shocked if you tell me "Matthew, you aren't going to be able to fix all the shutdown issues" or "you're setting it up wrong" :)
When I ran this and then used Ctrl+C, I got a segfault (stack below). Applying this change, the segfault goes away:
Could somebody else confirm that behavior, i.e. that they see the segfault on shutdown, and don't with my change? I see at least one more of these issues to whack. If you all think I'm not chasing ghosts here, I can keep tracking down at least some of the others.
Ensure planner_server shuts down cleanly on SIGINT. Needed to ensure code coverage flushes. See Issue ros-navigation#3271.
Ensure costmap_2d_cloud does not seg fault on SIGINT. One of several clean shutdown issues being addressed via Issue ros-navigation#3271.
That's exactly the idea, yeah, except that the destructor fires too late (after rclcpp::shutdown()) in the case where we are letting the dtor fire at scope end. So: we need to leverage pre-shutdown callbacks to do the cleanup where onDeactivate isn't being called (which is presumably some of these SIGINT cases). Feel free to add me as some sort of Collaborator with restricted permissions if you want to assign this to me!
With composition=True I seem to hit ros2/rclcpp#2083. So it seems we have at least two issues, and the composition change worked around one of them.
Pre-composition, we did 😄. Post-composition, I've been a little preoccupied with other package developments, so I haven't had the time to surface and look into them, since no one has reported an issue with it (yet).
I've literally never used some of those standalone nodes in Costmap2D before; they're legacy from ROS 1, ported over here. They're inarguably useful, so they stay around, but I rarely find a reason to visualize the costmaps in that way. So for any issues with those standalone nodes, it's likely that very few people interact with them enough to be aware. The "main servers" in the launch files are what's typically used.
I know that in classes, the default destructor will destroy members in the reverse order they're stored in the class. I would assume (?) that the same is true for globals, though I program with globals nearly never, so I've never asked myself that question. You may get a similar result just by reversing the ordering they're declared in when the …
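(For what it's worth, a small plain-C++ illustration of that point: class members are destroyed in reverse order of declaration, and objects with static storage duration within one translation unit are destroyed in reverse order of their construction.)

```cpp
#include <iostream>

struct Member
{
  explicit Member(const char * name) : name_(name) {}
  ~Member() { std::cout << name_ << " destroyed\n"; }
  const char * name_;
};

struct Holder
{
  Member first{"first"};    // declared first, destroyed last
  Member second{"second"};  // declared second, destroyed first
};

int main()
{
  Holder h;
  return 0;
  // Output at scope exit:
  //   second destroyed
  //   first destroyed
}
```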
I'm recompiling a clean workspace now due to some API breaks from updates; I'll let you know in like an hour.
That seems.... very, very problematic, considering not all ROS nodes are lifecycle nodes with the deactivation/cleanup stages. I don't think you should see this behavior when you use …
OK! Will do. Hopefully I'll get a few minutes today to go through and fix the Theta* issue; that popped up about 10-14 days ago. I'll fix that + the annoying mergify popup that incorrectly flags messed-up formatting of the PR header. There are some massive coverage misses that are new and unusual in that PR.
* Ensure that we break bonds in lifecycle node/manager prior to rcl_shutdown. Partial fix for Issue #3271
* Add unit tests for lifecycle_node rcl shutdown cb. For Issue #3271.
* Ensure planner_server shuts down cleanly on SIGINT. Needed to ensure code coverage flushes. See Issue #3271.
* Ensure costmap_2d_cloud does not seg fault on SIGINT. One of several clean shutdown issues being addressed via Issue #3271.
* Fix flake8 issues in some test files

Signed-off-by: mbryan <matthew.bryan@dell.com>
Co-authored-by: mbryan <matthew.bryan@dell.com>
Thanks to @mmattb for some really awesome debugging and fixes that took us all the way down to rclcpp with some of these issues!
It was a pleasure Steve! Thanks for letting me jump in. |
CC @jwallace42
This is especially odd since the other plugins (wait, spin, backup) are getting their metrics reported correctly, and backup uses a lot of drive on heading as part of its code. I can't seem to find any reason for this after a couple of hours of initial debugging.