-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add initial skeleton for LTTng tracing in navigation2 #2788
Conversation
Signed-off-by: Víctor Mayoral Vilches <v.mayoralv@gmail.com>
@vmayoral, please properly fill in PR template in the future. @SteveMacenski, use this instead.
|
I left it as requested in the template. Just curious, why did mergify complain? |
It's being really flaky, sorry, you're fine. It's like wack-a-mole at times with the regex for detecting if its changed. I test a change, it works, then a week later it comes back... I should just remove it but I feel like if I keep seeing it I'll eventually get to fixing it up again. |
Does this have any impact on licensing and/or the licenses which this package can be made available under? I see LTng is a GPL-variant license which makes me immediately suspect about an integration here where everything is BSD or Apache (except AMCL but for historical reasons we can't do much about that). I'm not clear as to why there's a fork or what specific changes were made relative to that fork. Can you shed some light here? In other words: if The last question is more abstract - why does this need to be in the Nav2 stack proper? I wonder if tooling for performance characterization / bringing to light issues should be in an auxiliary repository within ros-planning (ex nav2_performance or nav2_tracing or something). We have a few auxiliary repos for different project subsets that don't make sense to be in the main codebase itself like our website, tutorials, and in-the-works advanced features |
The contributions are licensed as Apache 2.0 in alignment with ros2_tracing. I don't there's any conflict with that but I'll defer to @christophebedard and @iluetkeb who may know more about it as the original authors of the LTTng integration.
That's argued here and here. I'll paste the most important parts: The rationale behind this comes from the the experience of having worked for a few months with the ROS 2 tracing infrastructure. To me, it eventually become rather cumbersome to have to maintain a single tracetools repo with all tracepoints (I experienced this directly with an extension of tracetools_acceleration we keep internally). Altogether this led me to realize that it might be better to simply add a From @christophebedard: This looks good to me. For reference, a similar rationale wrt the ROS 2 core itself was brought up a few months ago here: ros2/rclcpp#1738 (comment). If these were tracepoints for the ROS 2 core (i.e., currently rclcpp and below), I'd prefer to avoid forking tracetools/putting the tracepoints in other packages to avoid the usual fragmentation of open source code, but this isn't really the case. It's more of an extension, i.e., additional instrumentation for application-level layers above the core, so I think this is the way to go for these use-cases. Besides, there is nothing about LTTng that requires instrumentation (or, rather, tracepoints) to be centralized like it is currently for the ROS 2 core, and the interactions with the Trace action are not really affected by this choice.*
There's no specific instrumentation introduced just for specific nav2 packages just yet, just the core of the LTTng integration. Specific instrumentation will need to be added in follow up PRs based on what we decide to research. We could add that in here if that's needed. My first approach has been incremental given the novelty of the topic for nav2 (and also because reviewing will become easier that way). See here for an example instrumenting
I think there's merit in this proposal (and that was my first thought as well producing tracetools_acceleration, which aims to be generic for all hardware acceleration repos). The proposed code changes in here boil down to a new ROS 2 package called This could be done if there's consensus at the ros-planning level but note that, 1) nav2 (and all other subprojects of ros-planning) will still need to have a build dependency on such package (e.g. |
lttng-ust, which is what is being linked by ros2_tracing if enabled, is LGPL-licensed, and all its headers are MIT licensed, see The LGPL is well understood and entirely compatible to link with with Apache-2.0 and BSD code. Since all headers are MIT-licensed, there is also no problem from including them. btw, you're also asking what the potential benefit is. For that it may interest you that the old latency/phase-shift analysis I did for the ROS1 nav stack was based on custom tracepoints. With that, I could show that due to a lack of synchronization between the map updater and the planners, use of new perception data in planning had an extremely large jitter (up to 200ms in our experiments). Again in our experiments this reduced accuracy of obstacle avoidance by 5cm (meaning, we had to include an additional 5cm tolerance left and right of the robot, on top of the tolerance necessary due to gridmap resolution and sensor accuracy, etc, to prevent collisions). And this was a for a slow-moving robot (0.5m/s). Just imagine what it would do to a faster robot. This issue had been in the ros1 navstack for ages (in fact, it still is), and it took tracing to find it. And regarding performance: The Linux kernel is full of tracepoints. They use a bit different technology, but fundamentally the same approach. When you turn them all on, you definitely notice it, particularly for heavy things like tracing the scheduler which can have millions of events per second. But you only do that when you're measuring, and all other measurement methods have even worse impact. That's why everybody is profiling using tracing these days. When turned off, the impact of tracepoints is almost zero. |
OK, but what does that look like here? Right now, I don't see anything Nav2 specific and I'm not sure what adding tracepoints actually looks like in the context of this package. I'm looking to get an idea of what I'm getting myself into here. Looking at your image pipeline example makes me think that this is incredibly intrusive to the source code which I'm not a fan of unless these are removable broadly and only inserted when being actively used. Adding a bunch of these tracepoints in the source code would make some files much less readable with the flow of logic actually occurring. But I could see a situation where we have none in by default, but the ability to add them easily for users. Hence, the ask "what does this look like". Another good question is - how much maintenance does something like this need and who's doing that maintenance? I suppose is if these are more specific to packages, would it be sensible to instead have Note that I have not used this tool before, valgrind / GDB have served me well so far. So a bit of this is me getting up to speed on this. I'm not opposed to additional options, but I want the value of any software added to this project to be clear and immediately useful to some stakeholder. I don't add things "just because" based on a perceived need. I want at least one specific stakeholder I can point to that is going to use this work in their system or impact change in the project to ultimately merge it. It has been mentioned a few times that tracepoints add "essentially" no overhead. Do you have some metrics on what that means? Adds 0.1%, 1%, 10%, etc? Is this something that can be compiled into it only when used with a flag versus always being in the background? That would also help decouple a centralized Do you have a "tracing for idiots" tutorial link that I can potentially use to learn about this in the more abstract format before thinking about all of the ROS-y and Nav2-y specifics? I think step 1 here is just for me to know "what is this thing, why is it better than existing tools, and how can I use it". (cough and write a tutorial about it for both my own memory and for others). No matter what, I think we'll need a pretty detailed tutorial for the website to go along with this package so that users can know how to leverage it. That would be analogous to the image pipeline (or any other packages you've added it to) so it can be a central location for "working with tracepoints in ROS2"-style tutorial. Which would be nice to have 😄 Then to @iluetkeb's comment
Do you have a link to that? That sounds familiar, but jogging my memory now in context with details would help this discussion. I would like to use that as a baseline to measure this metric on Nav2 as well to see what's up. Fundamentally, the costmap and planner relationship hasn't really changed so I suspect that is still the case, so it would be good to see that and understand what other types of things tracing might be valuable to expose in the project. Edit: Another good question is how this would impact windows / mac users |
@SteveMacenski this was indicated above, see https://github.com/ros-perception/image_pipeline/pull/717/files#diff-5e106d08b6f4807955c147444021dc9e2a50dc414e0a5555428cf7c59f51ea74R81-R93 for a similar example but targeting the Again also, kindly note
If needed, those can be added here. We can pick a target and add a few probes. I just splitted the work adding first the skeleton and later additional probes over the skelethon for a) a first pass/assessment on your side (which is what's happening in here) and b) simplifying the review process.
Very little maintenance in my experience. Once the probes are introduced into the source code, they shouldn't be modified (unless we move into newer LTTng version that introduce API changes, which is not very likely I believe). I'm happy to help with the maintenance with the spare cycles that I have.
That's the way it works. We need a ROS 2 package that provides the tracepoint definitions (that's the proposed
See @christophebedard and @iluetkeb's https://arxiv.org/pdf/2201.00393.pdf, it's a good read for answers on the overhead. Tracing is disabled by default unless you have LTTng installed in your workstation. See https://gitlab.com/ros-tracing/ros2_tracing#tracing for more details on its operation. I personally prefer triggering tracing session from launch files (which frankly, it's pretty cool). See https://github.com/ros-acceleration/acceleration_examples/blob/main/nodes/faster_doublevadd_publisher/launch/vadd_capture.launch.py#L36-L50 for an example.
Yes, but I'd discourage this for a development tree. My experience so far indicates that you want your tracing close to your core source code (that's why this PR is in here and not against https://gitlab.com/ros-tracing/ros2_tracing. Note that most of the ROS 2 packages these days (official distros) ship with tracing. The impact is minimal (read more about it in the instrumentation design). For production environments, you can disable tracing at build-time as follows: colcon build --cmake-args " -DTRACETOOLS_DISABLED=ON" Note LTTng introduces huge advantages when compared to other benchmarking/tracing alternatives that operate at the userspace. LTTng does it from the kernelspace. Regarding documentation comments, +1, that was proposed above as well:
|
Sorry, its taken me some time to think this over and try to come up with some concise and clear thoughts. Is including this all that is needed before being able to include and use tracepoints in the source code itself? If so, what I'm thinking now is the following
This gives folks the ability to use it if they want, but not impact the source code if they're not actively working through an issue. It provides the documentation required to set it up so they can work with it, similar to the traceback tutorial so that we don't have to randomly guess what areas are most useful for users to have tracepoints added to for figuring out various issues. I don't want to be dealing with a flood of PRs to add tracepoints to every nook and cranny of the codebase on a constant basis based on the specific debug needs of a specific user. If they're working out a problem, they should have targeted tracebacks to work through that issue. Its not as much about run-time impact (though I would want to understand that) as much as it is about maintenance load it could add and decreasing code readability for contributors. I can get used to anything (heck, 5 years ago I wasn't even writing tests and now I write tests and documentation like its nothin'), but making sure that things are clean, fluid, and readable is an important point to keeping the barrier of entry low to get community users to contribute back and get involved in the project. I could, however, be convinced to add tracepoints into the stack itself if the following is met, which doesn't have to happen immediately
Basically, I want a user, a process, and the documentation for a typical user to be able to leverage it so the feature would reap benefits for the userbase. Does that sound fair? That would resolve my concerns described above. |
This sounds fair to me but it may take some time to get it all together.
This is the first thing I believe we should tackle to gain acceptance of tracing within the community. Beyond Nav2, I think this is of interest for everyone in the community so I'd make it ROS 2 generic. @christophebedard, would you like to partner and put something like this together for ROS 2? I know there exists one really good resource1 for ROS you created in the past. Would you be open to work together on this and put time to renew it for ROS 2? What's the best location for it? The I particularly like the idea of the analysis @christophebedard did around a simple "pipeline" of nodes. This intersects a bit with some of my "day work" at Xilinx so I can probably put quite time on it soon-ish. Footnotes |
I agree with all of the above.
I agree that we need some good documentation, with explanations and tutorial(s) for any and all contributors, to keep the barrier of entry as low as possible. We do have some documentation/tutorials, like this tutorial in the RTWG documentation, but the scope is fairly limited. I think it's best to re-use & link to existing resources whenever possible. For tracing in general, it could be this one. Simply adding that link to the This is similar to what we want to do for real-time-related documentation (ros-realtime/community#32).
I'm working on an equivalent & (big) improvement for ROS 2 😜 but unfortunately that means I can't really commit to spending much time on this documentation effort in the near future. At least not until May. However, I created an issue over on the |
I think a tutorial for the Nav2 or ROS 2 docs should look something like
Essentially tracing 101 like https://navigation.ros.org/tutorials/docs/get_backtrace.html |
If you guys can come up with a use-case (a particular kind of analysis) that would be particularly interesting to the Nav2 community, Bosch CR could might be able to help implementing and documenting it. It won't be short-term, more like in the next 6-9 months, but judging from the discussion you could live with that time-frame, right? |
That sounds fantastic. Absolutely no rush, most of the work too in Nav2 is usually scheduled around that kind of time block considering most of the contributors are volunteers working on spare time. Any analysis on any timetable you're able to is A-OK with me. I think a few things would be interesting to understand
Certainly some analysis in the algorithms are always welcome to help us improve them, but I'd say that that would take a backseat to the higher level architectural elements above. While I'm sure there are places we can improve performance in the algorithms, that's not where jitter / blocking / etc derives from that such analysis would bring to light. |
Happy to reopen, but I don't like stale things sitting around while there are action items. I think this discussion, when its ready to be continued, should be in a separate ticket and then we can reopen / open a new PR as appropriate from the outcome of those discussions. There are still yet a few documentation / explanation / role questions to be answered that are outside of this PR / repo so this is blocked for the immediate future. |
Basic Info
This contribution aims to bring the first few building blocks of LTTng instrumentation to the ROS 2 navigation stack. It aligns with ros-perception/image_pipeline#717 which added instrumentation to a perception stack package and also with REP-2008 PR benchmarking architecture.
Particularly, in a similar fashion to what was discussed here for
image_pipeline
, this PR forks tracetools and creates a new ROS 2 package namednav2_tracetools
which hopes to include only the meta data andTRACEPOINTS
fornavigation2
. This should increase the granularity level of the LTTng instrumentation making it more easy to maintain (instead of having to send PRs to tracetools and iterate in longer cycles).The current PR adds the building blocks but no package has been instrumented just yet to simplify the review and testing of these initial changes. First candidates for instrumentation are costmaps and controllers as indicated in past WG meetings. Such instrumentation can come in follow up PRs with analysis examples of the probes and resulting LTTng traces, as well as documentation extensions on how to benchmark such components with practical examples.
Description of contribution in a few bullet points
nav2_tracetools
package within the navigation2 meta-packageros2_navigation2
(i.e. all traces related to nav2 could be filtered by using this tracepoint provider). Similarly, launch files can easily enable/disable traces of nav2 using this tracepoint provider.Description of documentation updates required from your changes
No direct documentation changes required as part of this PR. Once first navigation2 components get instrumented, it'd be nice to have that aligned with documentation tutorials that show how to make use of it.
Future work that may be required in bullet points
To simplify review of changes, this PR is limited to the initial skeleton of the required changes for LLTng integration into nav2. The following work should be delivered in upcoming PRs:
For Maintainers: