New warning message for repeated transforms #467
Comments
+1 for reverting this or at minimum silencing that warning. There are plenty of legitimate reasons to update a transform at a given identical timestamp due to uncertainty in the system.
Sorry for the direct mention, but do you have any thoughts on this, @tfoote?
Same here, in favor of reverting this warning. I am getting flooded with this warning when using cartographer.
My first choice would be to get rid of the warning and overwrite the existing transform, if that's an option.
I created PR #475 to fully restore the old behavior. Feel free to merge or close.
The warning is there to tell you not to use it, because it's not doing what you thought it was doing in the past. Please see the discussions in #414 and #426 for why repeated data was changed to be dropped, not replaced. The system is designed to propagate final estimates of positions. If you expect to revise an estimate, don't publish it until you're confident you don't need to revise it. The tf system is designed to be latency tolerant, and I believe that most use cases would be better off waiting for the extra cycle than getting earlier, less accurate data. tf users will have to wait for the union of all their links to become available. If there are very tight latency requirements, they shouldn't go through tf but should use one of the other channels, like the odom topic, which gives the latest information directly. We can design an extension to the system to support mutable data and make sure there's consistency. But just trying to overwrite every timestamp by relying on timestamp alignment over a potentially lossy transport will lead to data inconsistencies and potentially very invalid values. Out-of-order delivery also makes this problematic, as the latest one "wins". So in most cases it might just work, but that's not a safe solution.
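To make the "wait for the extra cycle" idea above concrete, here is a minimal C++ sketch (an illustration added for this write-up, not code from tf2 or from this thread; the class and member names are made up): hold the newest estimate back until a later cycle supersedes it, so an out-of-order measurement can still revise it before the transform is ever broadcast.

```cpp
#include <tf2_ros/transform_broadcaster.h>
#include <geometry_msgs/TransformStamped.h>

// Sketch: defer broadcasting an estimate until a newer stamp supersedes it,
// so late-arriving measurements can still revise it in place.
class DeferredBroadcaster
{
public:
  // Call once per filter cycle with the current best estimate.
  void update(const geometry_msgs::TransformStamped& estimate)
  {
    // A newer stamp means the pending estimate will no longer be revised,
    // so it is now safe to broadcast it exactly once.
    if (have_pending_ && estimate.header.stamp > pending_.header.stamp)
      broadcaster_.sendTransform(pending_);

    // Keep (or overwrite, if the stamp repeats) the latest revisable estimate.
    pending_ = estimate;
    have_pending_ = true;
  }

private:
  tf2_ros::TransformBroadcaster broadcaster_;
  geometry_msgs::TransformStamped pending_;
  bool have_pending_ = false;
};
```

In the robot_localization scenario from this issue, the corrected estimate for t1 simply replaces the pending one before anything is published, so no duplicate stamp ever reaches the tf buffer, at the cost of one cycle of latency.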
Thanks for that. I have a better understanding of the change now. While I can see the reason for not continually inserting the same transform, I have two concerns about not permitting the overwrite:
This would be an ideal use for sequence numbers, but I know those were deprecated (and I understand why). I like the idea of designing an extension to allow for mutable transforms. If that is something for which nobody has extra cycles, would you consider changing this behavior so that, at least in this instance, we only issue a
Also, there's a bit of an issue with assuming that we have knowledge of / control over when we're ready to publish a transform for a timestamp. If you have a constantly cycling thread updating a filter, you don't know at
+1
Repeating my comment from #475 (comment):
The system is designed to handle latency. In a distributed system we know we have latency. You can choose to publish the refined pose for

I'm open to adding throttling to the warning, but I want it to be loud; it should not be ignored. The insidious silent failures of the past are as bad or worse. As I've said, we can consider creating another, better mutable transform, analyze it for all the different use cases, and make sure that it's consistent across time, robust to data loss, etc.
I think just warning once on the first occurrence is sufficiently loud, and it seems @ayrton04, @doisyg, and I like that option the best as an immediate patch. While I understand your point, a change this substantial should have been opened up for discussion; it looks like it was merged day-of (#459), and now there are 6 users here commenting/emoji-ing (there must be a verb for that now...) that this is a real problem for us. There should be an option to turn this off if you want this protection on by default. There are legitimate needs for not having this, as Tom points out.
For instance, let's imagine you have a global positioning data source that gives you hyper-accurate corrections at irregular intervals, based on when it has visibility of a target over cellular in the forest. This data source is being fused into a program that is doing event-based fusion (new data comes in over a topic and triggers an update to TF). This filter has this positioning data coming in and also some odometry source at 1 Hz. At time

I agree that for most systems this should be fine over a longer time horizon (after reducing logging); however, if you have very sensitive equipment, like that which ROS 2 is now supposed to be able to support, something like this could be a show stopper.
I am seeing this issue and it seems to be isolated to the Melodic-to-Noetic upgrade. Maybe the issue should be addressed in
This is not helpful. The specific PR you point to calls out in its description that it's just a rebase of the long-standing PR #426; it was specifically targeted to only land in the new rosdistro release. And it was given a strong warning to make it clear to downstream developers/users that they should pay attention here and fix the broken usage.
This is a bit of a contrived example to hit precisely the use case you call out. Your example of just updating one point is very unlikely: you have a high-precision, high-latency system, yet its data is still coming in within one cycle of the odometry. Odometry is typically published much faster to allow for any of these issues, and, following REP 105, odometry should never jump. And really, this is the standard setup that we use all the time to differentiate odometric updates from localization updates. The more powerful use case is something like loop closure for SLAM, where the full localization history will want to be revised, and I would be quite happy to brainstorm a design to support that. I've got some good ideas after doing some thought experiments. However, that's well out of the scope of this discussion.

The robot_state_publisher and other implementations should be changed to not try to overwrite previously published data. What they thought they were doing was not happening. Instead it was producing memory growth in all tf listeners on the system, as well as causing discontinuities in the data and potential race conditions across the distributed system. All of this silently passed and was likely degrading the experience of all users. This warning has shown the places where this misuse of the API was causing problems, and I'm fine with throttling the warning. But the warning is there for a reason: to let you know that it's not doing what you thought it was doing. It never did, and now it's warning about it.

As I've said, I'm OK with throttling the warning if the volume is very high. We definitely shouldn't only warn once: this can be an ongoing issue triggered by a remote node's activity, it may be triggered by more than one node at different times during runtime, and if you're debugging the behavior, the warning should be displayed at the same time as the behavior is happening.
Hi, Tully and I got on the phone and we decided on the following compromise:
Anyone have an issue with that compromise?
This issue has been mentioned on ROS Discourse. There might be relevant details there: https://discourse.ros.org/t/ros-2-tsc-meeting-agenda-2020-09-17/16464/1
I'll just add our point of view. We've got a mobile robot with 50 Hz odometry and 0.3 Hz SLAM. These pretty slow SLAM updates forced us all the time to really think things through. What we ended up with is a TF tree like this:

The structure is kind of not ideal (

This clearly allows nodes to select whether they are interested in precise transforms (by transforming into

I know this doesn't solve all the possible use cases, but it seems to me that in our case it is a more "correct" approach than overwriting the history. And it actually forced us into thinking about which nodes need fast data and which need precise data. With overwriting history, we'd never know which we actually got...
That structure doesn't make a whole lot of sense to me; however, I don't want to digress into that on this issue. Thanks for the datapoint from another user. Do you have any objections to our proposal, though?
The proposal seems okay to me. Maybe this issue points to some places in the TF tutorials that could be improved to better explain what is and is not possible to do within the framework...
I think the proposal sounds really solid, thanks @tfoote and @SteveMacenski. Meanwhile, I have a PR that prevents r_l from re-publishing data with duplicate timestamps.
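For readers curious what such a guard might look like, here is a rough hypothetical sketch (not the actual robot_localization patch; the function and variable names are made up for illustration): skip the broadcast whenever the stamp has not advanced past the last published one.

```cpp
#include <tf2_ros/transform_broadcaster.h>
#include <geometry_msgs/TransformStamped.h>
#include <ros/time.h>

// Sketch: only broadcast when the stamp has advanced, since a listener would
// drop a repeated stamp and emit TF_REPEATED_DATA anyway.
void publishIfNewStamp(tf2_ros::TransformBroadcaster& broadcaster,
                       const geometry_msgs::TransformStamped& transform,
                       ros::Time& last_published_stamp)
{
  if (transform.header.stamp <= last_published_stamp)
    return;  // duplicate (or older) stamp: skip the re-publish
  broadcaster.sendTransform(transform);
  last_published_stamp = transform.header.stamp;
}
```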
I'm not precisely sure of the mechanics to make that happen. https://github.com/ros/geometry2/pull/459/files#diff-23f1780342d49d752e5fabba0fa7de1cR278 it looks like that
I think it is a good compromise. Throttling at 10 s is fine and will not flood the console anymore. However, if we want users to notice it, maybe leave it at ROS_INFO or ROS_WARN?
I'm seeing the same error with pristine
Might this be the result of multiple
Okay, looks like the error is coming from my
The TF_REPEATED_DATA error, while important, can be overly verbose, sometimes flooding the terminal scrollback buffer such that no other output can be found. Throttle the error to 10 seconds. Implement the throttling directly, since console bridge does not implement throttling. Because ros::Time may not yet be initialized when calling insertData, use the stamp from the message as "now". For more information, including the justification for the warning and the justification for a 10 second throttle, read the discussion on issue ros#467.
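As a rough illustration of the approach described in the commit message above (hand-rolled throttling keyed on the message stamp rather than ros::Time::now()), something along these lines would do the job; this is a sketch, not the actual geometry2 patch, and the function name is invented.

```cpp
#include <ros/time.h>

// Sketch: emit the warning at most once per 10 s, measured with the incoming
// message's own stamp so it also works before ros::Time has been initialized.
bool shouldLogRepeatedData(const ros::Time& msg_stamp)
{
  static ros::Time last_logged;               // zero until the first warning
  const ros::Duration throttle_period(10.0);  // the 10 s chosen in the commit

  if (last_logged.isZero() || (msg_stamp - last_logged) > throttle_period)
  {
    last_logged = msg_stamp;
    return true;   // caller prints the TF_REPEATED_DATA error
  }
  return false;    // still within the throttle window, stay quiet
}
```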
This one hides the entire warning message. Tested on ROS Noetic.
I would like to point out that solutions like this aren't really great, as they mess around with terminal formatting (e.g. tqdm progress bars). Then again, so do the original warnings being spammed to stderr.
Is there any other way to resolve this problem?
Can you explain how to use this? Typing
I have the same question as @raabuchanan.
The usage is correct. Are you getting the same warning message as mentioned in the original issue? Note: as mentioned above, be careful; it might mess up the terminal formatting.
Do you know how to ensure that it only suppresses TF_REPEATED_DATA? Whenever I use this command and there is some error in my node, it just crashes with no error messages.
My humble contribution: the warning easily floods the logs when running simulations at relatively low frequencies; e.g. good old Stage publishes /clock at 10 Hz, and it's not uncommon to configure Gazebo to a similarly low frequency on not-very-powerful machines. All that said, the positive side is that we are now much more aware of how tf2 works. Thanks @tfoote for the great explanation, and for making it so clear why we have this warning.
That's an interesting case. It's a completely correct warning that the data is being dropped. If time is only turning over at 10 Hz, what value is publishing

Another moderately simple option is to change the logging level up to ERROR or higher to avoid the warnings for those nodes.
The thing is that, in an ideal world, real-robot and simulation code should be 100% the same. On real robots you don't need to worry about discrete time, so we shouldn't have to in simulation either.
Yes, that's why I'm asking why things are turning over faster than the clock: why or how is something turning over and publishing faster than time is updating? If you have a timer, it can only get triggered once per clock tick. If anything sleeps, it should not return until the clock ticks over again. The joints and anything physical don't change until the simulator updates. So why is something publishing multiple times before the clock ticks? When you slow down or make time jump in large discrete steps like 1/10th of a second, it's expected that some things also lose precision. The general contract with sim time and discrete time is that things don't keep running. This is what allows you to pause the simulation: when you pause the simulation, everything should stop. And until things tick over, there generally shouldn't be new events being triggered. The system can be thought of as ticking over once per clock tick, and then pausing until the next clock tick, because they don't know when/if the next tick will come.
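A minimal sketch of that contract from a node's point of view (an assumed pattern, not tf2 or Stage code; the helper name is made up): do one unit of work per /clock tick and otherwise wait, so nothing ends up publishing the same simulated timestamp twice.

```cpp
#include <ros/ros.h>
#include <functional>

// Sketch: run `work` at most once per clock tick. With use_sim_time set,
// ros::Time::now() only advances when a new /clock message arrives
// (10 Hz in the Stage example above).
void spinOncePerClockTick(const std::function<void(const ros::Time&)>& work)
{
  ros::Time last_tick;
  ros::Rate idle(100.0);     // brief sleep between checks for the clock advancing
  while (ros::ok())
  {
    ros::spinOnce();
    const ros::Time now = ros::Time::now();
    if (now > last_tick)     // the clock has ticked over since the last iteration
    {
      work(now);             // e.g. publish transforms stamped with `now`
      last_tick = now;
    }
    idle.sleep();
  }
}
```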
The only thing that has ever worked for me is this:
I just released robot_localization for Noetic, and I'm noting that while the tests pass, we are getting spammed with an endless stream of this message:

For background, this is a common scenario with r_l:

Cycle 1: receive measurement from sensor with time stamp t1. Fuse measurement. Publish state (and transform) at time t1.
Cycle 2: receive measurement from sensor with time stamp t2 < t1. Rewind filter, insert measurement from t2, fuse measurement from t2 and then measurement from t1. Publish corrected state at time t1.

From the way things look, it seems that tf2 used to re-insert the transform, until this change: eefb509

That would have been helpful for the use case in question, as overwriting the first transform is what we'd want to do. However, it was later changed: 0462538#diff-c25683741e1f7b4e50eb6c5cdcad9661R275

Now insertData returns false and prints out an error. Either of the previous behaviors would be preferable to the new one. What are the ramifications of restoring the behavior from this commit?