
New warning message for repeated transforms #467

Closed
ayrton04 opened this issue Jun 4, 2020 · 56 comments

Comments

@ayrton04

ayrton04 commented Jun 4, 2020

I just released robot_localization for Noetic, and I'm noticing that, while the tests pass, we are getting spammed with an endless stream of this message:

Warning: TF_REPEATED_DATA ignoring data with redundant timestamp for frame base_link at time 1432235782.066014 according to authority unknown_publisher
         at line 278 in /tmp/binarydeb/ros-noetic-tf2-0.7.1/src/buffer_core.cpp

For background, this is a common scenario with r_l:

Cycle 1: receive measurement from sensor with time stamp t1. Fuse measurement. Publish state (and transform) at time t1.
Cycle 2: receive measurement from sensor with time stamp t2 < t1. Rewind filter, insert measurement from t2, fuse measurement from t2 and then measurement from t1. Publish corrected state at time t1.

From the way things look, it seems that tf2 used to re-insert the transform, until this change:

eefb509

That would have been helpful for the use case in question, as overwriting the first transform is what we'd want to do.

However, it was later changed:

0462538#diff-c25683741e1f7b4e50eb6c5cdcad9661R275

Now insertData returns false and prints out an error.

Either of the previous behaviors would be preferable to the new one. What are the ramifications of restoring the behavior from this commit?
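
A minimal, standalone sketch of the behavior described above, written against tf2 0.7.x (the frame names, stamp, and authority string here are made up for illustration): the second setTransform() call with an unchanged timestamp should be rejected and logged rather than overwriting the stored transform.

    #include <ros/time.h>
    #include <tf2/buffer_core.h>
    #include <geometry_msgs/TransformStamped.h>
    #include <cstdio>

    int main()
    {
      tf2::BufferCore buffer;

      geometry_msgs::TransformStamped tf;
      tf.header.frame_id = "odom";
      tf.child_frame_id = "base_link";
      tf.header.stamp = ros::Time(100.0);   // t1
      tf.transform.rotation.w = 1.0;        // identity rotation, zero translation

      // Cycle 1: first estimate for t1 is accepted.
      bool first = buffer.setTransform(tf, "r_l_demo");

      // Cycle 2: the filter was rewound and produced a revised estimate for the same t1.
      tf.transform.translation.x = 0.05;    // revised value
      bool repeated = buffer.setTransform(tf, "r_l_demo");

      // On tf2 0.7.x this should print "first: 1, repeated: 0", and the second
      // call logs the TF_REPEATED_DATA error instead of overwriting the data.
      std::printf("first: %d, repeated: %d\n", first, repeated);
      return 0;
    }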

@SteveMacenski
Member

+1 for reverting this or at minimum silencing that warning. There are plenty of legitimate reasons to update a transform at a given identical timestamp due to uncertainty in the system.

@ayrton04
Author

Sorry for the direct mention, but do you have any thoughts on this, @tfoote?

@doisyg

doisyg commented Jul 16, 2020

Same here, in favor of reverting this warning. I am getting flooded with this warning when using Cartographer.

@ayrton04
Author

My first choice would be to get rid of the warning and overwrite the existing transform, if that's an option.

@Hugal31

Hugal31 commented Jul 17, 2020

I created PR #475 to fully restore the old behavior. Feel free to merge or close.

@tfoote
Member

tfoote commented Jul 20, 2020

The warning is there to tell you not to use it because it's not doing what you thought it was doing in the past.

Please see the discussions in #414 and #426 for why it was changed to drop rather than replace the data. The system is designed to propagate final estimates of positions. If you expect to revise an estimate, don't publish it until you're confident you don't need to revise it. The tf system is designed to be latency tolerant; I believe that most use cases would be better off waiting for the extra cycle than getting earlier, less accurate data. tf users will have to wait for the union of all their links to become available. If there are very strict latency requirements, the data shouldn't go through tf but should use one of the other channels, like the odom topic, which gives the latest information directly.

We can design an extension to the system to support mutable data to make sure there's consistency. But just trying to overwrite every timestamp by using timestamp alignment over a potentially lossy transport will lead to data inconsistencies and potentially very invalid values. Also potentially out of order delivery makes this problematic as the latest one "wins". So in most cases it might just work, but that's not a safe solution.
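
To make the "don't publish until you're confident" suggestion concrete, here is a rough sketch of holding the newest estimate back for one filter cycle so that a late measurement can still revise it before anything reaches /tf. The class and method names are illustrative, this is not r_l code, and it assumes the node has already been initialized:

    #include <ros/ros.h>
    #include <tf2_ros/transform_broadcaster.h>
    #include <geometry_msgs/TransformStamped.h>

    // Hold the newest estimate for one cycle so a late measurement can still
    // revise it before anything is broadcast (illustrative only).
    class DelayedBroadcaster
    {
    public:
      // Called every filter cycle with the current best estimate.
      void onEstimate(const geometry_msgs::TransformStamped& estimate)
      {
        if (have_pending_ && pending_.header.stamp < estimate.header.stamp)
          broadcaster_.sendTransform(pending_);  // now final: publish exactly once
        pending_ = estimate;                     // may still be revised later
        have_pending_ = true;
      }

      // Called when a delayed measurement forces a rewind: revise the estimate
      // for the stamp that has not been broadcast yet.
      void onRevision(const geometry_msgs::TransformStamped& revised)
      {
        if (have_pending_ && revised.header.stamp == pending_.header.stamp)
          pending_ = revised;  // overwrite locally; nothing was published yet
      }

    private:
      tf2_ros::TransformBroadcaster broadcaster_;
      geometry_msgs::TransformStamped pending_;
      bool have_pending_ = false;
    };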

@ayrton04
Author

Thanks for that. I have a better understanding of the change now.

While I can see the reason for not continually inserting the same transform, I have two concerns about not permitting the overwrite:

  1. IMO, there may be perfectly valid cases where less accurate data is better than old data. I'd be willing to bet there are a fair number of fielded systems relying on this very concept. Granted, if you just ask for the most recent transform, you'll still get a usable-if-slightly-inaccurate transform, so perhaps that doesn't change anything, except that the user now has to change code to request the most recent transform instead of the one at the correct time.

  2. This represents a pretty big change from every other ROS 1 release, and it's affecting more than one package. My obvious interest in robot_localization aside, it's also (according to @doisyg) affecting packages like Cartographer. I know that new releases are good times to make such changes, but if I had to guess, this change is going to lead to a fair number of downstream code changes. For example, I'll need to add a feature to r_l (which I should have added ages ago, to be fair) to lag the generation of the state estimate.

Also potentially out of order delivery makes this problematic as the latest one "wins".

This would be an ideal use for sequence numbers, but I know those were deprecated (and I understand why).

I like the idea of designing an extension to allow for mutable transforms. If that is something for which nobody has extra cycles, would you consider changing the behavior so that, at least in this instance, it only issues a ROS_WARN_ONCE or something?

@SteveMacenski
Member

SteveMacenski commented Jul 20, 2020

Also, there's a bit of an issue with assuming that we have knowledge of / control over when we're ready to publish a transform for a timestamp. If you have a constantly cycling thread updating a filter, you don't know, after computing the estimate for time t, whether the next measurement will advance the filter to a new timestamp t + 1 or refine the filter at t from a higher-latency data source. Waiting until t + 1 to publish t because of that uncertainty would inject latency throughout the entire system. I suppose it could be updated so that when the filter is rolled back to include the new information, we just don't update tf but that doesn't seem like the best outcome.

ROS_WARN_ONCE -> or my suggestion is to keep it as is but demote it to a debug log throttled to 1 minute or so, so that all the context is still given to a user debugging a problem.

@doisyg

doisyg commented Jul 20, 2020

ROS_WARN_ONCE -> or my suggestion is to keep it as is but demote it to a debug log throttled to 1 minute or so, so that all the context is still given to a user debugging a problem.

+1

@tfoote
Member

tfoote commented Aug 13, 2020

I suppose it could be updated so that when the filter is rolled back to include the new information, we just don't update tf but that doesn't seem like the best outcome.

Repeating my comment from #475 (comment)

For comparison, we've had a similar issue with extrapolation in the past. It's "easy" to mathematically allow extrapolation so you don't have to wait for all data to come in. But then it turns out that you amplify any errors in your system, and consequently all control algorithms just get worse performance. Nothing seemed wrong and we had much better latency, but after analyzing the performance with and without extrapolation, what seemed like extra power was actually worse in every use case we tested.

The system is designed to handle latency. In a distributed system we know we have latency. You can choose to publish the refined pose for t at t+1, or you can publish the original for t and then publish t+1, which includes the internally revised value from t. Your example here of revising one timestamp is relatively simple, but that's just the tip of what might be considered. Looking at the bigger picture, consider something like moving horizon estimation: you might be revising N timesteps in the past on each update. Now you could publish N timestamps back in history, not just 1. The estimate back N timesteps will be "more accurate", but what's the latency? And how can any algorithm rely on the history? For localization we have the solution of having your incremental updates come out (à la odometry), and then there's the correction which fixes any mistakes that the incremental, but unrevisable, odometry produced.

I'm open to adding throttling to the warning. But I want it to be loud. It should not be ignored. The insidious silent failures of the past are as bad or worse.

As I've said, we can consider creating another, better, mutable transform, analyze it for all the different use cases, and make sure that it's consistent across time, robust to data loss, etc.

@SteveMacenski
Member

SteveMacenski commented Aug 13, 2020

I think just warning once on the first occurrence is sufficiently loud, and it seems @ayrton04, @doisyg, and I like that option best as an immediate patch.

While I understand your point, a change this substantial should have been opened up for discussion; it looks like it was merged the same day (#459), and now there are 6 users here commenting/emoji-ing (there must be a verb for that now...) that this is a real problem for us. Even if you want this protection by default, there should be an option to turn it off. There are legitimate needs for not having it, as Tom points out.

You can choose to publish the refined pose for t at t+1 or you can publish the original for t and then publish t+1 which includes the internally revised value from t.

For instance, let's imagine you have a global positioning data source that gives you hyper-accurate corrections at irregular intervals, based on when it has visibility of a target over cellular in the forest. This data source is being fused into a program that is doing event-based fusion (new data comes in over a topic, triggers an update to TF). This filter has the positioning data coming in and also some odometry source at 1 Hz. At time 17, the filter updates from a new odometry message. At time 17.2, you get a message from your super-accurate infrastructure system that you crossed into line of sight momentarily, so you know exactly where you are, but you did so at time 17. Now you want to update your view of the world to account for this drift correction, not just in local state, but for controllers currently running. You rewind the filter, process that measurement, and now want to re-publish at time 17. That would fail. Then for the next 0.8 s your entire system has a degraded view of the world that could be prevented. Eventually things would be caught up once the next odometry signal comes in to trigger an update at time 18. However, perhaps this is a robot in a sensitive application and that 800 ms could result in some severe failure modes. Obviously this is an extreme example and not a very well designed system, but it's the first example that came to mind of a case where this could be especially helpful to have.

I agree that for most systems this should be fine over a longer time horizon (after reducing the logging); however, if you have very sensitive equipment, like that which ROS 2 is supposed to be able to support now, something like this could be a showstopper.

@AustinDeric

AustinDeric commented Sep 17, 2020

I am seeing this issue and it seems to be isolated to the Melodic-to-Noetic upgrade. Maybe the issue should be addressed in robot_localization and not geometry2. However, robot_localization is saying geometry2 should be updated: cra-ros-pkg/robot_localization#574. There seems to be an impasse, which I hope gets resolved soon.

@tfoote
Member

tfoote commented Sep 18, 2020

While I understand your point, a change this substantial should have been opened up for discussion; it looks like it was merged the same day (#459), and now there are 6 users here commenting/emoji-ing (there must be a verb for that now...) that this is a real problem for us. Even if you want this protection by default, there should be an option to turn it off. There are legitimate needs for not having it, as Tom points out.

This is not helpful. The specific PR you point to calls out in its description that it's just a rebase of the long-standing PR #426, and it was specifically targeted to land only in the new rosdistro release. And it was given a strong warning to make it clear to downstream developers/users that they should pay attention here and fix the broken usage.

For instance, let's imagine you have a global positioning data source that gives you hyper-accurate corrections at irregular intervals, based on when it has visibility of a target over cellular in the forest. This data source is being fused into a program that is doing event-based fusion (new data comes in over a topic, triggers an update to TF). This filter has the positioning data coming in and also some odometry source at 1 Hz. At time 17, the filter updates from a new odometry message. At time 17.2, you get a message from your super-accurate infrastructure system that you crossed into line of sight momentarily, so you know exactly where you are, but you did so at time 17. Now you want to update your view of the world to account for this drift correction, not just in local state, but for controllers currently running. You rewind the filter, process that measurement, and now want to re-publish at time 17. That would fail. Then for the next 0.8 s your entire system has a degraded view of the world that could be prevented. Eventually things would be caught up once the next odometry signal comes in to trigger an update at time 18. However, perhaps this is a robot in a sensitive application and that 800 ms could result in some severe failure modes. Obviously this is an extreme example and not a very well designed system, but it's the first example that came to mind of a case where this could be especially helpful to have.

This is a bit of a contrived example to hit precisely the use case you call out. Your example of just updating one point is very unlikely if you have a high-precision, high-latency system, yet the correction still arrives within one cycle of the odometry. Odometry is typically published much faster to allow for any of these issues. Similarly, following REP 105, odometry should never jump. And really, this is the standard setup that we use all the time to differentiate odometric updates from localization updates.

The more powerful use case is for something like loop closure for SLAM where the full localization history will want to be revised and I would be quite happy to brainstorm a design to support that. I've got some good ideas after doing some thought experiments. However, that's well out of the scope of this discussion.

The robot_state_publisher and other implementations should be changed to not try to overwrite previously published data. What they thought it was doing was not happening. Instead it was producing memory growth in all tf listeners on the system, as well as causing discontinuities in the data and potential race conditions across the distributed system. All of this silently passed and was actually likely degrading the experience of all users. This warning has shown the places where this misuse of the API was causing problems, and I'm fine with throttling the warning. But the warning is there for a reason: to let you know that it's not doing what you thought it was doing. It never did, and now it's warning about it.

As I've said, I'm OK with throttling the warning if the volume is very high. We definitely shouldn't warn only once, because this is an ongoing issue that is triggered by a remote node's activity, may be triggered by more than one node at different times during runtime, and, if you're debugging the behavior, the warning should be displayed at the same time the behavior is happening.

@SteveMacenski
Member

Hi, Tully and I got on the phone and we decided on the following compromise:

  • Throttled debug logs for this error (every 10 seconds?) so that, when looking through debug logs, you can clearly see these errors, but they are hidden from terminal spam for most users.
  • A one-time error with a link to this ticket on the first instance.
  • Work on adding a patch to R_L, if possible, to turn off this behavior. This may not be possible due to the architecture, and frankly neither Tom nor I have the time to re-architect R_L for this change.

Anyone have an issue with that compromise?

@ros-discourse

This issue has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/ros-2-tsc-meeting-agenda-2020-09-17/16464/1

@peci1
Contributor

peci1 commented Sep 20, 2020

I'll just add our point of view. We've got a mobile robot with 50 Hz odometry and 0.3 Hz SLAM. These pretty slow SLAM updates have forced us to really think things through all along. What we ended up with is a TF tree like this:

map ---- odom ---- base_link
              \___ map_fast

The structure is not quite ideal (map_fast would be better visualized as another parent of odom, but we need to keep the tree structure). map_fast is a frame created by taking the last known map->odom and connecting it with the last known odom->base_link, and it gets the timestamp of the faster odom->base_link TF. It is a kind of jumping odometry which, however, should still be more or less globally consistent.

This clearly allows nodes to select whether they are interested in precise transforms (by transforming into map), or whether they need something that's fast but may be slightly inaccurate for a short time (by transforming into map_fast).

I know this doesn't cover all possible use cases, but it seems to me that in our case it is a more "correct" approach than overwriting the history. And it actually forced us into thinking about which nodes need fast data and which need precise data. With overwritten history, we'd never know which we actually got...
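
A rough sketch of one way such a frame could be produced, based on my reading of the description above (the frame names come from the tree; the node name, 50 Hz rate, and error handling are assumptions): look up the latest map->odom, invert it, and rebroadcast it as odom->map_fast stamped with the latest odom->base_link time, skipping cycles where odometry has not advanced so this node does not itself emit repeated stamps.

    #include <ros/ros.h>
    #include <tf2_ros/buffer.h>
    #include <tf2_ros/transform_listener.h>
    #include <tf2_ros/transform_broadcaster.h>
    #include <tf2_geometry_msgs/tf2_geometry_msgs.h>
    #include <tf2/LinearMath/Transform.h>
    #include <tf2/exceptions.h>
    #include <geometry_msgs/TransformStamped.h>

    int main(int argc, char** argv)
    {
      ros::init(argc, argv, "map_fast_publisher");   // node name is an assumption
      ros::NodeHandle nh;

      tf2_ros::Buffer buffer;
      tf2_ros::TransformListener listener(buffer);
      tf2_ros::TransformBroadcaster broadcaster;

      ros::Time last_odom_stamp;
      ros::Rate rate(50.0);                          // roughly the odometry rate
      while (ros::ok())
      {
        try
        {
          // Latest (slow) global correction and latest (fast) odometry.
          geometry_msgs::TransformStamped map_to_odom =
              buffer.lookupTransform("map", "odom", ros::Time(0));
          geometry_msgs::TransformStamped odom_to_base =
              buffer.lookupTransform("odom", "base_link", ros::Time(0));

          // Only publish when odometry has advanced, so this node does not
          // itself produce repeated timestamps.
          if (odom_to_base.header.stamp != last_odom_stamp)
          {
            last_odom_stamp = odom_to_base.header.stamp;

            // map_fast coincides with map but is published as a child of odom,
            // keeping the tree structure: odom->map_fast = (map->odom)^-1.
            tf2::Transform t;
            tf2::fromMsg(map_to_odom.transform, t);

            geometry_msgs::TransformStamped odom_to_map_fast;
            odom_to_map_fast.header.stamp = odom_to_base.header.stamp;  // fast stamp
            odom_to_map_fast.header.frame_id = "odom";
            odom_to_map_fast.child_frame_id = "map_fast";
            odom_to_map_fast.transform = tf2::toMsg(t.inverse());
            broadcaster.sendTransform(odom_to_map_fast);
          }
        }
        catch (const tf2::TransformException& ex)
        {
          ROS_WARN_THROTTLE(10.0, "map_fast: %s", ex.what());
        }
        rate.sleep();
      }
      return 0;
    }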

@SteveMacenski
Member

That structure doesn't make a whole lot of sense to me; however, I don't want to digress into that on this issue.

Thanks for the datapoint from another user. Do you have any objections with our proposal though?

@peci1
Contributor

peci1 commented Sep 21, 2020

The proposal seems okay to me.

Maybe this issue points to some places in TF tutorials that could be improved to better explain what is and is not possible to do within the framework...

@ayrton04
Author

I think the proposal sounds really solid, thanks @tfoote and @SteveMacenski. Meanwhile, I have a PR that prevents r_l from re-publishing data with duplicate timestamps.
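
For reference, the guard itself can be tiny. A sketch of the general idea (not the actual r_l change; the function name is made up): remember the last broadcast stamp and skip the publish when it has not advanced, keeping the revised estimate internal.

    #include <ros/time.h>
    #include <tf2_ros/transform_broadcaster.h>
    #include <geometry_msgs/TransformStamped.h>

    // Sketch only: skip the broadcast when the stamp has not advanced, so the
    // listener-side TF_REPEATED_DATA error is never triggered.
    void publishIfNewStamp(tf2_ros::TransformBroadcaster& broadcaster,
                           const geometry_msgs::TransformStamped& tf)
    {
      static ros::Time last_published;   // starts at time 0
      if (tf.header.stamp <= last_published)
        return;                          // repeated (or older) stamp: keep it internal
      last_published = tf.header.stamp;
      broadcaster.sendTransform(tf);
    }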

@SteveMacenski
Member

I'm not precisely sure of the mechanics to make that happen. https://github.com/ros/geometry2/pull/459/files#diff-23f1780342d49d752e5fabba0fa7de1cR278 — it looks like that logError is shared by other error cases we should probably keep. Unless @tfoote you think we can change that to a CONSOLE_BRIDGE_logDebug?

@doisyg

doisyg commented Sep 22, 2020

Hi, Tully and I got on the phone and we decided on the following compromise:

* Throttled debug logs for this error (every 10 seconds?) so that, when looking through debug logs, you can clearly see these errors, but they are hidden from terminal spam for most users.

* A one-time error with a link to this ticket on the first instance.

* Work on adding a patch to R_L, if possible, to turn off this behavior. This may not be possible due to the architecture, and frankly neither Tom nor I have the time to re-architect R_L for this change.

Anyone have an issue with that compromise?

I think it is a good compromise. Throttling to 10 s is fine and will not flood the console anymore. However, if we want users to notice it, maybe leave it at ROS_INFO or ROS_WARN?

@machinekoder

I'm seeing the same error with a pristine robot_state_publisher after recompiling MoveIt. Pretty weird.

@machinekoder

Might this be the result of multiple TransformStamped entries in the transforms message with the same timestamp (but different frames)? If so, this seems like a bug to me.

@machinekoder

Okay, looks like the error is coming from my moveit branch and its internal use of tf2.

lucasw pushed a commit to lucasw/geometry2 that referenced this issue May 8, 2022
The TF_REPEATED_DATA error, while important, can be overly verbose,
sometimes flooding the terminal scrollback buffer such that no other
output can be found. Throttle the error to 10 seconds. Implement the
throttling directly since console bridge does not implement throttling.
Because ros::Time may not yet be initialized when calling insertData,
use the stamp from the message as "now".

For more information including the justification for the warning and
the justification for a 10 second throttle, read the discussion on
issue ros#467:

ros#467
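
A simplified sketch of the throttling approach described in that commit message (illustrative only, not the actual patch): track the stamp at which the error was last emitted and compare it against the incoming message stamp, since ros::Time may not be initialized inside insertData.

    #include <console_bridge/console.h>
    #include <ros/time.h>
    #include <string>

    // Sketch of message-stamp-based throttling; not the real patch.
    void logRepeatedDataThrottled(const ros::Time& msg_stamp, const std::string& error)
    {
      static ros::Time last_logged;              // stamp at which we last logged
      static const ros::Duration period(10.0);   // throttle window from the commit message

      // Use the message stamp as "now", since ros::Time::now() may not be usable here.
      if (last_logged.isZero() || msg_stamp < last_logged ||
          msg_stamp - last_logged > period)
      {
        last_logged = msg_stamp;
        CONSOLE_BRIDGE_logError("%s", error.c_str());
      }
    }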
@kvnptl

kvnptl commented Sep 6, 2022

grep

2> >(grep -v TF_REPEATED_DATA buffer_core)

This one hides the entire warning message. Tested on ROS Noetic.

@broughtong

grep

2> >(grep -v TF_REPEATED_DATA buffer_core)

This one hides the entire warning message. Tested on ROS Noetic.

I would like to point out that solutions like this aren't really great, as they mess with terminal formatting (e.g. tqdm progress bars). Although so do the original warnings being spammed to stderr.

@Gunz-katomaran

Is there any other way to resolve this problem?

@raabuchanan

grep

2> >(grep -v TF_REPEATED_DATA buffer_core)

This one hides the entire warning message. Tested on ROS Noetic.

Can you explain how to use this? Typing roslaunch my_package my_launch_file.launch 2> >(grep -v TF_REPEATED_DATA buffer_core) does not work

@souravsanyal06

I have the same question as @raabuchanan .

@kvnptl

kvnptl commented Feb 21, 2023

The usage is correct. Are you getting the same warning message as mentioned in the original issue?
If not, then modify the command to match the repeated warning message you are seeing.

Note: As mentioned above, be careful, it might mess up the terminal formatting.

@duda1202

Do you know how to ensure that it only suppresses TF_REPEATED_DATA? Whenever I use this command and there is some error in my node, it just crashes with no error messages.

@corot

corot commented Mar 1, 2023

My humble contribution: the warning easily floods the logs when running simulations at relatively low frequencies, e.g. good old Stage publishes /clock at 10 Hz, and it's not uncommon to configure Gazebo to a similarly low frequency on not-very-powerful machines.
So any node producing tf msgs at > 10 Hz will trigger those warnings, as ros::Time::now() will return the same time more than once.
We "solved" the issue by adding incrementing nanoseconds to equal tf msgs, but obviously warning once or throttling to a very low rate is a much better option.

All that said, the positive side is that we are now much more aware of how tf2 works. Thanks @tfoote for the great explanation, and for making it so clear why we have this warning.

@tfoote
Member

tfoote commented Mar 10, 2023

That's an interesting case. It's a completely correct warning that the data is being dropped. If time is only turning over at 10 Hz, what value is there in publishing /tf at a faster rate? If time hasn't changed, I would have assumed that the source data shouldn't have changed either, since the simulation hasn't updated yet. For example, I would expect timers to be waiting for time to change over to the next timestep.

Another moderately simple option is to change the logging level up to ERROR or higher to avoid the warnings for those nodes.

@corot

corot commented Mar 13, 2023

The thing is that, in an ideal world, real-robot and simulation code should be 100% the same. On real robots you don't need to worry about discrete time, so we shouldn't have to in simulation either.
But this warning forces us to do so, e.g. by not publishing new data until the clock ticks, as @tfoote proposes.

@tfoote
Member

tfoote commented Mar 13, 2023

Yes, that's why I'm asking why or how something is turning over and publishing faster than time is updating. If you have a timer, it can only get triggered once per clock tick. If anything sleeps, it should not return until the clock ticks over again. The joints and anything physical don't change until the simulator updates. So why is something publishing multiple times before the clock ticks?

When you slow down / make time jump in large discrete steps like 1/10th of a second, it's expected that some things also lose precision. The general contract with sim time and discrete time is that things don't keep running. This is what allows you to pause the simulation: when you pause the simulation, everything should stop. And until things tick over, there generally shouldn't be new events being triggered. The system can be thought of as ticking over once per clock tick and then pausing until the next clock tick, because nodes don't know when/if the next tick will come.

@raabuchanan

The only thing that has ever worked for me is this:
https://answers.ros.org/question/381529/suppress-tf_repeated_data-warnings/

lucasw added a commit to lucasw/geometry2 that referenced this issue May 13, 2023
The TF_REPEATED_DATA error, while important, can be overly verbose,
sometimes flooding the terminal scrollback buffer such that no other
output can be found. Throttle the error to 10 seconds. Implement the
throttling directly since console bridge does not implement throttling.
Because ros::Time may not yet be initialized when calling insertData,
use the stamp from the message as "now".

For more information including the justification for the warning and
the justification for a 10 second throttle, read the discussion on
issue ros#467:

ros#467