-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support to message lost event #192
Conversation
Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
@eboasson could you take a look to the implementation? I'm trying to add a reproducible test case in Do you know if I have to do something else to get the sample lost/rejected status working (I always read a 0 count from both)? |
@ivanpauno there’s nothing wrong with your implementation: I haven’t gotten around to bubbling up the event to the user yet, that’s all. Part of the reason I haven’t done that is procrastination, because I don’t want to do something I believe to be wrong — a personal judgement, I know what the spec says, I just disagree with it. The reason I think the spec is utterly wrong is that it conflates the case that really matters (those where the contract was inadvertently broken) with the ones that are in principle expected events:
Currently only (4) is implemented in Cyclone (it was already there when I got my hands on it). For the fact that (1) isn’t there yet, I can only ask for forgiveness ... I probably should not make things so complicated anyway, one doesn’t have to monitor the events 🙂 Combining “lost” and “rejected” is probably not such a good idea: a rejected one really hasn’t been lost. It will arrive eventually, unless it gets dropped by the writer and then you eventually end up with a “sample lost” event. P.S. One thing to keep in mind is that there is no guarantee that you can observe every single event. So, e.g., the “reason” in the sample rejected state is merely that of the most recent event, it doesn’t say anything about what came before that. |
I see, I misunderstood some DDS documentation and thought that case was covered.
Thanks for clarifying that. |
I see how the conflation of expected events (like overwriting history) and broken contracts is not great, but I do agree with @ivanpauno that having an indicator if samples are lost would be useful for diagnosing communication issues in any case. |
@ivanpauno the DDS family is a typical family, with a few feuds and long-running misunderstandings, but I don't think you really misunderstood anything. In this case, all the DDS specification (that is, DCPS 1.4) states about the "sample lost" status (and associated events, listeners, blah blah) is literally: (page 2-130)
(page 2-132)
Unfortunately, the DDSI-RTPS spec (which came a fair bit later) then barges in and tries to force an interpretation of it, so that you can have an implementation that arguably meets the DCPS specification and yet doesn’t meet the DDSI-RTPS one. (Weird, for a spec that purports to be no more than an interoperable protocol for the DCPS spec …) That the DDSI-RTPS spec got it wrong is clear, because one of the things currently under revision is precisely this matter. Of course, there is also the nice thing of having specs that are imprecise, incorrect or contradictory: one gains a lot of freedom 🙂 So what does "lost" mean? What does "never received" mean? By the reader history cache, or, as I forgot yesterday, by the application? Which means there are a few more cases that could be considered sample loss:
My case (1) above is a breach of contract and without a doubt worthy of a notification. Still, there is a complication: event ordering. The only guarantees I am (as of now) prepared to give are that when samples have been lost and it cannot be proved that they were never meant to be received:
Specifically, that means I'm not prepared to guarantee:
In short, you will know very little. But if you were to diligently read the "sample lost" status after each read/take and it says nothing has been lost, you can be sure nothing was lost up to and including the samples you just read that was also ordered with respect to ones you just read. That covers the clear-cut case reasonably well, I think 🙂 I agree with @jacobperron that the various other cases can be interesting information when monitoring system performance. My view is that none of them should be counted as lost samples (perhaps excepting rejecting a best-effort sample because of resource limits), but I do think it would be good to have an interface for monitoring them, and it would probably be wise to do this with a separate monitoring one. For volatile data, Cyclone can, in principle, distinguish between cases (2) and (2a) and case (1) with a keep-last writer — it’ll require some extension to the network protocol and I need to check whether the changes currently being done to DDSI-RTPS will cover it. How that works out if the data is transient-local needs some more thought. Do note that, in DDS semantics cases (2) and (2a) are the same, but in system performance monitoring, they are different … For best-effort, it can be packet loss (my case (3)), but really also overwriting in the history or rejecting a sample because of the resource limits. @ivanpauno, “rejected” for best-effort means something totally different from “rejected” for reliable. For best-effort case you should combine them. Sorry, my bad. Data that has been filtered out by definition isn’t lost, so (4) should never be counted as lost. From a monitoring perspective, it would make sense to count the number of samples that were filtered out by the reader as it might provide interesting information on the effectiveness of filtering at the writer. Case (5) and (5a) are similar, and it seems to me they should be treated the same as (2) and (2a): cases (2) and (5) are where the writer can’t push its data out quickly enough, and (2a) and (5a) where the reader can’t keep up with the writer. Cases (6a) and (6b) I think are best ignored: one should never set autopurge unless one means it. Case (7) is interesting, but it is either expected or flawed design/implementation of the data model or the writer. -- If one goes down this path, it is going to be tough to test it using the ROS2 interfaces. Cyclone has an interface to make it deaf or mute, which I use to test for handling of those "asymmetric" disconnections that my case (1) builds on. But that is not a standard feature and I don't think you'd want to expose it in ROS2! |
Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
This sounds like combining lost/rejected statuses isn't a good idea, so I reverted that in 30909a4. For the moment, I'm planning to only add to |
Thanks for your super detailed answer @eboasson !!! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM besides one comment.
Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com> Co-authored-by: Jacob Perron <jacob@openrobotics.org>
Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com> Co-authored-by: Jacob Perron <jacob@openrobotics.org>
Implements ros2/rmw#226.
See ros2/rmw#232 for related changes in RMW.