-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix for STP violation #257
Conversation
e4df640
to
6765099
Compare
The communication protocol remains undocumented AFAIK, but my working assumption has been that if a federate receives a PTAG, then that means that all messages that are expected over delayed connections have arrived. Therefore, delayed connections are not guarded from STP violations by the MLAA; this property is acceptable because they impose no deadlock risk and in some cases (startup) it is necessary to avoid deadlocks, but it requires some special care when sending a PTAG.
6765099
to
443ef34
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
The timeout might just be flakiness 🙄 but the real reason why this cannot be merged yet is that it is not clear whether the problem has been fixed. I no longer reproduce it locally, but @jackykwok2024 can reproduce it with the latest RTI checked out. It's hard for me to reproduce the problem locally (either in master or in my branch) unless I comment out the sleeps, so I had to comment out the sleeps in order to determine if my branch had fixed anything. In master, with the sleeps commented out, I was able to pretty consistently reproduce the problem within (say) 200 rounds, but in my branch I was not able to reproduce after running for more than 200 rounds a few different times. This seems like it could be dependent on the timing behavior imposed by the hardware/operating system, so I tried to reproduce it on Wessel. I couldn't reproduce it, not even on master, although I did get this in
In short I'm not sure how to proceed because although the changes in this branch fixes the problem on my machine, I'm no longer able to reproduce the behavior that Jacky is observing. |
Hi Peter, thank you so much for the effort you've dedicated to addressing this bug. I've set up an AWS EC2 Instance and checked out the most recent RTI (with the updates for zero-delay connection). This branch seem to have addressed the issue, as I'm unable to reproduce it on the AWS instance. However, the problem still remains on my laptop. It does seem like the issue is tied to the timing behavior of the hardware/OS. I'll further explore this using my machine… |
570ce7a
to
df6fb6f
Compare
OK, I expect the test to pass now. Even though the exact error reported might not have been reproduced, I think we should merge this because it does demonstrably fix an actual bug. These error reports from @jackykwok2024 have been really helpful. It would be reasonable to re-raise the original issue if someone who can reproduce it can produce a file with full debug logs from both the RTI and federates, and ideally also a trace from |
This fixes yet another bug reported by Jacky.
The commit message is reproduced here for convenience:
Regarding testing: Although Jacky has provided an example program for reproducing the issue, it is not obvious to me whether we should turn it into a test since the failure is not reproduced reliably without running the program for some seconds (and that is locally, on a laptop, with the
time.sleep
s removed). I do not think that our current method of running tests in GH Actions can be treated as an effective or efficient way to detect races of this kind.The same could be said of the other STP violation that Jacky reported.
NOTE: Previously an irrelevant change was included in this PR that actually introduced a bug. It has since been deleted (I rebased and force-pushed).