Fix for STP violation #257

petervdonovan · 2023-08-08T18:32:59Z

This fixes yet another bug reported by Jacky.

The commit message is reproduced here for convenience:

The communication protocol remains undocumented AFAIK, but my working
assumption has been that if a federate receives a PTAG, then that means
that all messages that are expected over delayed connections have
arrived. Therefore, delayed connections are not guarded from STP
violations by the MLAA; this property is acceptable because delayed connections impose
no deadlock risk and in some cases (startup) this property is necessary to avoid
deadlocks, but it requires some special care when sending a PTAG.

Regarding testing: Although Jacky has provided an example program for reproducing the issue, it is not obvious to me whether we should turn it into a test since the failure is not reproduced reliably without running the program for some seconds (and that is locally, on a laptop, with the time.sleeps removed). I do not think that our current method of running tests in GH Actions can be treated as an effective or efficient way to detect races of this kind.

The same could be said of the other STP violation that Jacky reported.

NOTE: Previously an irrelevant change was included in this PR that actually introduced a bug. It has since been deleted (I rebased and force-pushed).

The communication protocol remains undocumented AFAIK, but my working assumption has been that if a federate receives a PTAG, then that means that all messages that are expected over delayed connections have arrived. Therefore, delayed connections are not guarded from STP violations by the MLAA; this property is acceptable because they impose no deadlock risk and in some cases (startup) it is necessary to avoid deadlocks, but it requires some special care when sending a PTAG.

edwardalee

LGTM!

petervdonovan · 2023-08-11T05:12:18Z

The timeout might just be flakiness 🙄 but the real reason why this cannot be merged yet is that it is not clear whether the problem has been fixed. I no longer reproduce it locally, but @jackykwok2024 can reproduce it with the latest RTI checked out.

It's hard for me to reproduce the problem locally (either in master or in my branch) unless I comment out the sleeps, so I had to comment out the sleeps in order to determine if my branch had fixed anything. In master, with the sleeps commented out, I was able to pretty consistently reproduce the problem within (say) 200 rounds, but in my branch I was not able to reproduce after running for more than 200 rounds a few different times.

This seems like it could be dependent on the timing behavior imposed by the hardware/operating system, so I tried to reproduce it on Wessel. I couldn't reproduce it, not even on master, although I did get this in master, which looks like a different bug:

Server 10 50
Server 10 50
Federation ID for executable /home/peter/lingua-franca/test/Python/fed-gen/fedfail/src-gen/federate__client1/federate__client1.py: a5969099fea749c6961bf6d0202e9adefcf1bada4f562de5
---- Start execution at time Thu Aug 10 22:07:43 2023
---- plus 349306463 nanoseconds
Federate 0: Successfully connected to RTI.
Federate 0: Connected to RTI at localhost:15045.
Federate 0: Starting timestamp is: 1691730464369868596.
Federate 0: Environment 0: ---- Spawning 5 workers.
_pickle.UnpicklingError: invalid load key, '\x00'.
Federate 0: FATAL ERROR: Could not deserialize deserialized_message.
WARNING: RTI: Destination federate 0 is no longer connected. Dropping message.
WARNING: RTI: Destination federate 0 is no longer connected. Dropping message.

In short I'm not sure how to proceed because although the changes in this branch fixes the problem on my machine, I'm no longer able to reproduce the behavior that Jacky is observing.

jackyk02 · 2023-08-11T18:06:09Z

It's hard for me to reproduce the problem locally (either in master or in my branch) unless I comment out the sleeps, so I had to comment out the sleeps in order to determine if my branch had fixed anything. In master, with the sleeps commented out, I was able to pretty consistently reproduce the problem within (say) 200 rounds, but in my branch I was not able to reproduce after running for more than 200 rounds a few different times.

In short I'm not sure how to proceed because although the changes in this branch fixes the problem on my machine, I'm no longer able to reproduce the behavior that Jacky is observing.

Hi Peter, thank you so much for the effort you've dedicated to addressing this bug. I've set up an AWS EC2 Instance and checked out the most recent RTI (with the updates for zero-delay connection). This branch seem to have addressed the issue, as I'm unable to reproduce it on the AWS instance. However, the problem still remains on my laptop. It does seem like the issue is tied to the timing behavior of the hardware/OS. I'll further explore this using my machine…

petervdonovan · 2023-08-14T18:58:55Z

OK, I expect the test to pass now.

Even though the exact error reported might not have been reproduced, I think we should merge this because it does demonstrably fix an actual bug. These error reports from @jackykwok2024 have been really helpful. It would be reasonable to re-raise the original issue if someone who can reproduce it can produce a file with full debug logs from both the RTI and federates, and ideally also a trace from fedsd.

petervdonovan force-pushed the another-stp-violation-fix branch from e4df640 to 6765099 Compare August 8, 2023 18:42

petervdonovan force-pushed the another-stp-violation-fix branch from 6765099 to 443ef34 Compare August 8, 2023 19:04

petervdonovan requested review from lhstrh and edwardalee August 8, 2023 19:08

edwardalee approved these changes Aug 10, 2023

View reviewed changes

lhstrh approved these changes Aug 14, 2023

View reviewed changes

Update LF ref again.

df6fb6f

petervdonovan force-pushed the another-stp-violation-fix branch from 570ce7a to df6fb6f Compare August 14, 2023 18:54

lhstrh merged commit 809c454 into main Aug 14, 2023

lhstrh deleted the another-stp-violation-fix branch August 14, 2023 23:11

lhstrh added the bugfix label Aug 28, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix for STP violation #257

Fix for STP violation #257

petervdonovan commented Aug 8, 2023 •

edited

Loading

edwardalee left a comment

petervdonovan commented Aug 11, 2023

jackyk02 commented Aug 11, 2023 •

edited

Loading

petervdonovan commented Aug 14, 2023

Fix for STP violation #257

Fix for STP violation #257

Conversation

petervdonovan commented Aug 8, 2023 • edited Loading

edwardalee left a comment

Choose a reason for hiding this comment

petervdonovan commented Aug 11, 2023

jackyk02 commented Aug 11, 2023 • edited Loading

petervdonovan commented Aug 14, 2023

petervdonovan commented Aug 8, 2023 •

edited

Loading

jackyk02 commented Aug 11, 2023 •

edited

Loading