Port search removal #319

edwardalee · 2023-12-17T19:31:28Z

This PR removes the port search done by the RTI if it cannot bind to the specified or default port. This never worked and resulted in federates taking a long time to fail as they searched over possible ports. There is a companion PR in lingua-franca.

This PR also fixes two issues that can cause the RTI to deadlock during shutdown, particularly if a task is kill with a SIGINT during execution. Both of these issues had been previously found by @erlingrj, but fixed only the federate side, not on the RTI side.

The first is that if a write to a socket fails because of a broken pipe, a SIGPIPE signal is issued, and, by default, the process is terminated. The reason for this is that a common use of sockets is to pipe one process's output to another, e.g. foo | bar, and if the second process crashes, you want the first process to exit. However, in the RTI (and the federates), there is a mutex lock that is held when writing to an outgoing socket (to prevent multiple threads from writing simultaneously to the socket). So, what can happen is that the process acquires the mutex and calls write(), but then write() fails, and before it can be return, a registered termination function is called. That termination function tries to acquire the same mutex in order to close the sockets. The result is a deadlock.

The second problem is similar, but only occurs if tracing is turned on. When a process exits the termination function tries to acquire a lock to flush tracing output, but that lock may be held by a thread that has exited anomalously while in a critical section.

It is possible that either of these problems could cause CI tests to hang rather than time out. When they time out, the CI tools try to kill the process with a SIGINT, and hence could trigger one of the above deadlocks, thereby failing to kill the process. The CI job will then run for the full six hours before being cancelled.

This PR also simplifies what is done when an abnormal termination occurs, e.g. SIGINT. In particular, it avoids acquiring any mutex locks and does not bother freeing memory (the memory will be freed by the OS anyway). The reason for freeing memory on normal termination is so that valgrind or similar tools can be used to check for memory leaks, but checking for memory leaks in a program that gets terminated with Control-C is not reasonable.

lhstrh

There were clearly problems with the port search mechanism, though I'm not sure I fully understood what the problem was, and simplification is of course an improvement, but I do worry about the impact this my have on testing. If we have to wait minutes between tests for the port to free up, that won't serve us well. Should be print time stamps in the output to keep track of wait time, so we have a handle on this?

lhstrh · 2023-12-20T20:56:10Z

core/federated/RTI/main.c

        initialize_federate(fed_info, i);
        rti.base.scheduling_nodes[i] = (scheduling_node_t *) fed_info;
    }

    int socket_descriptor = start_rti_server(rti.user_specified_port);
    wait_for_federates(socket_descriptor);
+    normal_termination = true;
+    if (rti.base.tracing_enabled) {
+        // No need for a mutex lock because all threads have exited.


Is there a way to put this in an assertion. I'm weary of assumptions stated in comments that may become false or misleading when the code changes...

I don't know of any way to put this in an assertion.

edwardalee · 2023-12-20T22:02:05Z

There were clearly problems with the port search mechanism, though I'm not sure I fully understood what the problem was, and simplification is of course an improvement, but I do worry about the impact this my have on testing. If we have to wait minutes between tests for the port to free up, that won't serve us well. Should be print time stamps in the output to keep track of wait time, so we have a handle on this?

Actually, if the RTI fails to grab the default port, it should print a message resulting from this line:

        lf_print("RTI failed to get port %d. Will try again.", port);

I have not seen this message in any of the tests, so I don't think we have an issue here.

If this starts to become an issue, we can modify the CI scripts so that they alternate ports. The port can be specified as a command-line argument.

lhstrh · 2023-12-21T00:25:53Z

Subsumed by #323.

edwardalee added 2 commits December 17, 2023 10:46

Remove bogus port search

156592b

Point to lingua-franca port-search-removal

050e5f4

edwardalee added bugfix federated labels Dec 17, 2023

edwardalee and others added 12 commits December 17, 2023 12:11

Align RTI to other changes

89cae01

Comments only

947c9c8

Use random port for federate servers

ff7aa08

Use calloc instead of malloc to initialize the memory to 0.

0d77eb0

Ignore SIGPIPE signals to not shut down RTI upon write failure

4479138

Do not acquire trace lock in termination code

8f6c3f5

Documentation only

208661d

Remove all mutexes from termination code

fdbffa6

Comment only

b19f313

Distinguish normal from abnormal termination in RTI

f12da27

Reorganized termination to do less on abnormal termination

bcea8cc

Removed all tmp code and reviewed

8a5c296

edwardalee requested a review from lhstrh December 20, 2023 19:11

lhstrh approved these changes Dec 20, 2023

View reviewed changes

edwardalee mentioned this pull request Dec 21, 2023

Various bugfixes and cleanups in the support for federated programs #323

Merged

lhstrh closed this Dec 21, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Port search removal #319

Port search removal #319

edwardalee commented Dec 17, 2023 •

edited

Loading

lhstrh left a comment

lhstrh Dec 20, 2023

edwardalee Dec 20, 2023

edwardalee commented Dec 20, 2023

lhstrh commented Dec 21, 2023

Port search removal #319

Port search removal #319

Conversation

edwardalee commented Dec 17, 2023 • edited Loading

lhstrh left a comment

Choose a reason for hiding this comment

lhstrh Dec 20, 2023

Choose a reason for hiding this comment

edwardalee Dec 20, 2023

Choose a reason for hiding this comment

edwardalee commented Dec 20, 2023

lhstrh commented Dec 21, 2023

edwardalee commented Dec 17, 2023 •

edited

Loading