Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port search removal #319

Closed
wants to merge 14 commits into from
Closed

Port search removal #319

wants to merge 14 commits into from

Conversation

edwardalee
Copy link
Contributor

@edwardalee edwardalee commented Dec 17, 2023

This PR removes the port search done by the RTI if it cannot bind to the specified or default port. This never worked and resulted in federates taking a long time to fail as they searched over possible ports. There is a companion PR in lingua-franca.

This PR also fixes two issues that can cause the RTI to deadlock during shutdown, particularly if a task is kill with a SIGINT during execution. Both of these issues had been previously found by @erlingrj, but fixed only the federate side, not on the RTI side.

The first is that if a write to a socket fails because of a broken pipe, a SIGPIPE signal is issued, and, by default, the process is terminated. The reason for this is that a common use of sockets is to pipe one process's output to another, e.g. foo | bar, and if the second process crashes, you want the first process to exit. However, in the RTI (and the federates), there is a mutex lock that is held when writing to an outgoing socket (to prevent multiple threads from writing simultaneously to the socket). So, what can happen is that the process acquires the mutex and calls write(), but then write() fails, and before it can be return, a registered termination function is called. That termination function tries to acquire the same mutex in order to close the sockets. The result is a deadlock.

The second problem is similar, but only occurs if tracing is turned on. When a process exits the termination function tries to acquire a lock to flush tracing output, but that lock may be held by a thread that has exited anomalously while in a critical section.

It is possible that either of these problems could cause CI tests to hang rather than time out. When they time out, the CI tools try to kill the process with a SIGINT, and hence could trigger one of the above deadlocks, thereby failing to kill the process. The CI job will then run for the full six hours before being cancelled.

This PR also simplifies what is done when an abnormal termination occurs, e.g. SIGINT. In particular, it avoids acquiring any mutex locks and does not bother freeing memory (the memory will be freed by the OS anyway). The reason for freeing memory on normal termination is so that valgrind or similar tools can be used to check for memory leaks, but checking for memory leaks in a program that gets terminated with Control-C is not reasonable.

@edwardalee edwardalee requested a review from lhstrh December 20, 2023 19:11
Copy link
Member

@lhstrh lhstrh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were clearly problems with the port search mechanism, though I'm not sure I fully understood what the problem was, and simplification is of course an improvement, but I do worry about the impact this my have on testing. If we have to wait minutes between tests for the port to free up, that won't serve us well. Should be print time stamps in the output to keep track of wait time, so we have a handle on this?

initialize_federate(fed_info, i);
rti.base.scheduling_nodes[i] = (scheduling_node_t *) fed_info;
}

int socket_descriptor = start_rti_server(rti.user_specified_port);
wait_for_federates(socket_descriptor);
normal_termination = true;
if (rti.base.tracing_enabled) {
// No need for a mutex lock because all threads have exited.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to put this in an assertion. I'm weary of assumptions stated in comments that may become false or misleading when the code changes...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know of any way to put this in an assertion.

@edwardalee
Copy link
Contributor Author

There were clearly problems with the port search mechanism, though I'm not sure I fully understood what the problem was, and simplification is of course an improvement, but I do worry about the impact this my have on testing. If we have to wait minutes between tests for the port to free up, that won't serve us well. Should be print time stamps in the output to keep track of wait time, so we have a handle on this?

Actually, if the RTI fails to grab the default port, it should print a message resulting from this line:

        lf_print("RTI failed to get port %d. Will try again.", port);

I have not seen this message in any of the tests, so I don't think we have an issue here.

If this starts to become an issue, we can modify the CI scripts so that they alternate ports. The port can be specified as a command-line argument.

@lhstrh
Copy link
Member

lhstrh commented Dec 21, 2023

Subsumed by #323.

@lhstrh lhstrh closed this Dec 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants