-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Port search removal #319
Port search removal #319
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There were clearly problems with the port search mechanism, though I'm not sure I fully understood what the problem was, and simplification is of course an improvement, but I do worry about the impact this my have on testing. If we have to wait minutes between tests for the port to free up, that won't serve us well. Should be print time stamps in the output to keep track of wait time, so we have a handle on this?
initialize_federate(fed_info, i); | ||
rti.base.scheduling_nodes[i] = (scheduling_node_t *) fed_info; | ||
} | ||
|
||
int socket_descriptor = start_rti_server(rti.user_specified_port); | ||
wait_for_federates(socket_descriptor); | ||
normal_termination = true; | ||
if (rti.base.tracing_enabled) { | ||
// No need for a mutex lock because all threads have exited. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to put this in an assertion. I'm weary of assumptions stated in comments that may become false or misleading when the code changes...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know of any way to put this in an assertion.
Actually, if the RTI fails to grab the default port, it should print a message resulting from this line:
I have not seen this message in any of the tests, so I don't think we have an issue here. If this starts to become an issue, we can modify the CI scripts so that they alternate ports. The port can be specified as a command-line argument. |
Subsumed by #323. |
This PR removes the port search done by the RTI if it cannot bind to the specified or default port. This never worked and resulted in federates taking a long time to fail as they searched over possible ports. There is a companion PR in lingua-franca.
This PR also fixes two issues that can cause the RTI to deadlock during shutdown, particularly if a task is kill with a SIGINT during execution. Both of these issues had been previously found by @erlingrj, but fixed only the federate side, not on the RTI side.
The first is that if a write to a socket fails because of a broken pipe, a SIGPIPE signal is issued, and, by default, the process is terminated. The reason for this is that a common use of sockets is to pipe one process's output to another, e.g.
foo | bar
, and if the second process crashes, you want the first process to exit. However, in the RTI (and the federates), there is a mutex lock that is held when writing to an outgoing socket (to prevent multiple threads from writing simultaneously to the socket). So, what can happen is that the process acquires the mutex and calls write(), but then write() fails, and before it can be return, a registered termination function is called. That termination function tries to acquire the same mutex in order to close the sockets. The result is a deadlock.The second problem is similar, but only occurs if tracing is turned on. When a process exits the termination function tries to acquire a lock to flush tracing output, but that lock may be held by a thread that has exited anomalously while in a critical section.
It is possible that either of these problems could cause CI tests to hang rather than time out. When they time out, the CI tools try to kill the process with a SIGINT, and hence could trigger one of the above deadlocks, thereby failing to kill the process. The CI job will then run for the full six hours before being cancelled.
This PR also simplifies what is done when an abnormal termination occurs, e.g. SIGINT. In particular, it avoids acquiring any mutex locks and does not bother freeing memory (the memory will be freed by the OS anyway). The reason for freeing memory on normal termination is so that valgrind or similar tools can be used to check for memory leaks, but checking for memory leaks in a program that gets terminated with Control-C is not reasonable.