"Connection: close" header leads to unstable instance #6799
We are experiencing the same problem with the Icinga output plugin for Logstash. After some working API calls, the log shows "[2018-11-26 11:39:09 +0100] information/HttpServerConnection: Unable to disconnect Http client, I/O thread busy". After this, all client TLS handshakes fail: "Error: Timeout was reached (10) during TLS handshake".
While debugging hanging connection issues on our icinga2 setup, my finding is that with "Connection: close" it makes a difference whether the client sends the headers and body in one go or not. If the client sends them in one socket operation, everything is fine, because they will be put into the same TCP packet and parsed by Icinga at the same time. With a freshly started Icinga and one hanging connection I did a quick gdb backtrace and got two suspicious threads with these backtraces:
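To make the timing difference concrete, here is a small self-contained sketch (assumption: a plain local TCP socket standing in for the Icinga API endpoint, not Icinga code) showing that when a client sends headers and body in two separate writes with a pause in between, the server's first read only sees the headers:

```python
import socket
import threading
import time

reads = []

def server(listener):
    conn, _ = listener.accept()
    # First read: whether the body is included depends on how the
    # client split its writes on the wire.
    reads.append(conn.recv(4096))
    reads.append(conn.recv(4096))
    conn.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]
t = threading.Thread(target=server, args=(listener,))
t.start()

header = b"POST /v1/actions HTTP/1.1\r\nConnection: close\r\nContent-Length: 2\r\n\r\n"
body = b"{}"

client = socket.create_connection(("127.0.0.1", port))
client.sendall(header)      # first socket operation: headers only
time.sleep(0.5)             # give the server time to read the partial request
client.sendall(body)        # second socket operation: the body arrives later
client.close()
t.join()
listener.close()

print([len(r) for r in reads])  # the first read lacks the body, the second has it
```

If the two `sendall` calls are replaced by a single write of `header + body`, both parts typically land in the same TCP segment and the server's first read gets the complete request, matching the "everything is fine" case described above.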
The first one is blocking to acquire m_DataHandlerMutex in icinga::HttpServerConnection::DataAvailableHandler(), which the second one has already acquired in icinga::HttpServerConnection::Disconnect() while trying to close the connection, looping to wake up a (the other?) thread. I used the 2.10.2-1.xenial package from packages.icinga.com. Without debug info I cannot be sure whether the threads are in fact blocking on the same m_DataHandlerMutex, but I'm pretty certain they are.
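The suspected interaction can be modeled in a few lines (assumption: a deliberately simplified threading sketch, not Icinga's actual C++ code): one thread holds the data-handler mutex inside the disconnect loop while the other thread blocks trying to take the same mutex.

```python
import threading
import time

data_handler_mutex = threading.Lock()  # stands in for m_DataHandlerMutex
blocked = []

def disconnect():
    # Models the second thread: takes the mutex and keeps it while
    # looping to wake up the other thread.
    with data_handler_mutex:
        time.sleep(1.0)

def data_available_handler():
    # Models the first thread: blocks here as long as disconnect()
    # holds the mutex.
    got_it = data_handler_mutex.acquire(timeout=0.2)
    blocked.append(not got_it)
    if got_it:
        data_handler_mutex.release()

t2 = threading.Thread(target=disconnect)
t2.start()
time.sleep(0.1)  # let disconnect() take the mutex first
t1 = threading.Thread(target=data_available_handler)
t1.start()
t1.join()
t2.join()
print(blocked)  # [True]: the handler never got the mutex while disconnecting
```

The timeout here is only so the sketch terminates; in the real backtraces the waiting thread has no timeout, which is why the connection hangs indefinitely.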
Reverting 13239c3 (which moved the disconnect path to icinga::HttpServerConnection::DataAvailableHandler) fixes our hanging connection issue.
Situation

In our case the system is stable as long as there are no timeouts (especially TLS handshake timeouts). But beginning with the first such message in the log files, our cluster starts to become unstable. The cluster, ido and icinga checks turn critical or unknown, but the icinga daemon is still running. This is an example of the debug log from our master2:

What are the results of this problem?

Deployment with the Director: On our master1 we configured the Director for automatic deployment. If a master is in such an unstable Icinga state, the daemon is not able to reload correctly. It seems that it gets forked, but it cannot stop the old child processes. This is an example output while icinga2 is reloading. It seems to us that Icinga crashes at the step "information/ExternalCommandListener: 'command' stopped.", because it is the last output and systemd still reports "reload" as the state:
Sometimes, if master1 is hanging, a few seconds later the icinga daemon on master2 gets the same problem.

State of the icinga2 daemon

Very often we find lines like this in the systemd log after a restart/reload:

Problems with acknowledgements and setting downtimes in icingaweb2

Acknowledging via icingaweb2 is not possible; there is a timeout error: The same happens when setting a downtime, but in contrast to the acknowledgement, the downtime is actually set.

Logstash plugin "logstash-output-icinga"

We also use the Logstash plugin "logstash-output-icinga". If Logstash sends something to Icinga via the API, the connection is reset by Icinga. These are possible return messages in the Logstash log: I also tested this with an old Icinga setup (2.8.4-1); there Icinga can handle the API request. Because of this problem I switched to the Logstash plugin "logstash-output-nagios", which uses the command pipe. With this everything is fine and our cluster stays stable. We also don't have TLS errors in the log files.

Summary

In the end it is the same behaviour we see when Nessus is scanning the server - see
Thus far, I'm not able to reproduce it with my previous analysis in #6514 (comment). If someone could provide a simple reproducer script, or the number of objects/IO required for these tests, that would be much appreciated. I'm still wondering whether the requests need to run fully in parallel, or whether this goes more in the direction of overloaded socket events.
What I came up with is this snippet that uses openssl s_client and sends the header and the body with a sleep in between. If you comment out the sleep, it works; with the sleep, it results in Icinga threads blocking. There's no parallelism needed, just one request with the sleep in between.
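A rough Python equivalent of that approach (assumption: this is a sketch of the described technique, not the original openssl s_client script; the host, port, path, and body are placeholders and the real API would also need authentication) sends the header, sleeps, then sends the body over TLS:

```python
import socket
import ssl
import time

def split_request():
    # Build a request and return it as two parts, so each part goes out
    # in its own socket write (and usually its own TCP packet).
    body = b'{ "type": "Service" }'
    header = (
        b"POST /v1/objects/services HTTP/1.1\r\n"
        b"Host: localhost\r\n"
        b"Accept: application/json\r\n"
        b"Connection: close\r\n"
        b"Content-Length: " + str(len(body)).encode() + b"\r\n\r\n"
    )
    return header, body

def send_split(host="localhost", port=5665, delay=1.0):
    header, body = split_request()
    # The Icinga API certificate is typically self-signed, so skip verification
    # for this test.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port)) as raw:
        with ctx.wrap_socket(raw) as tls:
            tls.sendall(header)
            time.sleep(delay)  # with delay=0 the request goes through fine
            tls.sendall(body)
            print(tls.recv(4096))
```

Calling `send_split()` against an affected Icinga instance should reproduce the blocked threads; with `delay=0` (or the two writes merged into one) the request completes normally.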
Thanks, that was the missing point in my brain.

Test Script

Script with
Current Behaviour
The doubled log line is misleading at best, so I've gone the extra mile and debugged this below. While reverting the suggested patch, there's a certain possibility that the "corked" functionality previously hid this deadlock.
Patch fixed
Debugging Session
Two threads call Disconnect():
Thread 12

This thread handles the main connection and calls
Thread 27

This thread is doing the socket event polling and gets woken up in TlsStream::OnEvent, which then results in

Actually, this thread is correctly closing the connection.
Why would the first thread call Disconnect()?
Digging deeper into the first thread, which is called just once via DataAvailableHandler():
Summary

Actually the request body is not parsed at this point, so @swegener is 100% correct with his analysis. Therefore closing the request here makes absolutely no sense, and I did not think about this when fixing it for 2.10.0. Older versions of Icinga likely did not suffer from

With the revert of the "corked" stuff in 2.10.2 and #6738, such state-machine-like connections with wait times in between work properly again.
Actually the `corked` functionality caused problems with not closing connections properly. Full Analysis: #6799 (comment) Full credits to @swegener :) fixes #6799
This is a continuation from #6514.
Using sequential curl requests with the "Connection: close" header seems to work, but firing up our API client, which uses threaded job handling, results in some errors again. If I don't send the "Connection: close" header, everything is fine (at least on 2.10.2; on some 2.8.x versions it was the exact opposite, which is why we changed the behaviour).
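The failing pattern is essentially many worker threads each sending a request with an explicit "Connection: close" header. A minimal sketch of that pattern (assumption: a stand-in local HTTP server instead of Icinga, so the example is self-contained; a healthy server answers every such request):

```python
from concurrent.futures import ThreadPoolExecutor
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
import http.client
import threading

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Length", "2")
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        pass  # keep the demo quiet

server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

def fetch(_):
    conn = http.client.HTTPConnection("127.0.0.1", port)
    # The explicit close header is what destabilized the threaded client
    # against Icinga; a well-behaved server must still answer every request.
    conn.request("GET", "/v1/status", headers={"Connection": "close"})
    resp = conn.getresponse()
    resp.read()
    return resp.status

with ThreadPoolExecutor(max_workers=8) as pool:
    statuses = list(pool.map(fetch, range(16)))

server.shutdown()
print(statuses.count(200))  # every request should succeed
```

Pointing `fetch` at an affected Icinga master instead of the local server is where some of the parallel requests start to fail, per the report above.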
Here is the debug log:
Sometimes there are two entries for a disconnect with the same source port:
Log from client side:
I see some strange behaviour there:
Unfortunately the client is not open source, but the only interaction it has with Icinga is pushing check results and getting the services of a host via the API in worker threads (written in Go using http.Client).
Environment: