
fix(log) log error on ssl handshake failure #28

Merged
merged 2 commits on Sep 12, 2019

Conversation

onematchfox
Contributor

Ran into a bit of a weird issue with this library. My upstreams in Kong were periodically getting marked as unhealthy due to timeout issues, yet were never coming back to a healthy status. However, if I manually marked the upstream as healthy, everything would work fine. After some digging I noticed that, over time, the TCP timeout counter was slowly being incremented but never seemed to be reset. After some more digging (and the addition of this log line) I finally figured out that I had SSL handshake issues because I hadn't configured lua_ssl_trusted_certificate. I never noticed this because I had tcp_failures set to 0, and an SSL handshake failure is treated as a TCP failure rather than an HTTP/S failure. In other words, my upstreams were healthy to begin with, but they were never actually passing a healthcheck due to the SSL issues.

On that note, I do wonder whether this error (and some of the subsequent ones) really should be handled as TCP failures. I know that strictly they are, but given that a dedicated TCP check has already exited by this point in the code, perhaps these other errors should be reported in such a way that the actual healthcheck type (http/s) fails too? Happy to raise this as an issue for further discussion if you feel that would be justified.
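For reference, the log line in question lives in the active check's SSL handshake branch. Roughly, and paraphrasing the change rather than quoting the merged diff verbatim (the variable name ssl_err and the exact message wording here are illustrative only):

local session, ssl_err = sock:sslhandshake(nil, hostname,
  self.checks.active.https_verify_certificate)
if not session then
  -- surface the handshake error instead of failing silently
  self:log(ERR, "failed SSL handshake with '", hostname, " (", ip, ":", port, ")': ", ssl_err)
  sock:close()
  return self:report_tcp_failure(ip, port, hostname, "connect", "active")
end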

@onematchfox onematchfox closed this Sep 6, 2019
@onematchfox onematchfox reopened this Sep 6, 2019
@Tieske
Member

Tieske commented Sep 6, 2019

I think a TCP failure is fine. It does make me notice, however, the absence of tls as a protocol type (besides http, https and tcp).

@Tieske
Member

Tieske commented Sep 6, 2019

Also wondering about the log line: it will be quite verbose in the logs on every failure. That said, having the wrong certificate is probably a hard config error, and should result in some log noise...

@onematchfox
Contributor Author

onematchfox commented Sep 9, 2019

@Tieske, thanks for the comments, although I'm left wondering whether you're in agreement with them or not, so perhaps I need to try to explain my point of view in a bit more detail.

My, perhaps simplistic, view of health is as a binary status: an upstream is either healthy or not. Given this, a healthcheck can be succinctly defined in terms of what one considers healthy, with everything else being unhealthy. This library, to my surprise, does not follow this definition. For the simple tcp checks it can be done - I can easily configure it to say "I consider the target healthy if I can make a TCP connection to the service within a given timeframe". But for http and https you need to explicitly define both healthy and unhealthy, with the effect that this introduces a third status of "neither healthy nor unhealthy" whenever the check results in something the user has not explicitly defined.

This effect can be seen both in the requirement to explicitly state the success/failure HTTP statuses (where the third state is any status code not listed in either set), and in the fact that a TCP failure is not considered an HTTP/S failure. Under my, again simplistic(?), view of health, I am no longer (easily) able to define health as "the service returns HTTP 200 from an HTTPS request" and failure as not meeting that definition. Rather, I need to define health (as per above) and also explicitly define every possible scenario under which I consider it a failure: i.e. whether an HTTPS connection can be established at all (ensure tcp_failures is set to something > 0), plus the full set of potential failure HTTP statuses (500, 501, etc.). This goes against the principle of least astonishment, for me at least.

However, as mentioned, I guess my view could be somewhat simplistic and perhaps there are valid use cases for this third health status. Thus I chose to leave this discussion as a side point initially and rather focus on the fact that, at a bare minimum, I would expect to be able to reason from the logs as to what is going on with my healthchecks.

In my scenario, given my incorrect assumptions, I had configured the checks as follows:

type = "https"
healthy {
  successes = 2
  http_statuses = [200, 201, 300, 301, 302, 303, 304, 305, 306, 307, 308, 401, 403]
}
unhealthy {
  http_failures = 5
  http_statuses = [400, 404, 429, 500, 501, 502, 503, 504]
  timeouts = 3
}

And here is an extract from the logs:

2019/09/03 01:00:24 [debug] 51#0: *472198 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) checking unhealthy targets: nothing to do
2019/09/03 01:00:52 [debug] 51#0: *472543 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) checking healthy targets: #1
2019/09/03 01:00:52 [debug] 51#0: *472543 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) Checking target.upstream.com 127.0.0.1:443 (currently healthy)
...
2019/09/03 01:32:53 [warn] 51#0: *495916 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) unhealthy TIMEOUT increment (1/3) for 'target.upstream.com(127.0.0.1:443)', context: ngx.timer
2019/09/03 01:33:09 [debug] 51#0: *496139 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) checking unhealthy targets: nothing to do
2019/09/03 01:33:22 [debug] 51#0: *496289 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) checking healthy targets: #1
2019/09/03 01:33:22 [debug] 51#0: *496289 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) Checking target.upstream.com 127.0.0.1:443 (currently mostly_healthy)
...
2019/09/04 01:32:54 [debug] 51#0: *1564273 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) checking unhealthy targets: nothing to do
2019/09/04 01:32:55 [debug] 51#0: *1548547 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) checking healthy targets: #1
2019/09/04 01:32:55 [debug] 51#0: *1548547 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) Checking target.upstream.com 127.0.0.1:443 (currently mostly_healthy)
2019/09/04 01:32:56 [warn] 51#0: *1548547 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) unhealthy TIMEOUT increment (2/3) for 'target.upstream.com(127.0.0.1:443)', context: ngx.timer
2019/09/04 01:33:17 [debug] 51#0: *1548809 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) checking unhealthy targets: nothing to do
2019/09/04 01:33:25 [debug] 51#0: *1548909 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) checking healthy targets: #1
2019/09/04 01:33:25 [debug] 51#0: *1548909 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) Checking target.upstream.com 127.0.0.1:443 (currently mostly_healthy)
...
2019/09/06 04:59:20 [debug] 56#0: *970714 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) checking unhealthy targets: nothing to do
2019/09/06 04:59:32 [debug] 56#0: *970856 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) checking healthy targets: #1
2019/09/06 04:59:32 [debug] 56#0: *970856 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) Checking target.upstream.com 127.0.0.1:443 (currently mostly_healthy)
2019/09/06 04:59:33 [warn] 56#0: *970856 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) unhealthy TIMEOUT increment (3/3) for 'target.upstream.com(127.0.0.1:443)', context: ngx.timer
2019/09/06 04:59:33 [debug] 56#0: *970856 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) event: target status 'target.upstream.com(127.0.0.1:443)' from 'true' to 'false'
2019/09/06 04:59:35 [debug] 56#0: *970902 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) checking unhealthy targets: #1
2019/09/06 04:59:35 [debug] 56#0: *970902 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) Checking target.upstream.com 127.0.0.1:443 (currently unhealthy)
2019/09/06 05:00:02 [debug] 56#0: *971223 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) checking healthy targets: nothing to do

As you can see, this issue manifests itself over the course of several days. In the entire period there is no indication that a healthcheck is failing, and at the point at which the extract ends my upstream has been marked unhealthy and will never come back to a healthy status without manual intervention. It was only when I noticed that the TIMEOUT counter was incrementing over multiple days without being reset that I eventually started digging into this library to try to understand why. Incidentally, I had also wondered why I never saw logs indicating what HTTP status was being returned by the upstream, and when I noticed the log lines relating to this in the codebase, I realised the code must have been exiting at an earlier stage.

Now, I know that in my case this was a configuration issue which, yes, would make a lot of noise, but I hope you would agree that this is perfectly valid log noise. For another example that is not a configuration issue, consider that the same thing could present in a live environment in the case of a certificate failure (an expired certificate, say). Regardless of the underlying reason, I would definitely want to know that the healthcheck was failing, not succeeding. So yes, I would personally say that there should at least be some noise in the logs about this.

Some final notes.

  • Yes, now that I understand the subtleties of this library I have configured TCP failures to a non-zero value, so I will notice these things (see the sketch just after this list), but others may not.
  • The configuration I used at first (i.e. tcp_failures = 0) is the default example in the Kong documentation at https://docs.konghq.com/1.3.x/health-checks-circuit-breakers/. Albeit that example does not set timeout or interval either, so those checks are ultimately disabled there.
  • For an example of a less subtle distinction between TCP and HTTP failures, consider the following lines of code:
    bytes, err = sock:send(request)
    if not bytes then
      self:log(ERR, "failed to send http request to '", hostname, " (", ip, ":", port, ")': ", err)
      if err == "timeout" then
        sock:close() -- timeout errors do not close the socket.
        return self:report_timeout(ip, port, hostname, "active")
      end
      return self:report_tcp_failure(ip, port, hostname, "send", "active")
    end
    Here the error is logged as "failed to send http request to ..." but the error itself is reported as a TCP failure and is thus discarded if one has not explicitly configured monitoring of TCP failures. Yes, it is strictly a layer 4 error, but how much value does a user gain from being forced to treat tcp and http errors separately at this point? Surely the ability to send an HTTP request is implicit in the success of an HTTP/S healthcheck? Or, at the very least, perhaps one could consider that to be a sensible default?
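For completeness, here is roughly what setting a non-zero tcp_failures (the first bullet above) looks like when constructing a checker with this library directly. Field names follow the library's documented checks table; the numeric values are purely illustrative:

local healthcheck = require "resty.healthcheck"

local checker = healthcheck.new({
  name = "example",            -- illustrative name
  shm_name = "healthchecks",   -- a lua_shared_dict declared in nginx.conf
  checks = {
    active = {
      type = "https",
      http_path = "/",
      https_verify_certificate = true,
      healthy = {
        interval = 5,
        successes = 2,
      },
      unhealthy = {
        interval = 5,
        timeouts = 3,
        http_failures = 5,
        tcp_failures = 2,      -- non-zero, so handshake/connect failures are actually counted
      },
    },
  },
})

With tcp_failures = 0, as in my original configuration, those same failures are reported but never counted towards an unhealthy state.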

@onematchfox
Contributor Author

Also, I noticed this comment when browsing through the other open PRs: #22 (review). I'd be happy to adjust my commit message to comply with https://github.com/Kong/kong/blob/master/CONTRIBUTING.md#commit-message-format (?) although I'm unclear as to what type this commit would fall under.

@Tieske
Member

Tieske commented Sep 9, 2019

yes, please update the commit format; fix(log) <description> will do.

so a TCP failure is also an HTTP failure, if the check is HTTP(s)
and an HTTP(S) success is also a TCP success

wondering how to deal with the timeouts ??

@Tieske
Member

Tieske commented Sep 9, 2019

also can you add a section 1.0.x (unreleased) to the readme history section?

@onematchfox onematchfox changed the title Log error on SSL handshake failure - don't fail silently fix(log) log error on ssl handshake failure Sep 10, 2019
@onematchfox
Contributor Author

onematchfox commented Sep 10, 2019

so a TCP failure is also an HTTP failure, if the check is HTTP(s)

Yes

and an HTTP(S) success is also a TCP success

This is already the case. There is only one definition of success which results in all failure counters being reset.

wondering how to deal with the timeouts ??

My personal preference would be to have a single attribute, failures, rather than having to configure tcp_failures, http_failures and timeouts separately. The set of checks performed would be based on the type of healthcheck, and any failure would increment that single counter, with the log providing further information as to exactly what failed. The code that defines the set of checks required for a specific type of healthcheck already exists, e.g.

if self.checks.active.type == "tcp" then
  sock:close()
  return self:report_success(ip, port, hostname, "active")
end

and

if self.checks.active.type == "https" then
  local session
  session = sock:sslhandshake(nil, hostname,
    self.checks.active.https_verify_certificate)
  if not session then
    sock:close()
    return self:report_tcp_failure(ip, port, hostname, "connect", "active")
  end
end

But, given that this functionality exists, I expect there are folks out there who do appreciate the ability to treat these different types of failures individually. So, rather than a breaking change that replaces http_failures, timeouts and tcp_failures with a single field, I guess one could approach it in a backwards-compatible manner by introducing an aggregate/master failure counter that would always be incremented, either alongside the individual failure counters or by itself if those are unset (= 0)?
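To make that backwards-compatible option a little more concrete, here is a purely hypothetical sketch; none of these helper or counter names exist in the library, they are invented solely to illustrate the idea of an aggregate counter alongside the existing per-type ones:

-- Hypothetical: a single entry point for all active-check failures.
-- "incr_counter" and the "failures" counter are invented names.
local function report_any_failure(self, ip, port, hostname, failure_type, check)
  -- always bump the aggregate counter, so a check with tcp_failures = 0
  -- can still cross an overall "failures" threshold
  self:incr_counter("failures", ip, port, hostname, check)

  -- keep the existing per-type counters for users who rely on them
  if failure_type == "timeout" then
    self:incr_counter("timeouts", ip, port, hostname, check)
  elseif failure_type == "tcp" then
    self:incr_counter("tcp_failures", ip, port, hostname, check)
  else
    self:incr_counter("http_failures", ip, port, hostname, check)
  end
end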

@onematchfox
Contributor Author

yes, please update the commit format; fix(log) will do.
also can you add a section 1.0.x (unreleased) to the readme history section?

done and done

@Tieske Tieske merged commit 6adc1a8 into Kong:master Sep 12, 2019
@Tieske
Member

Tieske commented Sep 12, 2019

I'll open a new issue regarding the counters
