
fix(log) log error on ssl handshake failure #28

Merged
merged 2 commits on Sep 12, 2019

Conversation

onematchfox
Contributor

Ran into a bit of a weird issue with this library. My upstreams in Kong were periodically getting marked as unhealthy due to timeout issues, yet were never coming back to a healthy status. However, if I manually marked the upstream as healthy, everything would work fine. After some digging I noticed that, over time, the TCP timeout counter was slowly being incremented but never seemed to be reset. After some more digging (and the addition of this log line) I finally figured out that I had SSL handshake issues because I hadn't configured lua_ssl_trusted_certificate. I never noticed this because I had tcp_failures set to 0, and an SSL handshake failure is treated as a TCP failure rather than an HTTP/S failure. In other words, my upstreams were healthy to begin with, but they were never actually passing a healthcheck due to the SSL issues.

On that note, I do wonder whether this error (and some of the subsequent ones) really should be handled as TCP failures. I know that strictly they are, but given that a dedicated TCP check has already exited by this point in the code, perhaps these other errors should be reported in such a way that the actual healthcheck type (http/s) fails too? Happy to raise this as an issue for further discussion if you feel that would be justified.
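For reference, the log line in question lives in the active check's SSL handshake branch. Roughly, and paraphrasing the change rather than quoting the merged diff verbatim (the variable name ssl_err and the exact message wording here are illustrative only):

local session, ssl_err = sock:sslhandshake(nil, hostname,
  self.checks.active.https_verify_certificate)
if not session then
  -- surface the handshake error instead of failing silently
  self:log(ERR, "failed SSL handshake with '", hostname, " (", ip, ":", port, ")': ", ssl_err)
  sock:close()
  return self:report_tcp_failure(ip, port, hostname, "connect", "active")
end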

@onematchfox onematchfox closed this Sep 6, 2019
@onematchfox onematchfox reopened this Sep 6, 2019
@Tieske
Member

Tieske commented Sep 6, 2019

I think a TCP failure is fine. It does make me notice, however, the absence of tls as a protocol type (besides http, https and tcp).

@Tieske
Member

Tieske commented Sep 6, 2019

Also wondering about the log line: it will be quite verbose in the logs on every failure. That said, having the wrong certificate is probably a hard config error, and should result in some log noise...

@onematchfox
Contributor Author

onematchfox commented Sep 9, 2019

@Tieske, thanks for the comments, although I'm left wondering whether you're in agreement with them or not, so perhaps I need to try to explain my point of view in a bit more detail.

My, perhaps simplistic, view of health is as a binary status: an upstream is either healthy or not. Given this, a healthcheck can be succinctly defined in terms of what one considers healthy, with everything else being unhealthy. This library, to my surprise, does not follow this definition. For the simple tcp checks it can be done - I can easily configure it to say "I consider the target healthy if I can make a TCP connection to the service within a given timeframe". But for http and https you need to explicitly define both healthy and unhealthy, with the effect that this introduces a third status of "neither healthy nor unhealthy" whenever the check results in something the user has not explicitly defined.

This effect can be seen both in the requirement to explicitly state the success/failure HTTP statuses (where the third state is any status code not listed in either set), and in the fact that a TCP failure is not considered an HTTP/S failure. Under my, again simplistic(?), view of health, I am no longer (easily) able to define health as "the service returns HTTP 200 from an HTTPS request" and failure as not meeting that definition. Rather, I need to define health (as per above) and also explicitly define every possible scenario under which I consider it a failure: i.e. whether an HTTPS connection can be established at all (ensure tcp_failures is set to something > 0), plus the full set of potential failure HTTP statuses (500, 501, etc.). This goes against the principle of least astonishment, for me at least.

However, as mentioned, I guess my view could be somewhat simplistic and perhaps there are valid use cases for this third health status. Thus I chose to leave this discussion as a side point initially and rather focus on the fact that, at a bare minimum, I would expect to be able to reason from the logs as to what is going on with my healthchecks.

In my scenario, given my incorrect assumptions, I had configured the checks as follows:

type = "https"
healthy {
  successes = 2
  http_statuses = [200, 201, 300, 301, 302, 303, 304, 305, 306, 307, 308, 401, 403]
}
unhealthy {
  http_failures = 5
  http_statuses = [400, 404, 429, 500, 501, 502, 503, 504]
  timeouts = 3
}

And here is an extract from the logs:

2019/09/03 01:00:24 [debug] 51#0: *472198 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) checking unhealthy targets: nothing to do
2019/09/03 01:00:52 [debug] 51#0: *472543 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) checking healthy targets: #1
2019/09/03 01:00:52 [debug] 51#0: *472543 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) Checking target.upstream.com 127.0.0.1:443 (currently healthy)
...
2019/09/03 01:32:53 [warn] 51#0: *495916 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) unhealthy TIMEOUT increment (1/3) for 'target.upstream.com(127.0.0.1:443)', context: ngx.timer
2019/09/03 01:33:09 [debug] 51#0: *496139 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) checking unhealthy targets: nothing to do
2019/09/03 01:33:22 [debug] 51#0: *496289 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) checking healthy targets: #1
2019/09/03 01:33:22 [debug] 51#0: *496289 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) Checking target.upstream.com 127.0.0.1:443 (currently mostly_healthy)
...
2019/09/04 01:32:54 [debug] 51#0: *1564273 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) checking unhealthy targets: nothing to do
2019/09/04 01:32:55 [debug] 51#0: *1548547 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) checking healthy targets: #1
2019/09/04 01:32:55 [debug] 51#0: *1548547 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) Checking target.upstream.com 127.0.0.1:443 (currently mostly_healthy)
2019/09/04 01:32:56 [warn] 51#0: *1548547 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) unhealthy TIMEOUT increment (2/3) for 'target.upstream.com(127.0.0.1:443)', context: ngx.timer
2019/09/04 01:33:17 [debug] 51#0: *1548809 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) checking unhealthy targets: nothing to do
2019/09/04 01:33:25 [debug] 51#0: *1548909 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) checking healthy targets: #1
2019/09/04 01:33:25 [debug] 51#0: *1548909 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) Checking target.upstream.com 127.0.0.1:443 (currently mostly_healthy)
...
2019/09/06 04:59:20 [debug] 56#0: *970714 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) checking unhealthy targets: nothing to do
2019/09/06 04:59:32 [debug] 56#0: *970856 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) checking healthy targets: #1
2019/09/06 04:59:32 [debug] 56#0: *970856 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) Checking target.upstream.com 127.0.0.1:443 (currently mostly_healthy)
2019/09/06 04:59:33 [warn] 56#0: *970856 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) unhealthy TIMEOUT increment (3/3) for 'target.upstream.com(127.0.0.1:443)', context: ngx.timer
2019/09/06 04:59:33 [debug] 56#0: *970856 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) event: target status 'target.upstream.com(127.0.0.1:443)' from 'true' to 'false'
2019/09/06 04:59:35 [debug] 56#0: *970902 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) checking unhealthy targets: #1
2019/09/06 04:59:35 [debug] 56#0: *970902 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) Checking target.upstream.com 127.0.0.1:443 (currently unhealthy)
2019/09/06 05:00:02 [debug] 56#0: *971223 [lua] healthcheck.lua:1014: log(): [healthcheck] (ta) checking healthy targets: nothing to do

As you can see, this issue manifests itself over the course of several days. In the entire period there is no indication that a healthcheck is failing, and at the point at which the extract ends my upstream has been marked unhealthy and will never come back to a healthy status without manual intervention. It was only when I noticed that the TIMEOUT counter was incrementing over multiple days without being reset that I eventually started digging into this library to try to understand why. Incidentally, I had also wondered why I never saw logs indicating what HTTP status was being returned by the upstream, and when I noticed the log lines relating to this in the codebase, I realised the code must have been exiting at an earlier stage.

Now, I know that in my case this was a configuration issue which, yes, would make a lot of noise, but I hope you would agree that this is perfectly valid log noise. For another example that is not a configuration issue, consider that the same thing could present in a live environment in the case of a certificate failure (an expired certificate, say). Regardless of the underlying reason, I would definitely want to know that the healthcheck was failing, not succeeding. So yes, I would personally say that there should at least be some noise in the logs about this.

Some final notes.

  • Yes, now that I understand the subtleties of this library I have configured TCP failures to a non-zero value, so I will notice these things (see the sketch just after this list), but others may not.
  • The configuration I used at first (i.e. tcp_failures = 0) is the default example in the Kong documentation at https://docs.konghq.com/1.3.x/health-checks-circuit-breakers/. Albeit that example does not set timeout or interval either, so those checks are ultimately disabled there.
  • For an example of a less subtle distinction between TCP and HTTP failures, consider the following lines of code:
    bytes, err = sock:send(request)
    if not bytes then
      self:log(ERR, "failed to send http request to '", hostname, " (", ip, ":", port, ")': ", err)
      if err == "timeout" then
        sock:close() -- timeout errors do not close the socket.
        return self:report_timeout(ip, port, hostname, "active")
      end
      return self:report_tcp_failure(ip, port, hostname, "send", "active")
    end
    Here the error is logged as "failed to send http request to ..." but the error itself is reported as a TCP failure and is thus discarded if one has not explicitly configured monitoring of TCP failures. Yes, it is strictly a layer 4 error, but how much value does a user gain from being forced to treat tcp and http errors separately at this point? Surely the ability to send an HTTP request is implicit in the success of an HTTP/S healthcheck? Or, at the very least, perhaps one could consider that to be a sensible default?
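For completeness, here is roughly what setting a non-zero tcp_failures (the first bullet above) looks like when constructing a checker with this library directly. Field names follow the library's documented checks table; the numeric values are purely illustrative:

local healthcheck = require "resty.healthcheck"

local checker = healthcheck.new({
  name = "example",            -- illustrative name
  shm_name = "healthchecks",   -- a lua_shared_dict declared in nginx.conf
  checks = {
    active = {
      type = "https",
      http_path = "/",
      https_verify_certificate = true,
      healthy = {
        interval = 5,
        successes = 2,
      },
      unhealthy = {
        interval = 5,
        timeouts = 3,
        http_failures = 5,
        tcp_failures = 2,      -- non-zero, so handshake/connect failures are actually counted
      },
    },
  },
})

With tcp_failures = 0, as in my original configuration, those same failures are reported but never counted towards an unhealthy state.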

@onematchfox
Contributor Author

Also, I noticed this comment when browsing through the other open PRs: #22 (review). I'd be happy to adjust my commit message to comply with https://github.com/Kong/kong/blob/master/CONTRIBUTING.md#commit-message-format (?) although I'm unclear as to what type this commit would fall under.

@Tieske
Member

Tieske commented Sep 9, 2019

yes, please update the commit format; fix(log) <description> will do.

so a TCP failure is also an HTTP failure, if the check is HTTP(s)
and an HTTP(S) success is also a TCP success

wondering how to deal with the timeouts ??

@Tieske
Member

Tieske commented Sep 9, 2019

also can you add a section 1.0.x (unreleased) to the readme history section?

@onematchfox onematchfox changed the title Log error on SSL handshake failure - don't fail silently fix(log) log error on ssl handshake failure Sep 10, 2019
@onematchfox
Contributor Author

onematchfox commented Sep 10, 2019

so a TCP failure is also an HTTP failure, if the check is HTTP(s)

Yes

and an HTTP(S) success is also a TCP success

This is already the case. There is only one definition of success which results in all failure counters being reset.

wondering how to deal with the timeouts ??

My personal preference would be to have a single attribute, failures, rather than having to configure tcp_failures, http_failures and timeouts separately. The set of checks performed would be based on the type of healthcheck, and any failure would increment that single counter, with the log providing further information as to exactly what failed. The code that defines the set of checks required for a specific type of healthcheck already exists, e.g.

if self.checks.active.type == "tcp" then
  sock:close()
  return self:report_success(ip, port, hostname, "active")
end

and

if self.checks.active.type == "https" then
  local session
  session = sock:sslhandshake(nil, hostname,
    self.checks.active.https_verify_certificate)
  if not session then
    sock:close()
    return self:report_tcp_failure(ip, port, hostname, "connect", "active")
  end
end

But, given that this functionality exists, I expect there are folks out there who do appreciate the ability to treat these different types of failures individually. So, rather than a breaking change that replaces http_failures, timeouts and tcp_failures with a single field, I guess one could approach it in a backwards-compatible manner by introducing an aggregate/master failure counter that would always be incremented, either alongside the individual failure counters or by itself if those are unset (= 0)?
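To make that backwards-compatible option a little more concrete, here is a purely hypothetical sketch; none of these helper or counter names exist in the library, they are invented solely to illustrate the idea of an aggregate counter alongside the existing per-type ones:

-- Hypothetical: a single entry point for all active-check failures.
-- "incr_counter" and the "failures" counter are invented names.
local function report_any_failure(self, ip, port, hostname, failure_type, check)
  -- always bump the aggregate counter, so a check with tcp_failures = 0
  -- can still cross an overall "failures" threshold
  self:incr_counter("failures", ip, port, hostname, check)

  -- keep the existing per-type counters for users who rely on them
  if failure_type == "timeout" then
    self:incr_counter("timeouts", ip, port, hostname, check)
  elseif failure_type == "tcp" then
    self:incr_counter("tcp_failures", ip, port, hostname, check)
  else
    self:incr_counter("http_failures", ip, port, hostname, check)
  end
end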

@onematchfox
Contributor Author

yes, please update the commit format; fix(log) will do.
also can you add a section 1.0.x (unreleased) to the readme history section?

done and done

@Tieske Tieske merged commit 6adc1a8 into Kong:master Sep 12, 2019
@Tieske
Member

Tieske commented Sep 12, 2019

I'll open a new issue regarding the counters
