Upstream continuously flip flops between HEALTHY and UNHEALTHY when DNS TTL = 0 #5477
Comments
this seems to be a snippet from the log. Can you provide the start of the log? I would expect to see some (or 1) of these log messages: https://github.com/Kong/lua-resty-dns-client/blob/master/src/resty/dns/balancer/base.lua#L432-L433
The log snippet above does include the "ttl=0 detected" message. See:
I've also attached a full log showing startup through the first 5 minutes. It appears the health check issue is repeating once per minute in this log. In our production environment, it seems to occur more frequently.
reducing that log file to
Which is once per minute. Which makes sense since
@chris-branch It should be the same in your production environment; can you check that? Please note that this is per nginx worker, so like I did above (selecting only
this is interesting:
There are two questions here:
Ad 2) flipping a record type means we consider it a completely new record. We remove that record and all its addresses (and also the associated healthchecks), and then add the new record (and its associated healthchecks). But why would anything be unhealthy here? Looking at the logs again:
It seems that each flip of record type is followed by an UNHEALTHY event and, right after it, a HEALTHY event. Which leads me to believe the entry being removed is marked unhealthy, and the new entry being added is marked healthy (due to the underlying problem, they just happen to be the same IP+port combination). So maybe that part is still ok. Thoughts on this @hishamhm @locao? That leaves problem 1 for now.
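For illustration, here is a minimal standalone sketch of the remove-then-add behaviour described above; every function and field name in it is hypothetical (it is not the balancer code), it only shows why a record-type flip would emit an UNHEALTHY event followed by a HEALTHY event for the same IP+port:

```lua
-- Minimal sketch of the remove-then-add behaviour (illustrative names only).

local function report(event, ip, port)
  print(string.format("%s %s:%d", event, ip, port))
end

-- removing a record drops all its addresses (and their healthchecks)
local function remove_record(record)
  for _, addr in ipairs(record.addresses) do
    report("UNHEALTHY (target removed)", addr.ip, addr.port)
  end
end

-- adding the "new" record registers its addresses (and healthchecks) again
local function add_record(record)
  for _, addr in ipairs(record.addresses) do
    report("HEALTHY (target added)", addr.ip, addr.port)
  end
end

-- a record-type flip (e.g. A -> injected SRV) is treated as a brand-new record,
-- even though it resolves to the same IP+port combination
local old = { rtype = "A",   addresses = { { ip = "10.0.0.1", port = 8080 } } }
local new = { rtype = "SRV", addresses = { { ip = "10.0.0.1", port = 8080 } } }
remove_record(old)
add_record(new)
```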
I think the problem was introduced here: https://github.com/Kong/lua-resty-dns-client/pull/56/files |
@chris-branch can you try and change this line (here: https://github.com/Kong/lua-resty-dns-client/blob/master/src/resty/dns/balancer/base.lua#L416):

```lua
if (newQuery[1] or empty).ttl == 0 and ((oldQuery[1] or empty).ttl or 0) == 0 then
```

into

```lua
if (newQuery[1] or empty).ttl == 0 and (
      ((oldQuery[1] or empty).ttl or 0) == 0 or
      oldQuery[1].__ttl0Flag ) then
```

and see if the problem goes away?
@Tieske - I modified the if statement in base.lua:
However, that didn't seem to help. I still see the upstream flip-flopping between Unhealthy/Healthy, and the log looks mostly the same. New log attached here:

Looking at the code, one thing I wondered about with the prior fix for the Route 53 issue is that it keeps track of the lastQuery and compares that against the newQuery. The assumption seems to be that DNS queries will not happen more than once per second per host. But is that a valid assumption? queryDns() is called on a 1-second timer, but it's also called by addressStillValid(), which is called from getPeer(). So it seems like queryDns() can potentially be called much more frequently.

Regarding your question about the frequency in our production environment: I think what I'm seeing matches up with what you said (1 occurrence per worker, per upstream, every minute). However, I can't get the exact numbers because our production environment runs 8 instances of Kong across 2 datacenters, and all of the error logs are aggregated to a common location, so the same PIDs can occur across instances, which makes it difficult to filter logs down to one specific worker on one specific instance. During a 2-minute period, I collected 600 UNHEALTHY/HEALTHY transitions across the entire cluster, which seems about right as a ballpark. It's definitely not 1 per second, per worker, per upstream like I originally thought; it just seems that way due to the number of instances and workers.
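The two call paths mentioned above can be pictured roughly like this; it is a simplified standalone sketch of the call structure only (the function names queryDns, addressStillValid and getPeer come from the thread, but every body and data shape here is a placeholder, not the lua-resty-dns-client code):

```lua
-- Simplified sketch of the two paths that can trigger a DNS re-query when ttl=0.

local query_count = 0

local function queryDns(hostname)
  query_count = query_count + 1
  print(("re-querying DNS for %s (query #%d)"):format(hostname, query_count))
end

-- path 1: a background timer fires roughly once per second per worker
local function timer_tick(hosts)
  for _, hostname in ipairs(hosts) do
    queryDns(hostname)
  end
end

-- path 2: every proxied request ends up in getPeer(), which validates the
-- chosen address; for ttl=0 records that validation re-queries DNS as well
local function addressStillValid(address)
  if address.ttl == 0 then
    queryDns(address.hostname)
  end
  return true
end

local function getPeer(addresses)
  local address = addresses[1]          -- pick an address (simplified)
  addressStillValid(address)
  return address.ip, address.port
end

-- so with frequent traffic, queryDns() can run far more often than once per second:
timer_tick({ "service.internal" })                                           -- 1 query
for _ = 1, 3 do                                                              -- 3 more
  getPeer({ { hostname = "service.internal", ip = "10.0.0.1", port = 8080, ttl = 0 } })
end
```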
@chris-branch Sorry, my bad, I made a mistake. The `if` should be:

```lua
if (newQuery[1] or empty).ttl == 0 and
   (((oldQuery[1] or empty).ttl or 0) == 0 or oldQuery.__ttl0Flag) then
```

Can you give that a try?
@Tieske - No worries. I applied the new code change, and that does appear to resolve the issue, at least with my small test scenario. I no longer see Unhealthy/Healthy flip-flopping. I do see DNS being re-queried every second, per worker, per upstream, but that appears to be intentional. Ideally, it would only do the timer-based query if/when the upstream is idle for 1 second. Since a DNS query already happens on every request when TTL = 0, the timer-based DNS query is somewhat redundant if the upstream is serving requests frequently, but I don't think that's a huge issue. I've attached the latest log in case you'd like to review: errorlog_0123_0245.log

Thanks for your help troubleshooting this!
the queries should run once per minute, not once per second, so that seems to be another, separate (?) issue. When a ttl=0 is encountered we inject a fake SRV record with a TTL of 60 seconds. This is explicitly to update that record (for internal balancer purposes) only once per minute; essentially we're only checking whether the TTL is still 0 or maybe it was changed. This line sets the expire time of the injected record:
The timer (which runs once per second) checks for expired records here: Somehow the timer considers the record expired on each run.
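A standalone sketch of that ttl=0 handling as described here; the field names __ttl0Flag, expire, touched and ttl0Interval are taken from the snippets in this thread, while everything else (function names, record shape) is illustrative only, not the library code:

```lua
-- Sketch of the injected "fake" SRV record and the expiry check described above.

local time = os.time  -- the real code runs inside OpenResty and uses ngx-based time

local ttl0Interval = 60  -- seconds: re-check a ttl=0 name only once per minute

-- when a ttl=0 answer is seen, a fake SRV record is injected for the name
local function inject_ttl0_record(name, ip, port)
  return {
    { type = "SRV", name = name, target = ip, port = port, ttl = ttl0Interval },
    __ttl0Flag = true,                  -- marks this record as the injected one
    touched    = time(),
    expire     = time() + ttl0Interval, -- the bug discussed below: on later
                                        -- re-queries that still returned ttl=0 this
                                        -- was not refreshed, so once it lapsed the
                                        -- timer re-queried the record every second
  }
end

-- the once-per-second timer only needs to re-query expired records
local function needs_requery(record)
  return record.expire <= time()
end

local rec = inject_ttl0_record("service.internal", "10.0.0.1", 8080)
print(needs_requery(rec))  --> false until roughly a minute has passed
```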
Ok, I think I found it. The first time the name is resolved for worker
The first requery log:
Looking at the timestamps, there is 1 minute in between. So initially it works. The problem is that the
@chris-branch can you give this another try? On top of the previous change, add these lines:

```lua
if oldQuery.__ttl0Flag then
  -- still ttl 0 so nothing changed
  oldQuery.touched = time()                              --> Added
  oldQuery.expire = time() + self.balancer.ttl0Interval  --> Added
  ngx_log(ngx_DEBUG, self.log_prefix, "no dns changes detected for ",
          self.hostname, ", still using ttl=0")
  return true
end
```
* the previous record was not properly detected as a ttl=0 record; by checking on the __ttl0flag we now do
* since the "fake" SRV record wasn't updated with a new expiry time, the expiry-check-timer would keep updating that record every second

Fixes Kong issue Kong/kong#5477
@Tieske - The code in your PR looks good. I applied those changes (including the suggested change from @bungle) and executed my test setup. The log is clean, other than DNS refreshes every minute (as intended). New log file attached for your review: errorlog_0124_0203.log

I'm guessing this will be included in a 1.4.4 and/or 1.5.1 release at some point. Any idea how far in the future that might be?
@chris-branch awesome! thx so much for testing this. |
Summary
If an upstream target is configured with a domain name whose DNS lookup returns a TTL value of 0, Kong will continuously toggle the upstream state between Unhealthy and Healthy every second. This results in a large number of error log entries even when using the default log level (NOTICE). Example log output is included below.

In our situation, we have several upstreams for internal services that are intentionally configured with TTL = 0 for load balancing/failover purposes, and we are not able to modify that.
Steps To Reproduce
To reproduce this issue, you'll need a domain name that has TTL = 0. I put together a test case that runs a CoreDNS server in a container in order to satisfy this requirement for testing.
1. Run `./docker-up.sh` to get everything up and running (CoreDNS, Cassandra database, Kong migrations + bootstrap, Kong).
2. Follow the Kong logs (`docker logs -f kong`) and note that the upstream continually flip-flops between Unhealthy and Healthy. Also note that DNS entries are repeatedly added/removed from the cache.

Additional Details & Logs
Example abbreviated log output:
Kong version: 1.4.3 (the same issue also occurs with 1.4.2)
Operating system: Docker running on Mac OS 10.14.6 (Mojave) and also Pivotal Cloud Foundry