“Ghost” kong instances never disappear and drastically slow down the startup process #2192
Comments
Can you set cluster_ttl_on_failure to 60?
Nothing changed with cluster_ttl_on_failure set to 60. Strangely, nothing related to a regular action that would happen every 60 seconds is displayed in the log (Kong being started with verbose logging).
Can you give me an example of the logs I am supposed to see for this automatic failed-node removal process, so that I can track them?
@pamiel sorry, I forgot to mention that when you set cluster_ttl_on_failure you should also clean up the entries already in the nodes table. You can start with a fresh environment by truncating the nodes table.
By doing so, when you forcibly shut down a node it should only be logged in the nodes table for 60 seconds.
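To make the distinction concrete, here is a rough sketch of the two shutdown paths (the container name is a placeholder, and the comments describe the expected behaviour, not a guarantee):

```
# Sketch only; "my-kong-container" is a placeholder.

# Graceful shutdown: the node should leave the cluster cleanly and
# deregister itself from the nodes table.
kong quit

# Forcible shutdown: the node never gets a chance to deregister, so its
# row should linger in the nodes table until cluster_ttl_on_failure expires.
docker kill my-kong-container
```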
@pamiel in the meantime I will be investigating this problem. You can check the remaining TTL at any time by running: SELECT cluster_listening_address, TTL(cluster_listening_address) FROM nodes;
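For reference, one way to run that query from a shell, assuming the default kong keyspace and a locally reachable Cassandra node (adjust host, port, and keyspace to your setup):

```
# Sketch: check the remaining TTL on each registered node.
# Assumes the default "kong" keyspace and Cassandra on 127.0.0.1:9042.
cqlsh 127.0.0.1 9042 -e \
  "SELECT cluster_listening_address, TTL(cluster_listening_address) FROM kong.nodes;"
```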
@thefosk Mixed results...
I just ran the SELECT you mentioned. Here is the output (I unfortunately started my second test with a TTL of 3600...):
and many other lines with a null value for the TTL, and with a creation date that is more than 12h old. One additional piece of information: […]

I will reset my nodes table, set the TTL to 60, keep the preStop hook to be sure to stop Kong properly (unless you tell me not to), and run some more tests (especially to see whether the ghosts correspond to the first nodes of the cluster or not).
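In case it is useful, this is roughly what I plan to run (a sketch only: the kong keyspace, Cassandra address, and the KONG_* environment override are assumptions about my own setup):

```
# Sketch of the planned reset.

# 1. Clear the stale node registrations (assumes the default "kong" keyspace).
cqlsh 127.0.0.1 9042 -e "TRUNCATE kong.nodes;"

# 2. Lower the failure TTL to 60 seconds before restarting Kong.
#    Assumes the KONG_* environment override; setting
#    cluster_ttl_on_failure = 60 in kong.conf should be equivalent.
export KONG_CLUSTER_TTL_ON_FAILURE=60
kong restart
```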
Yes, this happens because Kong orders the nodes to find the one that has sent the most recent keep-alive, and tries to join that one. If that works, it will not iterate over the rest of the list. I will investigate the TTL issue.
Hi,
For those instances, a new line was correctly created in the nodes table, but the TTL was never set to the 60 value (it stays null).

So, why am I saying that the issue is “partially solved”? Because it is now the upstream connectivity that fails: I was previously using Routes => no problem; but when using host names within the Kubernetes network, I receive an HTML “Kong Error - An invalid response was received from the upstream server” page from Kong, and Kong’s logs report a DNS resolution error. And this time, using an FQDN in the upstream definition does not solve the problem :(

As mentioned in my post here (Kong/kong-dist-kubernetes#6), there are indeed tens of similar posts related to (sometimes random) DNS issues, and I still do not understand what the root cause is (or in which component it lies).

Using FQDNs looks like just a workaround (one that does not work for all use cases, such as mine), and since the issue can happen randomly, I’m not sure it won’t happen again later, even with FQDNs… And I still don’t understand where the real problem is…
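For clarity, the FQDN workaround I am referring to amounts to something like this (a sketch only: my-api, my-service, the default namespace, and port 8080 are placeholders, and it assumes the 0.9/0.10 Admin API listening on port 8001):

```
# Sketch of the FQDN workaround: point the API's upstream_url at the
# fully-qualified Kubernetes service name instead of the short name.
# "my-api", "my-service", "default", and port 8080 are placeholders.
curl -X PATCH http://localhost:8001/apis/my-api \
  --data "upstream_url=http://my-service.default.svc.cluster.local:8080"
```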
This obviously no longer happens on 0.11, as Serf has been removed and there is no more nodes table!
Summary
I’m deploying Kong in a Docker/Kubernetes environment. As I’m running tests, I’m constantly starting and stopping Kong containers, but I only ever have one node running at a time.
After a couple of such stop/restart cycles (I cannot tell you exactly how many… fewer than 10, and it does not happen after exactly the same number of stop/starts each time), the Kong startup takes longer and longer (up to several minutes).
The logs show the following lines during the startup process, and Kong takes 15 to 20 seconds to move from each line to the next:
It looks as if there are some Kong “ghost” instances (because I only have a single node active at a time)!
As the log suggests, I was hoping that those “ghosts” would be “purged”… but this is not the case: if I stop & start Kong again, those lines are there again (and it still takes a veeerrrry long time to start). And the number of such lines sometimes increases, making the next restarts even longer!
Looking into the Cassandra DB I’m using, I can see the ghosts in the “nodes” table:
The only active node corresponds to the last line.
Looking at the “Edge-case scenario” section of the documentation (https://getkong.org/docs/0.9.x/clustering/#automatic-cache-purge-on-join), I was expecting this to be a cache-purge use case… but when I run kong cluster members, I cannot see my ghosts:
Same for the kong cluster reachability command:
Even after waiting for the cluster_ttl_on_failure delay (kept at the default value of 3600), I still have my ghosts in the table (see the CQL output above: some of the ghosts are more than a day old).
Looking at the ongoing issues, this looks close to what is discussed in #2182, #2164 and #2125, but I’m not on 0.10.0 and I only have one running instance => not exactly the same issues.
The configuration around cluster_ttl_on_failure, cluster_listen, cluster_listen_rpc, etc. is not modified (default values).
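For completeness, these are the clustering defaults I believe apply here, written as KONG_* environment overrides (values quoted from memory of the 0.9.x defaults, so please double-check against your own kong.conf):

```
# Believed 0.9.x clustering defaults (none of these are changed in my setup).
export KONG_CLUSTER_LISTEN=0.0.0.0:7946        # Serf gossip address
export KONG_CLUSTER_LISTEN_RPC=127.0.0.1:7373  # Serf RPC address
export KONG_CLUSTER_TTL_ON_FAILURE=3600        # seconds before a failed node should expire
```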
Of course, if I reset the database, everything goes back to normal behaviour!
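By “reset the database” I mean something along these lines (destructive, sketch only; it assumes the kong CLI is pointed at the same configuration and Cassandra contact points):

```
# Sketch: wipe and re-create the Kong schema. THIS DELETES ALL KONG DATA.
kong migrations reset
kong migrations up
kong start
```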
Steps To Reproduce
See above
Additional Details & Logs