“Ghost” kong instances never disappear and drastically slow down the startup process #2192

Closed
pamiel opened this issue Mar 10, 2017 · 8 comments
Labels
task/needs-investigation Requires investigation and reproduction before classifying it as a bug or not.

Comments

@pamiel
Contributor

pamiel commented Mar 10, 2017

Summary

I’m deploying Kong in a Docker/Kubernetes environment. While running tests, I constantly start and stop Kong containers, but I only ever have 1 running node at a time.

After a few such stop/restart cycles (I cannot say exactly how many; fewer than 10, and it does not happen after exactly the same number of stop/starts each time), Kong startup takes longer and longer (up to several minutes).
The logs show the following lines during the startup process, and Kong takes 15 to 20s to move on to the next line:

2017/03/10 08:46:30 [warn] could not join cluster at 10.1.5.23:7946, if the node does not exist anymore it will be purged automatically
2017/03/10 08:46:33 [warn] could not join cluster at 10.1.7.158:7946, if the node does not exist anymore it will be purged automatically
2017/03/10 08:46:36 [warn] could not join cluster at 10.1.7.134:7946, if the node does not exist anymore it will be purged automatically
2017/03/10 08:46:39 [warn] could not join cluster at 10.1.7.5:7946, if the node does not exist anymore it will be purged automatically
2017/03/10 08:46:42 [warn] could not join cluster at 10.1.7.173:7946, if the node does not exist anymore it will be purged automatically
2017/03/10 08:46:45 [warn] could not join cluster at 10.1.7.147:7946, if the node does not exist anymore it will be purged automatically
2017/03/10 08:46:48 [warn] could not join cluster at 10.1.10.163:7946, if the node does not exist anymore it will be purged automatically

It looks like there are some Kong “ghost” instances (since I only ever have a single node active at a time)!
As the log says, I was hoping those “ghosts” would be “purged”… but this is not the case: if I stop and start Kong again, those lines appear again (and startup still takes a very long time). The number of lines sometimes even increases, making the next restarts even longer!

Going into the Cassandra DB I’m using, I can see the ghosts in the nodes table:

name                                                     | cluster_listening_address | created_at 
----------------------------------------------------------------------------------------------------------+---------------------------+-------------------------- 
kong-ff56536535b0c47a68b93b01548e5e51-ppo3w_0.0.0.0:7946_d051cdb3da8d4bbcb3f1d4d768af5fdb |           10.1.7.147:7946 | 2017-03-09 10:42:31+0000
kong-kdc5y_0.0.0.0:7946_369d51255a5647a1a9166d50ba2fa7d3 |           10.1.7.173:7946 | 2017-03-09 10:55:23+0000
kong-ws11y_0.0.0.0:7946_6f9ce1ccace84a0292f7457961c2b71d |             10.1.7.5:7946 | 2017-03-09 12:43:23+0000 
kong-p0m3r_0.0.0.0:7946_2d116f644d9c42cd84efe01c8f71dee8 |          10.1.10.163:7946 | 2017-03-09 09:04:33+0000 
kong-18bqq_0.0.0.0:7946_b5ea3f460d6b485a92b9bf88990f4ee2 |           10.1.2.252:7946 | 2017-03-09 08:49:46+0000 
kong-ff56536535b0c47a68b93b01548e5e51-defn3_0.0.0.0:7946_c189218f1bfa4ae0ba216ebebf655b00 |            10.1.5.23:7946 | 2017-03-10 08:45:42+0000
kong-xaib1_0.0.0.0:7946_c53cef1332224bc19bb14edd724740c0 |           10.1.7.134:7946 | 2017-03-09 15:06:53+0000
kong-hita0_0.0.0.0:7946_1fba656558e8461c9890b06f8e9b3158 |           10.1.7.158:7946 | 2017-03-09 15:57:20+0000
kong-pvfk2_0.0.0.0:7946_715627b6aff3478f9928cc698747f375 |             10.1.7.4:7946 | 2017-03-10 08:46:52+0000
 (9 rows)

The only active node corresponds to the last line.
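
(For reference, the listing above comes from a query along these lines, run in cqlsh; I am assuming Kong’s default kong keyspace here:)

SELECT name, cluster_listening_address, created_at FROM kong.nodes;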

Looking at the “Edge-case scenario” section of the documentation https://getkong.org/docs/0.9.x/clustering/#automatic-cache-purge-on-join, I was expecting this to be a cache purge case… but when I run kong cluster members, I cannot see my ghosts:

$ kong cluster members 
kong-pvfk2_0.0.0.0:7946_715627b6aff3478f9928cc698747f375 10.1.7.4:7946   alive 

Same for kong cluster reachability:

$ kong cluster reachability 
Total members: 1, live members: 1 
Starting reachability test...         
Successfully contacted all live nodes

Even after waiting past the cluster_ttl_on_failure delay (kept at the default value of 3600), I still have my ghosts in the table (see the CQL output above: some of the ghosts are more than a day old).

Looking at the ongoing issues, this seems close to what is discussed in #2182, #2164 and #2125, but I’m not on 0.10.0 and I only have 1 running instance, so it is not exactly the same issue.

Configuration around cluster_ttl_on_failure, cluster_listen, cluster_listen_rpc, etc. is not modified (default values).
Of course, if I reset the database, everything goes back to normal behaviour!

Steps To Reproduce

See above

Additional Details & Logs

  • Kong version: 0.9.9
  • Operating System: Alpine 3.5
  • Run with Docker 1.2 and Kubernetes 1.4
  • Serf version: 0.7.0
@subnetmarco
Member

Can you set cluster_ttl_on_failure to 60? That will automatically remove a failed node from the database in one minute.
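
For example, either directly in kong.conf:

cluster_ttl_on_failure = 60

or, since you are running in Docker/Kubernetes, via the corresponding environment variable override (Kong reads settings prefixed with KONG_):

KONG_CLUSTER_TTL_ON_FAILURE=60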

@pamiel
Contributor Author

pamiel commented Mar 12, 2017

Nothing changed with cluster_ttl_on_failure set to 60: ghost rows are still present in the DB, increasing by 1 at each new start, and each start still displays the could not join cluster at lines...

Strangely, nothing related to a recurring action running every 60 seconds shows up in the log (Kong being started with the --vv option + debug log level): I only get the following logs every 30 seconds:

2017/03/12 10:12:20 [debug] 177#0: *66 [lua] cluster.lua:20: log(): [cluster] sending keepalive event to datastore
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Connecting to kong-database
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Session connected to kong-database
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Socket retrieved from the connection pool, 31 times
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Load balancing policy proposed to try host at: kong-database
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Connecting to kong-database
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Session connected to kong-database
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Socket retrieved from the connection pool, 32 times
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Load balancing policy proposed to try host at: kong-database
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Connecting to kong-database
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Session connected to kong-database
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Socket retrieved from the connection pool, 33 times
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Load balancing policy proposed to try host at: kong-database
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Connecting to kong-database
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Session connected to kong-database
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Socket retrieved from the connection pool, 34 times
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Load balancing policy proposed to try host at: kong-database
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Connecting to kong-database
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Session connected to kong-database
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Socket retrieved from the connection pool, 35 times
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Load balancing policy proposed to try host at: kong-database
2017/03/12 10:12:20 [notice] 177#0: signal 17 (SIGCHLD) received
2017/03/12 10:12:20 [info] 177#0: waitpid() failed (10: No child process)
2017/03/12 10:12:20 [debug] 178#0: *69 [lua] log.lua:22: debug(): Connecting to kong-database
2017/03/12 10:12:20 [debug] 178#0: *69 [lua] log.lua:22: debug(): Session connected to kong-database
2017/03/12 10:12:20 [debug] 178#0: *69 [lua] log.lua:22: debug(): Socket retrieved from the connection pool, 6 times
2017/03/12 10:12:20 [debug] 178#0: *69 [lua] log.lua:22: debug(): Load balancing policy proposed to try host at: kong-database

Can you give me an example of the logs I should see for this automatic failed-node removal process, so that I can track them?

@subnetmarco
Member

subnetmarco commented Mar 13, 2017

@pamiel sorry, I forgot to mention that when you set cluster_ttl_on_failure=60, only new nodes will have a TTL of 60; the old ones will still have the previous 3600 TTL (1 hour).

You can start with a fresh environment by:

  • Shutting down every node
  • Deleting the contents of the nodes table by executing TRUNCATE nodes; on Cassandra
  • Starting the nodes again (making sure they all have cluster_ttl_on_failure=60 set)

By doing so, when you forcibly shut down a node it should only remain in the nodes table for 60 seconds, and then it should automatically disappear.
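
Assuming the default kong keyspace, the truncation step can be done with something like:

cqlsh -e "TRUNCATE kong.nodes;"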

@subnetmarco
Member

subnetmarco commented Mar 13, 2017

Even after waiting past the cluster_ttl_on_failure delay (kept at the default value of 3600), I still have my ghosts in the table (see the CQL output above: some of the ghosts are more than a day old).

@pamiel in the meantime, I will be investigating this problem.

You can check the remaining TTL at any time by running:

SELECT cluster_listening_address, TTL(cluster_listening_address) FROM nodes;

@pamiel
Contributor Author

pamiel commented Mar 14, 2017

@thefosk Mixed results...

  • I updated cluster_ttl_on_failure to 60, but it did not produce visible results... still the log lines, plus ghost rows appearing in the database after a couple of start/stops.
  • I then tried another test, adding a preStop clause in Kong's Kubernetes deployment file so that kong stop is always executed at the end of each node's life (I was expecting that a clean stop might remove the node reference somewhere...); a sketch of the kind of hook I mean follows this list.
    At first it looked like things were going better... but then the ghosts came back, visible through the could not join cluster at xx.xx.xx.xx:7946, if the node does not exist anymore it will be purged automatically log lines, and through the nodes table content.
    Now I have 30 rows while only 3 nodes have been up and running for the last 12 hours.
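
For reference, this is roughly what the preStop hook looks like in the container spec of my deployment (a sketch from memory, the exact command may differ):

lifecycle:
  preStop:
    exec:
      command: ["kong", "stop"]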

I just ran the SELECT you mentioned. Here is the output (I unfortunately started my second test with a TTL of 3600...):

 cluster_listening_address | ttl(cluster_listening_address)
---------------------------+--------------------------------
           10.1.6.163:7946 |                           3590
           10.1.5.112:7946 |                           2401
            10.1.1.78:7946 |                           null
           10.1.12.71:7946 |                            777
            10.1.7.14:7946 |                           null
           10.1.10.27:7946 |                           null
            10.1.5.72:7946 |                           null
           10.1.7.185:7946 |                           null
[...]
(30 rows)

There are many more rows with a null value for the TTL and a creation date more than 12 hours old.
I tracked one of the counters down to 0... and that node did disappear from the table!... but I still have many other rows with a null TTL that never disappear...

One additional piece of information: the could not join cluster at xx.xx.xx.xx:7946, if the node does not exist anymore it will be purged automatically lines only appear when starting the very first Kong instance of the cluster. If a node is already up and running, starting another node (the 2nd, 3rd... of the cluster) does not display such logs and the startup goes much faster.
If I scale back down to 0 and then restart a first node, I again get the log lines and the slow startup...

I will reset my nodes table, set the TTL to 60, keep the preStop hook to make sure Kong stops properly (unless you tell me not to), and run some more tests (especially to see whether the ghosts correspond to first nodes of the cluster or not).

@subnetmarco
Member

If a node is already up and running, starting another node (the 2nd, 3rd... of the cluster) does not display such logs and the startup goes much faster.

Yes, this happens because Kong orders the nodes by most recent keep-alive and tries to join the most recent one first. If that works, it does not iterate over the rest of the list.

I will investigate the null problem.
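
In pseudo-code, the join logic described above behaves roughly like this (an illustration only, not the actual Kong source; serf_join is a stand-in for the real Serf invocation):

-- `nodes` is assumed to be sorted by most recent keepalive first
local function serf_join(address)
  return false -- stand-in for shelling out to `serf join <address>`
end

local function join_first_reachable(nodes)
  for _, node in ipairs(nodes) do
    if serf_join(node.cluster_listening_address) then
      return true -- joined successfully; stop iterating over the list
    end
    print("could not join cluster at " .. node.cluster_listening_address)
  end
  return false
end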

@pamiel
Contributor Author

pamiel commented Mar 22, 2017

Hi,
I managed to partially solve the issue by using the FQDN in the Cassandra contact point: it looks like the ghosts were created by Kong instances that had connection issues with Cassandra.
To be frank, I don’t know whether they could not connect to Cassandra at all, or (more likely) whether they had access at the beginning of startup and then lost it, but all of the ghosts had log entries such as:

2017/03/14 13:16:58 [error] 90#0: *294 [lua] cluster.lua:20: log(): [cluster] could not retrieve nodes from datastore: [cassandra error] Could not create lock for prepare request: Error locking mutex: timeout, context: ngx.timer

For those instances, a new row was correctly created in the nodes table, but the TTL was never set to 60 (it stays null forever). I don’t know why Cassandra’s TTL is not being applied there, but after switching to the FQDN I have no more Cassandra connectivity issues in Kong’s logs, and no more ghosts in the DB!
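
Concretely, the change was in the Cassandra contact point setting, along these lines (the namespace is just my example; the FQDN follows the usual <service>.<namespace>.svc.cluster.local Kubernetes pattern):

# before: short service name, resolved through the pod's DNS search list
KONG_CASSANDRA_CONTACT_POINTS=kong-database

# after: fully qualified service name
KONG_CASSANDRA_CONTACT_POINTS=kong-database.default.svc.cluster.local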

So, why am I saying that the issue is only “partially solved”? Because it is now the upstream connectivity that fails: I was previously using Routes with no problem; but when using host names within the Kubernetes network, I receive an HTML “Kong Error - An invalid response was received from the upstream server” page from Kong, and Kong’s logs say:
[error] 310#0: *1083 connect() failed (113: Host is unreachable) while connecting to upstream

And this time, using an FQDN in the upstream definition does not solve the problem :(

As mentioned in my post here (Kong/kong-dist-kubernetes#6), there are indeed dozens of similar posts related to (sometimes random) DNS issues, and I still do not understand the root cause of the issue (or which component it lies in).

Using FQDNs looks like just a workaround (and not one that works for all use cases, as in mine), and since the issue can happen randomly, I’m not sure it will not come back later, even with FQDNs…

And I still don’t understand where the real problem lies…

@thibaultcha thibaultcha added the task/needs-investigation Requires investigation and reproduction before classifying it as a bug or not. label Apr 28, 2017
@pamiel
Contributor Author

pamiel commented Dec 1, 2017

This obviously no longer happens on 0.11, as Serf has been removed and there is no more nodes table!
So I am closing this.
Thanks.

@pamiel pamiel closed this as completed Dec 1, 2017