“Ghost” kong instances never disappear and drastically slow down the startup process #2192

Closed
pamiel opened this issue Mar 10, 2017 · 8 comments
Labels
task/needs-investigation Requires investigation and reproduction before classifying it as a bug or not.

Comments

@pamiel
Contributor

pamiel commented Mar 10, 2017

Summary

I’m deploying Kong in a Docker/Kubernetes environment. While running tests, I constantly start and stop Kong containers, but I only ever have 1 running node at a time.

After a few such stop/restart cycles (I cannot say exactly how many; fewer than 10, and it does not happen after exactly the same number of stop/starts each time), Kong startup takes longer and longer (up to several minutes).
The logs show the following lines during the startup process, and Kong takes 15 to 20s to move on to the next line:

2017/03/10 08:46:30 [warn] could not join cluster at 10.1.5.23:7946, if the node does not exist anymore it will be purged automatically
2017/03/10 08:46:33 [warn] could not join cluster at 10.1.7.158:7946, if the node does not exist anymore it will be purged automatically
2017/03/10 08:46:36 [warn] could not join cluster at 10.1.7.134:7946, if the node does not exist anymore it will be purged automatically
2017/03/10 08:46:39 [warn] could not join cluster at 10.1.7.5:7946, if the node does not exist anymore it will be purged automatically
2017/03/10 08:46:42 [warn] could not join cluster at 10.1.7.173:7946, if the node does not exist anymore it will be purged automatically
2017/03/10 08:46:45 [warn] could not join cluster at 10.1.7.147:7946, if the node does not exist anymore it will be purged automatically
2017/03/10 08:46:48 [warn] could not join cluster at 10.1.10.163:7946, if the node does not exist anymore it will be purged automatically

It looks like there are some Kong “ghost” instances (since I only ever have a single node active at a time)!
As the log says, I was hoping those “ghosts” would be “purged”… but this is not the case: if I stop and start Kong again, those lines appear again (and startup still takes a very long time). The number of lines sometimes even increases, making the next restarts even longer!

Going into the Cassandra DB I’m using, I can see the ghosts in the nodes table:

name                                                     | cluster_listening_address | created_at 
----------------------------------------------------------------------------------------------------------+---------------------------+-------------------------- 
kong-ff56536535b0c47a68b93b01548e5e51-ppo3w_0.0.0.0:7946_d051cdb3da8d4bbcb3f1d4d768af5fdb |           10.1.7.147:7946 | 2017-03-09 10:42:31+0000
kong-kdc5y_0.0.0.0:7946_369d51255a5647a1a9166d50ba2fa7d3 |           10.1.7.173:7946 | 2017-03-09 10:55:23+0000
kong-ws11y_0.0.0.0:7946_6f9ce1ccace84a0292f7457961c2b71d |             10.1.7.5:7946 | 2017-03-09 12:43:23+0000 
kong-p0m3r_0.0.0.0:7946_2d116f644d9c42cd84efe01c8f71dee8 |          10.1.10.163:7946 | 2017-03-09 09:04:33+0000 
kong-18bqq_0.0.0.0:7946_b5ea3f460d6b485a92b9bf88990f4ee2 |           10.1.2.252:7946 | 2017-03-09 08:49:46+0000 
kong-ff56536535b0c47a68b93b01548e5e51-defn3_0.0.0.0:7946_c189218f1bfa4ae0ba216ebebf655b00 |            10.1.5.23:7946 | 2017-03-10 08:45:42+0000
kong-xaib1_0.0.0.0:7946_c53cef1332224bc19bb14edd724740c0 |           10.1.7.134:7946 | 2017-03-09 15:06:53+0000
kong-hita0_0.0.0.0:7946_1fba656558e8461c9890b06f8e9b3158 |           10.1.7.158:7946 | 2017-03-09 15:57:20+0000
kong-pvfk2_0.0.0.0:7946_715627b6aff3478f9928cc698747f375 |             10.1.7.4:7946 | 2017-03-10 08:46:52+0000
 (9 rows)

The only active node corresponds to the last line.
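
(For reference, the listing above comes from a query along these lines, run in cqlsh; I am assuming Kong’s default kong keyspace here:)

SELECT name, cluster_listening_address, created_at FROM kong.nodes;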

Looking at the “Edge-case scenario” section of the documentation https://getkong.org/docs/0.9.x/clustering/#automatic-cache-purge-on-join, I was expecting this to be a cache purge case… but when I run kong cluster members, I cannot see my ghosts:

$ kong cluster members 
kong-pvfk2_0.0.0.0:7946_715627b6aff3478f9928cc698747f375 10.1.7.4:7946   alive 

Same for kong cluster reachability:

$ kong cluster reachability 
Total members: 1, live members: 1 
Starting reachability test...         
Successfully contacted all live nodes

Even after waiting past the cluster_ttl_on_failure delay (kept at the default value of 3600), I still have my ghosts in the table (see the CQL output above: some of the ghosts are more than a day old).

Looking at the ongoing issues, this seems close to what is discussed in #2182, #2164 and #2125, but I’m not on 0.10.0 and I only have 1 running instance, so it is not exactly the same issue.

Configuration around cluster_ttl_on_failure, cluster_listen, cluster_listen_rpc, etc. is not modified (default values).
Of course, if I reset the database, everything goes back to normal behaviour!

Steps To Reproduce

See above

Additional Details & Logs

  • Kong version: 0.9.9
  • Operating System: Alpine 3.5
  • Run with Docker 1.2 and Kubernetes 1.4
  • Serf version: 0.7.0
@subnetmarco
Member

Can you set cluster_ttl_on_failure to 60? That will automatically remove a failed node from the database in one minute.
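
For example, either directly in kong.conf:

cluster_ttl_on_failure = 60

or, since you are running in Docker/Kubernetes, via the corresponding environment variable override (Kong reads settings prefixed with KONG_):

KONG_CLUSTER_TTL_ON_FAILURE=60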

@pamiel
Contributor Author

pamiel commented Mar 12, 2017

Nothing changed with cluster_ttl_on_failure set to 60: ghost rows are still present in the DB, increasing by 1 at each new start, and each start still displays the could not join cluster at lines...

Strangely, nothing related to a recurring action running every 60 seconds shows up in the log (Kong being started with the --vv option + debug log level): I only get the following logs every 30 seconds:

2017/03/12 10:12:20 [debug] 177#0: *66 [lua] cluster.lua:20: log(): [cluster] sending keepalive event to datastore
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Connecting to kong-database
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Session connected to kong-database
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Socket retrieved from the connection pool, 31 times
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Load balancing policy proposed to try host at: kong-database
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Connecting to kong-database
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Session connected to kong-database
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Socket retrieved from the connection pool, 32 times
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Load balancing policy proposed to try host at: kong-database
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Connecting to kong-database
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Session connected to kong-database
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Socket retrieved from the connection pool, 33 times
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Load balancing policy proposed to try host at: kong-database
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Connecting to kong-database
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Session connected to kong-database
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Socket retrieved from the connection pool, 34 times
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Load balancing policy proposed to try host at: kong-database
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Connecting to kong-database
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Session connected to kong-database
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Socket retrieved from the connection pool, 35 times
2017/03/12 10:12:20 [debug] 177#0: *66 [lua] log.lua:22: debug(): Load balancing policy proposed to try host at: kong-database
2017/03/12 10:12:20 [notice] 177#0: signal 17 (SIGCHLD) received
2017/03/12 10:12:20 [info] 177#0: waitpid() failed (10: No child process)
2017/03/12 10:12:20 [debug] 178#0: *69 [lua] log.lua:22: debug(): Connecting to kong-database
2017/03/12 10:12:20 [debug] 178#0: *69 [lua] log.lua:22: debug(): Session connected to kong-database
2017/03/12 10:12:20 [debug] 178#0: *69 [lua] log.lua:22: debug(): Socket retrieved from the connection pool, 6 times
2017/03/12 10:12:20 [debug] 178#0: *69 [lua] log.lua:22: debug(): Load balancing policy proposed to try host at: kong-database

Can you give me an example of the logs I should see for this automatic failed-node removal process, so that I can track them?

@subnetmarco
Member

subnetmarco commented Mar 13, 2017

@pamiel sorry, I forgot to mention that when you set cluster_ttl_on_failure=60, only new nodes will have a TTL of 60; the old ones will still have the previous 3600 TTL (1 hour).

You can start with a fresh environment by:

  • Shutting down every node
  • Deleting the contents of the nodes table by executing TRUNCATE nodes; on Cassandra
  • Starting the nodes again (making sure they all have cluster_ttl_on_failure=60 set)

By doing so, when you forcibly shut down a node it should only remain in the nodes table for 60 seconds, and then it should automatically disappear.
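
Assuming the default kong keyspace, the truncation step can be done with something like:

cqlsh -e "TRUNCATE kong.nodes;"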

@subnetmarco
Member

subnetmarco commented Mar 13, 2017

Even after waiting past the cluster_ttl_on_failure delay (kept at the default value of 3600), I still have my ghosts in the table (see the CQL output above: some of the ghosts are more than a day old).

@pamiel in the meantime, I will be investigating this problem.

You can check the remaining TTL at any time by running:

SELECT cluster_listening_address, TTL(cluster_listening_address) FROM nodes;

@pamiel
Contributor Author

pamiel commented Mar 14, 2017

@thefosk Mixed results...

  • I updated cluster_ttl_on_failure to 60, but it did not produce visible results... still the log lines, plus ghost rows appearing in the database after a couple of start/stops.
  • I then tried another test, adding a preStop clause in Kong's Kubernetes deployment file so that kong stop is always executed at the end of each node's life (I was expecting that a clean stop might remove the node reference somewhere...); a sketch of the kind of hook I mean follows this list.
    At first it looked like things were going better... but then the ghosts came back, visible through the could not join cluster at xx.xx.xx.xx:7946, if the node does not exist anymore it will be purged automatically log lines, and through the nodes table content.
    Now I have 30 rows while only 3 nodes have been up and running for the last 12 hours.
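
For reference, this is roughly what the preStop hook looks like in the container spec of my deployment (a sketch from memory, the exact command may differ):

lifecycle:
  preStop:
    exec:
      command: ["kong", "stop"]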

I just ran the SELECT you mentioned. Here is the output (I unfortunately started my second test with a TTL of 3600...):

 cluster_listening_address | ttl(cluster_listening_address)
---------------------------+--------------------------------
           10.1.6.163:7946 |                           3590
           10.1.5.112:7946 |                           2401
            10.1.1.78:7946 |                           null
           10.1.12.71:7946 |                            777
            10.1.7.14:7946 |                           null
           10.1.10.27:7946 |                           null
            10.1.5.72:7946 |                           null
           10.1.7.185:7946 |                           null
[...]
(30 rows)

There are many more rows with a null value for the TTL and a creation date more than 12 hours old.
I tracked one of the counters down to 0... and that node did disappear from the table!... but I still have many other rows with a null TTL that never disappear...

One additional piece of information: the could not join cluster at xx.xx.xx.xx:7946, if the node does not exist anymore it will be purged automatically lines only appear when starting the very first Kong instance of the cluster. If a node is already up and running, starting another node (the 2nd, 3rd... of the cluster) does not display such logs and the startup goes much faster.
If I scale back down to 0 and then restart a first node, I again get the log lines and the slow startup...

I will reset my nodes table, set the TTL to 60, keep the preStop hook to make sure Kong stops properly (unless you tell me not to), and run some more tests (especially to see whether the ghosts correspond to first nodes of the cluster or not).

@subnetmarco
Member

If a node is already up and running, starting another node (the 2nd, 3rd... of the cluster) does not display such logs and the startup goes much faster.

Yes, this happens because Kong orders the nodes by most recent keep-alive and tries to join the most recent one first. If that works, it does not iterate over the rest of the list.

I will investigate the null problem.
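
In pseudo-code, the join logic described above behaves roughly like this (an illustration only, not the actual Kong source; serf_join is a stand-in for the real Serf invocation):

-- `nodes` is assumed to be sorted by most recent keepalive first
local function serf_join(address)
  return false -- stand-in for shelling out to `serf join <address>`
end

local function join_first_reachable(nodes)
  for _, node in ipairs(nodes) do
    if serf_join(node.cluster_listening_address) then
      return true -- joined successfully; stop iterating over the list
    end
    print("could not join cluster at " .. node.cluster_listening_address)
  end
  return false
end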

@pamiel
Contributor Author

pamiel commented Mar 22, 2017

Hi,
I managed to partially solve the issue by using the FQDN in the Cassandra contact point: it looks like the ghosts were created by Kong instances that had connection issues with Cassandra.
To be frank, I don’t know whether they could not connect to Cassandra at all, or (more likely) whether they had access at the beginning of startup and then lost it, but all of the ghosts had log entries such as:

2017/03/14 13:16:58 [error] 90#0: *294 [lua] cluster.lua:20: log(): [cluster] could not retrieve nodes from datastore: [cassandra error] Could not create lock for prepare request: Error locking mutex: timeout, context: ngx.timer

For those instances, a new row was correctly created in the nodes table, but the TTL was never set to 60 (it stays null forever). I don’t know why Cassandra’s TTL is not being applied there, but after switching to the FQDN I have no more Cassandra connectivity issues in Kong’s logs, and no more ghosts in the DB!
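
Concretely, the change was in the Cassandra contact point setting, along these lines (the namespace is just my example; the FQDN follows the usual <service>.<namespace>.svc.cluster.local Kubernetes pattern):

# before: short service name, resolved through the pod's DNS search list
KONG_CASSANDRA_CONTACT_POINTS=kong-database

# after: fully qualified service name
KONG_CASSANDRA_CONTACT_POINTS=kong-database.default.svc.cluster.local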

So, why am I saying that the issue is only “partially solved”? Because it is now the upstream connectivity that fails: I was previously using Routes with no problem; but when using host names within the Kubernetes network, I receive an HTML “Kong Error - An invalid response was received from the upstream server” page from Kong, and Kong’s logs say:
[error] 310#0: *1083 connect() failed (113: Host is unreachable) while connecting to upstream

And this time, using an FQDN in the upstream definition does not solve the problem :(

As mentioned in my post here (Kong/kong-dist-kubernetes#6), there are indeed dozens of similar posts related to (sometimes random) DNS issues, and I still do not understand the root cause of the issue (or which component it lies in).

Using FQDNs looks like just a workaround (and not one that works for all use cases, as in mine), and since the issue can happen randomly, I’m not sure it will not come back later, even with FQDNs…

And I still don’t understand where the real problem lies…

@thibaultcha thibaultcha added the task/needs-investigation Requires investigation and reproduction before classifying it as a bug or not. label Apr 28, 2017
@pamiel
Contributor Author

pamiel commented Dec 1, 2017

This obviously no longer happens on 0.11, as Serf has been removed and there is no more nodes table!
So I am closing this.
Thanks.

@pamiel pamiel closed this as completed Dec 1, 2017