
node unable to join cluster #2182

Closed

tumluliu opened this issue Mar 9, 2017 · 19 comments

@tumluliu

tumluliu commented Mar 9, 2017

Summary

After upgrading to 0.10, the two nodes cannot see each other in the cluster :(

Steps To Reproduce

  1. Install Kong 0.10.0 on a node via deb (let me call it the master node, although I know there is no master/slave distinction in Kong)
  2. Install Kong 0.10.0 on another node via Docker (let me call it the slave node)

Additional Details & Logs

  • Kong version 0.10.0
  • Ubuntu 16.04

Both nodes are healthy, and I can see them in the database's nodes table as follows:

name                                                          cluster_listening_address  created_at
1913ec66b39d_0.0.0.0:7946_50510ee9a9914c63965682c52956480d    127.0.0.1:7946             2017-03-09 13:25:57
ors-gateway_0.0.0.0:7946_c1a37772c22843b1bd1e7e442b10e3f3     127.0.0.1:7946             2017-03-09 08:51:06

But they just cannot see each other. When executing curl http://127.0.0.1:8001/cluster, I get the following response on each of them:

master node

{
    "data": [
        {
            "address": "127.0.0.1:7946",
            "name": "1913ec66b39d_0.0.0.0:7946_50510ee9a9914c63965682c52956480d",
            "status": "alive"
        }
    ],
    "total": 1
}

slave node

{
    "data": [
        {
            "address": "127.0.0.1:7946",
            "name": "1913ec66b39d_0.0.0.0:7946_50510ee9a9914c63965682c52956480d",
            "status": "alive"
        }
    ],
    "total": 1
}

curl http://127.0.0.1:8001/cluster/nodes/ returns nothing but

{
    "message": "Not found"
}

And kong cluster members returns only one alive node on both of them. The clustering-related config on the master node:

cluster_listen = 0.0.0.0:7946
cluster_listen_rpc = 127.0.0.1:7373
cluster_ttl_on_failure = 3600
cluster_profile = lan

The docker start command for the slave node:

#!/bin/bash
docker run -d --name kong \
    -e "KONG_LOG_LEVEL=info" \
    -e "KONG_PG_HOST=DB_IP_ADDR" \
    -e "KONG_PG_DATABASE=kong" \
    -e "KONG_PG_USER=kong" \
    -e "KONG_PG_PASSWORD=******" \
    -p 8000:8000 \
    -p 8443:8443 \
    -p 8001:8001 \
    -p 7946:7946 \
    -p 7946:7946/udp \
    kong

What could be the problem? Any ideas? Thanks a lot!

BTW, they could definitely see each other in 0.9.8 before this upgrade.

@Tieske
Member

Tieske commented Mar 9, 2017

First thought: both Kong nodes think they run on 127.0.0.1, even though the Docker one is abstracted through an overlay/NAT network.

@tumluliu
Author

tumluliu commented Mar 9, 2017

@Tieske yes they are. But the config was just fine in 0.9.8. Should I change anything in 0.10's config? Anyway, I will try. Thanks a lot!

@Tieske
Member

Tieske commented Mar 9, 2017

You probably need to tweak the cluster_listen and cluster_advertise properties, see https://getkong.org/docs/0.10.x/network/
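
For example, a minimal sketch (assuming the node's routable address is 192.168.1.10; substitute your own, and both values can also be supplied as the KONG_CLUSTER_LISTEN / KONG_CLUSTER_ADVERTISE environment variables):

cluster_listen = 192.168.1.10:7946
cluster_advertise = 192.168.1.10:7946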

@tumluliu
Author

tumluliu commented Mar 9, 2017

@Tieske ok, problem solved. I just figured out that cluster_listen and cluster_advertise CANNOT be left at the default value 0.0.0.0:7946. I really suggest emphasising this in the documentation's network section, because this behavior changed from 0.9 to 0.10. For anyone with a similar problem, please change the cluster_listen config to your_ip_addr:7946 instead of 0.0.0.0:7946. For the Docker case, this is my start command that works:

docker run -d --name kong \
    -e "KONG_LOG_LEVEL=info" \
    -e "KONG_PG_HOST=YOUR_POSTGRES_DB_SERVER" \
    -e "KONG_PG_DATABASE=kong" \
    -e "KONG_PG_USER=kong" \
    -e "KONG_PG_PASSWORD=YOUR_PASSWORD" \
    -e "KONG_CLUSTER_ADVERTISE=YOUR_HOST_IP_ADDR:7946" \
    -e "KONG_CLUSTER_LISTEN=YOUR_DOCKER_CONTAINER_INTERNAL_IP_ADDR:7946" \
    -p 8000:8000 \
    -p 8443:8443 \
    -p 8001:8001 \
    -p 7946:7946 \
    -p 7946:7946/udp \
    kong

YOUR_DOCKER_CONTAINER_INTERNAL_IP_ADDR is 172.17.0.2 on my machine. To get this address, you can just run

docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' container_name_or_id

The command is from this answer on stackoverflow.

@jhenry82

jhenry82 commented Mar 9, 2017

I just ran into this testing 0.10 as well. The docs and Kong's behavior are out of sync; I'm not sure which is correct.

cluster_advertise

By default, the cluster_listen address is advertised over the cluster. If the cluster_listen host is '0.0.0.0', then the first local, non-loopback IPv4 address will be advertised to other nodes. However, in some cases (specifically NAT traversal), there may be a routable address that cannot be bound to. This flag enables advertising a different address to support this.

Specifically "the first local, non-loopback IPv4 address will be advertised"

However, with cluster_advertise unset, the node advertises itself as 127.0.0.1 (I can see it in the "nodes" DB table). My nodes are VMs, not Docker containers, and each has a 10.x.x.x IP available, but that is not the IP Kong picked to advertise. As @tumluliu said, this exact config did the right thing in terms of clustering in 0.9.8.
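
Presumably explicitly setting the advertise address on each VM works around it for now, e.g. (a sketch, with 10.0.0.12 standing in for the VM's actual 10.x.x.x address):

cluster_advertise = 10.0.0.12:7946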

@tumluliu
Author

tumluliu commented Mar 9, 2017

@jhenry82 I had exactly the same idea as you just 5 minutes ago, then I saw your reply 😆

I really think this is a bug that has existed since 0.10 rc3, as #2037 mentions. I did a little digging, but could not find where this cluster_listen conf value gets set and stored in the database. What I did find is that in the function normalize_ipv4, 0.0.0.0 is not given any special processing. I don't know how @Tieske can ensure it gets interpreted as the first non-loopback IPv4 address instead of 127.0.0.1. Actually, when I tried pinging 0.0.0.0 on my local machine, I got

PING 0.0.0.0 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.028 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.019 ms
...

btw, I don't have this issue in 0.9.8.

@tumluliu
Author

tumluliu commented Mar 9, 2017

I just noticed that there is a lua-ip project from your team for getting the first non-loopback IPv4 address. Why not use it in Kong?

@thibaultcha
Member

I believe this behavior comes directly from Serf, and Kong 0.10.0 bumps the Serf version from 0.7.0 to 0.8.0, so that could be the reason why. More details:

  • Here is where the node name you see in the DB comes from: prefix_handler.lua
  • Here is where the node inserts itself in the DB: serf.lua
  • The IP inserted in the cluster_listening_address column comes from running the underlying serf members command on the local Serf agent started by Kong: serf.lua

I just noticed that there is a lua-ip project from your team for getting the first non-loopback IPv4 address. Why not use it in Kong?

This is legacy, untested, and also undesired as it introduces one more of those C module dependencies we are trying to get rid of.
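
As a side note on the last bullet above: you can query the same Serf agent Kong starts to see exactly what it will advertise, for example (assuming cluster_listen_rpc is the 127.0.0.1:7373 shown earlier in this thread):

serf members -rpc-addr=127.0.0.1:7373

Whatever address appears there is what ends up in the cluster_listening_address column.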

@thibaultcha
Member

@tumluliu @jhenry82 To confirm, what is the result of serf members on your nodes, ideally with both 0.9.8 and 0.10.0? That would be really helpful. Thanks!

@tumluliu
Author

@thibaultcha thanks a lot for the information. For the moment, I can only provide the serf agent and serf members results for the machine running Kong 0.10, whose cluster_listen is set to 0.0.0.0:7946:

$ serf agent
==> Starting Serf agent...
==> Starting Serf agent RPC...
==> Serf agent running!
         Node name: 'ors-gateway'
         Bind addr: '0.0.0.0:7946'
          RPC addr: '127.0.0.1:7373'
         Encrypted: false
          Snapshot: false
           Profile: lan

==> Log data will now stream in as it occurs:

    2017/03/10 09:12:02 [INFO] agent: Serf agent starting
    2017/03/10 09:12:02 [INFO] serf: EventMemberJoin: ors-gateway 127.0.0.1
    2017/03/10 09:12:03 [INFO] agent: Received event: member-join
$ serf members
ors-gateway_0.0.0.0:7946_c1a37772c22843b1bd1e7e442b10e3f3  127.0.0.1:7946  alive

In the Docker container of the slave node:

$ docker exec kong serf members
40ebb2b13c40_0.0.0.0:7946_e7b02c6b941e405eb2198574598e9c11  127.0.0.1:7946  alive

Unfortunately, I haven't found an appropriate machine to test 0.9.8 on. I tried installing it on my MacBook with the pkg package, but Serf cannot find a private IP address and fails with this error message:

$ serf agent
==> Starting Serf agent...
==> Failed to start the Serf agent: Error creating Serf: Failed to create memberlist: No private IP address found, and explicit IP not provided
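
(Presumably passing an explicit address, e.g. serf agent -bind=<your_ip>:7946, would get past that check since the error says no explicit IP was provided, but I haven't retried the 0.9.8 pkg install that way.)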

BTW, the Serf version for my Kong 0.10 is 0.8.0, while for Kong 0.9.8 it is 0.7.0. When I find another machine that can test 0.9.8, I will post the serf members results. Anyway, I think the cause is becoming clearer: Serf wrongly returns 127.0.0.1 as the first non-loopback local IPv4 address.

@tumluliu
Author

tumluliu commented Mar 10, 2017

I have got the serf members information from a 0.9.8 Kong Docker container:

$ docker exec kong-0.9.8 serf version
Serf v0.7.0
Agent Protocol: 4 (Understands back to: 2)
$ docker exec kong-0.9.8 serf members
fdebb3fd63aa_0.0.0.0:7946_d73429383a0f438cb033c99932ea8469  172.17.0.2:7946    alive
ors-gateway_0.0.0.0:7946_c1a37772c22843b1bd1e7e442b10e3f3   192.168.2.17:7946  alive

The IPv4 address 172.17.0.2 picked up by Serf 0.7.0 is correct. And you can also see that the master node, whose cluster_listen is explicitly set to its local IP 192.168.2.17, can be found by this Kong node. Hope that helps, @thibaultcha.

@TransactCharlie

TransactCharlie commented Mar 15, 2017

Hi everyone.

I'm using docker-compose for local dev, and we use AWS ECS as our production Docker clustering solution. This makes working out what the cluster_listen address should be problematic.

Until this bug is fixed, we've come up with the following workaround:

I ended up writing a thin wrapper that we run before starting Kong. It requires the net-tools package to be installed on top of the official Kong image (yum install -y net-tools).

It works out the correct address to use at runtime:

#!/bin/sh
# Grab the container's eth0 IPv4 address (requires net-tools for ifconfig);
# the awk strips any /CIDR suffix from the field after "inet"
IP_ADDR=`ifconfig eth0 | awk '$1 == "inet" {gsub(/\/.*$/, "", $2); print $2}'`
echo "SETTING IP_ADDR FOR KONG CLUSTERING TO: ${IP_ADDR}"

# Make Serf listen on the routable address instead of 0.0.0.0/127.0.0.1
export KONG_CLUSTER_LISTEN="${IP_ADDR}:7946"
<start kong>

zwmlzaq added a commit to zwmlzaq/docker-kong that referenced this issue Mar 20, 2017
Append solution on Kong/kong#2182 for clustering...
@udangel-r7

I think hashicorp/memberlist#102 is the underlying issue, and Serf 0.8.2 should address this.

@Tieske
Member

Tieske commented Mar 24, 2017

As we're planning to remove the Serf dependency altogether, we're now leaning towards reverting Serf back to version 0.7 for the next release (0.10.1).

@udangel-r7

@Tieske what are you planning to replace Serf with? Hitting the database more often? I am wondering because we are looking into deploying Kong to more endpoints.

@MrSaints

MrSaints commented Mar 27, 2017

@TransactCharlie's trick worked. Though, since I'm using it inside an AWS AMI, I found hostname -I | cut -d' ' -f1 to be a lot more elegant when used with user data (but there are still a lot of steps involved in passing this to the container itself).

EDIT: On ECS, use 0.10.1, and set the network mode to host to save yourself time.
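
(With host networking the container shares the host's network stack, so Serf binds and advertises the instance's own IP directly. A rough plain-Docker equivalent, as a sketch rather than a full command:

docker run -d --name kong --network host -e "KONG_PG_HOST=YOUR_POSTGRES_DB_SERVER" kong

The -p mappings from the earlier commands become unnecessary, since the ports are exposed on the host itself.)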

@saamalik

saamalik commented Apr 21, 2017

Leaving this here as a note for other Docker users. Most of the time, Kong would not start properly once we introduced a persistent Postgres volume.

Long story short, we were also publishing a port on the Docker service, which causes the service containers to be attached to two networks: the user overlay network and the default ingress network. With both networks attached, the container has at least two interfaces with non-loopback IPv4 addresses, which causes Serf to pick one of them at random on startup. Keeping it short, sometimes the ingress address was selected, which resulted in even weirder behavior (e.g. Kong startup would just hang).
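
(A quick way to confirm which networks and IPs a container actually got, for anyone debugging the same thing:

docker inspect -f '{{json .NetworkSettings.Networks}}' container_name_or_id

In the scenario described above, that should list both the ingress network and the user overlay network, each with its own address.)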

Anyway, in the Docker world we were able to use the following solution (without resorting to net-tools and awk):

# hostname -i resolves the container's own IP; no extra packages needed
IPADDR=$(hostname -i)

echo "Starting Kong..."
export KONG_CLUSTER_LISTEN="${IPADDR}:7946"
kong start --vv

The above was added to a custom Dockerfile built on top of the Mashape/kong Dockerfile.

@thibaultcha
Member

Considering this resolved, as Kong 0.10.1 shipped with a Serf downgrade back to 0.7.0. Future versions of Kong will not even need Serf anymore :)

Thanks!

@endeepak

@thibaultcha This is reproducible with Kong 0.9.9, which comes with Serf 0.7.0. I faced this clustering issue (Kong/docker-kong#93) with Kong in Docker Swarm.

The workaround in #2182 (comment) by @saamalik fixed it. Thanks!
