
node unable to join cluster #2182

Closed

tumluliu opened this issue Mar 9, 2017 · 19 comments

@tumluliu

tumluliu commented Mar 9, 2017

Summary

After upgrading to 0.10, the two nodes cannot see each other in the cluster :(

Steps To Reproduce

  1. Install Kong 0.10.0 on a node via deb (let me call it the master node, although I know there is no master/slave distinction in Kong)
  2. Install Kong 0.10.0 on another node via Docker (let me call it the slave node)

Additional Details & Logs

  • Kong version 0.10.0
  • Ubuntu 16.04

Both nodes are healthy, and I can see them in the database's nodes table as follows:

name                                                          cluster_listening_address  created_at
1913ec66b39d_0.0.0.0:7946_50510ee9a9914c63965682c52956480d    127.0.0.1:7946             2017-03-09 13:25:57
ors-gateway_0.0.0.0:7946_c1a37772c22843b1bd1e7e442b10e3f3     127.0.0.1:7946             2017-03-09 08:51:06

But they just cannot see each other. When executing curl http://127.0.0.1:8001/cluster, I get the following response on each of them:

master node

{
    "data": [
        {
            "address": "127.0.0.1:7946",
            "name": "1913ec66b39d_0.0.0.0:7946_50510ee9a9914c63965682c52956480d",
            "status": "alive"
        }
    ],
    "total": 1
}

slave node

{
    "data": [
        {
            "address": "127.0.0.1:7946",
            "name": "1913ec66b39d_0.0.0.0:7946_50510ee9a9914c63965682c52956480d",
            "status": "alive"
        }
    ],
    "total": 1
}

curl http://127.0.0.1:8001/cluster/nodes/ returns nothing but

{
    "message": "Not found"
}

And kong cluster members returns only one alive node on both of them. The clustering-related config on the master node:

cluster_listen = 0.0.0.0:7946
cluster_listen_rpc = 127.0.0.1:7373
cluster_ttl_on_failure = 3600
cluster_profile = lan

The docker start command for the slave node:

#!/bin/bash
docker run -d --name kong \
    -e "KONG_LOG_LEVEL=info" \
    -e "KONG_PG_HOST=DB_IP_ADDR" \
    -e "KONG_PG_DATABASE=kong" \
    -e "KONG_PG_USER=kong" \
    -e "KONG_PG_PASSWORD=******" \
    -p 8000:8000 \
    -p 8443:8443 \
    -p 8001:8001 \
    -p 7946:7946 \
    -p 7946:7946/udp \
    kong

What could be the problem? Any ideas? Thanks a lot!

BTW, they could definitely see each other in 0.9.8 before this upgrade.

@Tieske
Member

Tieske commented Mar 9, 2017

First thought: both Kong nodes think they run on 127.0.0.1, even though the Docker one is abstracted through an overlay/NAT network.

@tumluliu
Author

tumluliu commented Mar 9, 2017

@Tieske yes they are. But the config was just fine in 0.9.8. Should I change anything in 0.10's config? Anyway, I will try. Thanks a lot!

@Tieske
Member

Tieske commented Mar 9, 2017

You probably need to tweak the cluster_listen and cluster_advertise properties, see https://getkong.org/docs/0.10.x/network/
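
For example, a minimal sketch (assuming the node's routable address is 192.168.1.10; substitute your own, and both values can also be supplied as the KONG_CLUSTER_LISTEN / KONG_CLUSTER_ADVERTISE environment variables):

cluster_listen = 192.168.1.10:7946
cluster_advertise = 192.168.1.10:7946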

@tumluliu
Author

tumluliu commented Mar 9, 2017

@Tieske ok, problem solved. I just figured out that cluster_listen and cluster_advertise CANNOT be left at the default value 0.0.0.0:7946. I really suggest emphasising this in the documentation's network section, because this behavior changed from 0.9 to 0.10. For anyone with a similar problem, please change the cluster_listen config to your_ip_addr:7946 instead of 0.0.0.0:7946. For the Docker case, this is my start command that works:

docker run -d --name kong \
    -e "KONG_LOG_LEVEL=info" \
    -e "KONG_PG_HOST=YOUR_POSTGRES_DB_SERVER" \
    -e "KONG_PG_DATABASE=kong" \
    -e "KONG_PG_USER=kong" \
    -e "KONG_PG_PASSWORD=YOUR_PASSWORD" \
    -e "KONG_CLUSTER_ADVERTISE=YOUR_HOST_IP_ADDR:7946" \
    -e "KONG_CLUSTER_LISTEN=YOUR_DOCKER_CONTAINER_INTERNAL_IP_ADDR:7946" \
    -p 8000:8000 \
    -p 8443:8443 \
    -p 8001:8001 \
    -p 7946:7946 \
    -p 7946:7946/udp \
    kong

YOUR_DOCKER_CONTAINER_INTERNAL_IP_ADDR is 172.17.0.2 on my machine. To get this address, you can just run

docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' container_name_or_id

The command is from this answer on stackoverflow.

@jhenry82

jhenry82 commented Mar 9, 2017

I just ran into this testing 0.10 as well. The docs and Kong's behavior are out of sync; I'm not sure which is correct.

cluster_advertise

By default, the cluster_listen address is advertised over the cluster. If the cluster_listen host is '0.0.0.0', then the first local, non-loopback IPv4 address will be advertised to other nodes. However, in some cases (specifically NAT traversal), there may be a routable address that cannot be bound to. This flag enables advertising a different address to support this.

Specifically "the first local, non-loopback IPv4 address will be advertised"

However, with cluster_advertise unset, the node advertises itself as 127.0.0.1 (I can see it in the "nodes" DB table). My nodes are VMs, not Docker containers, and each has a 10.x.x.x IP available, but that is not the IP Kong picked to advertise. As @tumluliu said, this exact config did the right thing in terms of clustering in 0.9.8.
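
Presumably explicitly setting the advertise address on each VM works around it for now, e.g. (a sketch, with 10.0.0.12 standing in for the VM's actual 10.x.x.x address):

cluster_advertise = 10.0.0.12:7946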

@tumluliu
Author

tumluliu commented Mar 9, 2017

@jhenry82 I had exactly the same idea as you just 5 minutes ago, then I saw your reply 😆

I really think this is a bug that has existed since 0.10 rc3, as #2037 mentions. I did a little digging, but could not find where this cluster_listen conf value gets set and stored in the database. What I did find is that in the function normalize_ipv4, 0.0.0.0 is not given any special processing. I don't know how @Tieske can ensure it gets interpreted as the first non-loopback IPv4 address instead of 127.0.0.1. Actually, when I tried pinging 0.0.0.0 on my local machine, I got

PING 0.0.0.0 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.028 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.019 ms
...

btw, I don't have this issue in 0.9.8.

@tumluliu
Author

tumluliu commented Mar 9, 2017

I just noticed that there is a lua-ip project from your team for getting the first non-loopback IPv4 address. Why not use it in Kong?

@thibaultcha
Member

I believe this behavior comes directly from Serf, and Kong 0.10.0 bumps the Serf version from 0.7.0 to 0.8.0, so that could be the reason why. More details:

  • Here is where the node name you see in the DB comes from: prefix_handler.lua
  • Here is where the node inserts itself in the DB: serf.lua
  • The IP inserted in the cluster_listening_address column comes from running the underlying serf members command on the local Serf agent started by Kong: serf.lua

I just noticed that there is a lua-ip project from your team for getting the first non-loopback IPv4 address. Why not use it in Kong?

This is legacy, untested, and also undesired as it introduces one more of those C module dependencies we are trying to get rid of.
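
As a side note on the last bullet above: you can query the same Serf agent Kong starts to see exactly what it will advertise, for example (assuming cluster_listen_rpc is the 127.0.0.1:7373 shown earlier in this thread):

serf members -rpc-addr=127.0.0.1:7373

Whatever address appears there is what ends up in the cluster_listening_address column.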

@thibaultcha
Member

@tumluliu @jhenry82 To confirm, what is the result of serf members on your nodes, ideally with both 0.9.8 and 0.10.0? That would be really helpful. Thanks!

@tumluliu
Author

@thibaultcha thanks a lot for the information. For the moment, I can only provide the serf agent and serf members results for the machine running Kong 0.10, whose cluster_listen is set to 0.0.0.0:7946:

$ serf agent
==> Starting Serf agent...
==> Starting Serf agent RPC...
==> Serf agent running!
         Node name: 'ors-gateway'
         Bind addr: '0.0.0.0:7946'
          RPC addr: '127.0.0.1:7373'
         Encrypted: false
          Snapshot: false
           Profile: lan

==> Log data will now stream in as it occurs:

    2017/03/10 09:12:02 [INFO] agent: Serf agent starting
    2017/03/10 09:12:02 [INFO] serf: EventMemberJoin: ors-gateway 127.0.0.1
    2017/03/10 09:12:03 [INFO] agent: Received event: member-join
$ serf members
ors-gateway_0.0.0.0:7946_c1a37772c22843b1bd1e7e442b10e3f3  127.0.0.1:7946  alive

In the Docker container of the slave node:

$ docker exec kong serf members
40ebb2b13c40_0.0.0.0:7946_e7b02c6b941e405eb2198574598e9c11  127.0.0.1:7946  alive

Unfortunately, I haven't found an appropriate machine to test 0.9.8 on. I tried installing it on my MacBook with the pkg package, but Serf cannot find a private IP address and fails with this error message:

$ serf agent
==> Starting Serf agent...
==> Failed to start the Serf agent: Error creating Serf: Failed to create memberlist: No private IP address found, and explicit IP not provided
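
(Presumably passing an explicit address, e.g. serf agent -bind=<your_ip>:7946, would get past that check since the error says no explicit IP was provided, but I haven't retried the 0.9.8 pkg install that way.)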

BTW, the Serf version for my Kong 0.10 is 0.8.0, while for Kong 0.9.8 it is 0.7.0. When I find another machine that can test 0.9.8, I will post the serf members results. Anyway, I think the cause is becoming clearer: Serf wrongly returns 127.0.0.1 as the first non-loopback local IPv4 address.

@tumluliu
Author

tumluliu commented Mar 10, 2017

I have got the serf members information from a 0.9.8 Kong Docker container:

$ docker exec kong-0.9.8 serf version
Serf v0.7.0
Agent Protocol: 4 (Understands back to: 2)
$ docker exec kong-0.9.8 serf members
fdebb3fd63aa_0.0.0.0:7946_d73429383a0f438cb033c99932ea8469  172.17.0.2:7946    alive
ors-gateway_0.0.0.0:7946_c1a37772c22843b1bd1e7e442b10e3f3   192.168.2.17:7946  alive

The IPv4 address 172.17.0.2 picked up by Serf 0.7.0 is correct. And you can also see that the master node, whose cluster_listen is explicitly set to its local IP 192.168.2.17, can be found by this Kong node. Hope that helps, @thibaultcha.

@TransactCharlie

TransactCharlie commented Mar 15, 2017

Hi everyone.

I'm using docker-compose for local dev, and we use AWS ECS as our production Docker clustering solution. This makes working out what the cluster_listen address should be problematic.

Until this bug is fixed, we've come up with the following workaround:

I ended up writing a thin wrapper that we run before starting Kong. It requires the net-tools package to be installed on top of the official Kong image (yum install -y net-tools).

It works out the correct address to use at runtime:

#!/bin/sh
# Grab the container's eth0 IPv4 address (requires net-tools for ifconfig);
# the awk strips any /CIDR suffix from the field after "inet"
IP_ADDR=`ifconfig eth0 | awk '$1 == "inet" {gsub(/\/.*$/, "", $2); print $2}'`
echo "SETTING IP_ADDR FOR KONG CLUSTERING TO: ${IP_ADDR}"

# Make Serf listen on the routable address instead of 0.0.0.0/127.0.0.1
export KONG_CLUSTER_LISTEN="${IP_ADDR}:7946"
<start kong>

zwmlzaq added a commit to zwmlzaq/docker-kong that referenced this issue Mar 20, 2017
Append solution on Kong/kong#2182 for clustering...
@udangel-r7

I think hashicorp/memberlist#102 is the underlying issue, and Serf 0.8.2 should address this.

@Tieske
Member

Tieske commented Mar 24, 2017

As we're planning to remove the Serf dependency altogether, we're now leaning towards reverting Serf back to version 0.7 for the next release (0.10.1).

@udangel-r7

@Tieske what are you planning to replace Serf with? Hitting the database more often? I am wondering because we are looking into deploying Kong to more endpoints.

@MrSaints

MrSaints commented Mar 27, 2017

@TransactCharlie's trick worked. Though, since I'm using it inside an AWS AMI, I found hostname -I | cut -d' ' -f1 to be a lot more elegant when used with user data (but there are still a lot of steps involved in passing this to the container itself).

EDIT: On ECS, use 0.10.1, and set the network mode to host to save yourself time.
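
(With host networking the container shares the host's network stack, so Serf binds and advertises the instance's own IP directly. A rough plain-Docker equivalent, as a sketch rather than a full command:

docker run -d --name kong --network host -e "KONG_PG_HOST=YOUR_POSTGRES_DB_SERVER" kong

The -p mappings from the earlier commands become unnecessary, since the ports are exposed on the host itself.)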

@saamalik

saamalik commented Apr 21, 2017

Leaving this here as a note for other Docker users. Most of the time, Kong would not start properly once we introduced a persistent Postgres volume.

Long story short, we were also publishing a port on the Docker service, which causes the service containers to be attached to two networks: the user overlay network and the default ingress network. With both networks attached, the container has at least two interfaces with non-loopback IPv4 addresses, which causes Serf to pick one of them at random on startup. Keeping it short, sometimes the ingress address was selected, which resulted in even weirder behavior (e.g. Kong startup would just hang).
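
(A quick way to confirm which networks and IPs a container actually got, for anyone debugging the same thing:

docker inspect -f '{{json .NetworkSettings.Networks}}' container_name_or_id

In the scenario described above, that should list both the ingress network and the user overlay network, each with its own address.)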

Anyway, in the Docker world we were able to use the following solution (without resorting to net-tools and awk):

# hostname -i resolves the container's own IP; no extra packages needed
IPADDR=$(hostname -i)

echo "Starting Kong..."
export KONG_CLUSTER_LISTEN="${IPADDR}:7946"
kong start --vv

The above was added to a custom Dockerfile built on top of the Mashape/kong Dockerfile.

@thibaultcha
Member

Considering this resolved, as Kong 0.10.1 shipped with a Serf downgrade back to 0.7.0. Future versions of Kong will not even need Serf anymore :)

Thanks!

@endeepak

@thibaultcha This is reproducible with Kong 0.9.9, which comes with Serf 0.7.0. I faced this clustering issue (Kong/docker-kong#93) with Kong in Docker Swarm.

The workaround in #2182 (comment) by @saamalik fixed it. Thanks!
