clients flap after upgrading to 0.4.1 from 0.3.2 #1641

Closed
ketzacoatl opened this issue Aug 22, 2016 · 30 comments

@ketzacoatl
Contributor

I have Terraform modules to deploy consul/nomad servers and clients as auto-scaling groups on AWS. When I run nomad 0.3.2 with consul 0.6.4 and my own form of auto-discovery for the nomad servers and clients, nomad is stable and has run well. After upgrading to 0.4.1, I updated my config templates so nomad would use its native auto-discovery through consul (and disabled my previous form of auto-discovery). With those changes, the nomad servers are stable and reliably boot and form their quorum. However, I have been unable to get the clients to stabilize: they flap up and down (status flipping from down to ready and back).

Nomad version

0.4.1

Operating system and Environment details

Linux Ubuntu 14.04 AMD64, 3.13.0-74-generic, consul 0.6.4, upstart, nomad linux executable (not docker).

Issue

I expect the nomad clients to auto-discover their servers through consul, and I see log messages from the nomad client confirming that is happening:

    2016/08/21 17:14:10.892388 [WARN] client.rpcproxy: No servers available
    2016/08/21 17:14:10.892500 [WARN] client.rpcproxy: No servers available
    2016/08/21 17:14:10.892749 [ERR] client: failed to query for node allocations: no known servers
    2016/08/21 17:14:10.924250 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["us-east-1"]
    2016/08/21 17:14:10.930018 [DEBUG] client.consul: bootstrap adding following Servers: ["10.10.10.13:4647" "10.10.10.12:4647" "10.10.11.6:4647"]
    2016/08/21 17:14:15.290166 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["us-east-1"]
    2016/08/21 17:14:15.297537 [DEBUG] client.consul: bootstrap adding following Servers: ["10.10.10.13:4647" "10.10.10.12:4647" "10.10.11.6:4647"]
    2016/08/21 17:14:16.501042 [DEBUG] http: Request /v1/agent/servers (557.928µs)
    2016/08/21 17:14:20.228296 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["us-east-1"]
    2016/08/21 17:14:20.233438 [DEBUG] client.consul: bootstrap adding following Servers: ["10.10.10.13:4647" "10.10.10.12:4647" "10.10.11.6:4647"]
    2016/08/21 17:14:24.998500 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["us-east-1"]
    2016/08/21 17:14:25.003561 [DEBUG] client.consul: bootstrap adding following Servers: ["10.10.10.13:4647" "10.10.10.12:4647" "10.10.11.6:4647"]
    2016/08/21 17:14:26.503302 [DEBUG] http: Request /v1/agent/servers (1.293321ms)
    2016/08/21 17:14:29.488031 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["us-east-1"]
    2016/08/21 17:14:29.492944 [DEBUG] client.consul: bootstrap adding following Servers: ["10.10.10.13:4647" "10.10.10.12:4647" "10.10.11.6:4647"]
    2016/08/21 17:14:31.446417 [DEBUG] client: node registration complete

At that point, the client can report on jobs:

client# nomad status
No running jobs

and the node can be seen on a nomad server:

server# nomad node-status
ID      DC             Name    Class   Drain  Status
537...  foo.us-east-1  ...     <none>  false  ready
c05...  foo.us-east-1  ...     <none>  false  down
eb1...  foo.us-east-1  ...     <none>  false  ready
9b6...  foo.us-east-1  ...     <none>  false  ready
9fc...  foo.us-east-1  ...     <none>  false  ready
c3f...  foo.us-east-1  ...     <none>  false  down
834...  foo.us-east-1  ...     <none>  false  ready
11d...  foo.us-east-1  ...     <none>  false  down

(These nodes are flapping, with some up and some down; eventually they will all be down for a while before coming back online for short stretches.)

Eventually, the client goes down, and nomad status looks like:

client# nomad status
Error querying jobs: Unexpected response code: 500 (no known servers)

and the client log says:

    2016/08/22 18:52:43.457042 [WARN] client.rpcproxy: No servers available
    2016/08/22 18:52:43.457201 [ERR] client: heartbeating failed: failed to update status: no known servers
    2016/08/22 18:52:50.894438 [DEBUG] http: Request /v1/agent/servers (78.711µs)
    2016/08/22 18:53:00.895614 [DEBUG] http: Request /v1/agent/servers (234.545µs)
    2016/08/22 18:53:07.242744 [WARN] client.rpcproxy: No servers available
    2016/08/22 18:53:07.242820 [ERR] http: Request /v1/jobs, error: no known servers

If I wait a little while, the client will re-register with the servers auto-discovered via consul (same logs as shown above), and nomad status works again. The clients seem to be up for 30 seconds or so, then down for about the same, or maybe a little longer. Eventually, you'll catch them all down:

ID      DC             Name     Class   Drain  Status
537...  foo.us-east-1  ...      <none>  false  down
c05...  foo.us-east-1  ...      <none>  false  down
eb1...  foo.us-east-1  ...      <none>  false  down
9b6...  foo.us-east-1  ...      <none>  false  down
9fc...  foo.us-east-1  ...      <none>  false  down
c3f...  foo.us-east-1  ...      <none>  false  down
834...  foo.us-east-1  ...      <none>  false  down
11d...  foo.us-east-1  ...      <none>  false  down

But then they'll start re-registering.

Reproduction steps

  1. deploy nomad to a cluster of 3 server nodes, using the config included below
  2. deploy a cluster running the nomad client, using the config included below
  3. tail the logs on both the clients and the servers; they will auto-discover each other and then fail

I am running the executable on the host directly, on Ubuntu 14.04 with upstart:

respawn
chdir /var/lib/nomad
script
  exec /usr/local/bin/nomad agent -client -config /etc/nomad/config.json -config /etc/nomad/conf.d/
end script

Nomad Server config

{
  "addresses": {"serf": "10.10.11.6","rpc": "10.10.11.6",
    "http": "10.10.11.6"
  },
  "datacenter": "foo.us-east-1",
  "data_dir": "/var/lib/nomad/tmp/",
  "enable_debug": false,
  "enable_syslog": true,
  "log_level": "DEBUG",
  "name": "server...",
  "region": "us-east-1",
  "syslog_facility": "LOCAL0",
  "consul": {
    "address": "10.10.11.6:8500",
    "auto_advertise": true,
    "client_auto_join": true,
    "server_auto_join": true,
    "server_service_name": "nomad",
    "client_service_name": "nomad-client",
    "token": "FOOBAR"
  },
  "server": {
    "bootstrap_expect": 3,
    "retry_join": [
      "nomad.service.consul"
      ],
    "retry_interval": "15s",
    "enabled": true 
  }
}

Nomad Client config

{
  "addresses": {
    "http": "10.10.13.141"
  },
  "datacenter": "foo.us-east-1",
  "data_dir": "/var/lib/nomad/tmp/",
  "enable_debug": false,
  "enable_syslog": true,
  "log_level": "DEBUG",
  "name": "client...",
  "region": "us-east-1",
  "syslog_facility": "LOCAL0",
  "consul": {
    "address": "127.0.0.1:8500",
    "auto_advertise": true,
    "client_auto_join": true,
    "server_auto_join": true,
    "server_service_name": "nomad",
    "client_service_name": "nomad-client",
    "token": "FOOBAR"
  },
  "client": {
    "options": {
      "driver.raw_exec.enable": "1"
    },
    "meta": {
      "tier": "..."
    },
    "enabled": true 
  }
}

Nomad Client logs

Here is a client starting up, searching consul for nomad, finding nomad, registering with the servers, and then losing that connection and failing to find any servers:

    Loaded configuration from /etc/nomad/config.json
==> Starting Nomad agent...
==> Nomad agent configuration:

                 Atlas: <disabled>
                Client: true
             Log Level: TRACE
                Region: us-east-1 (DC: foo.us-east-1)
                Server: false

==> Nomad agent started! Log data will stream in below:

    2016/08/21 17:14:10.876786 [INFO] client: using state directory /var/lib/nomad/tmp/client
    2016/08/21 17:14:10.876879 [INFO] client: using alloc directory /var/lib/nomad/tmp/alloc
    2016/08/21 17:14:10.876910 [DEBUG] client: built-in fingerprints: [arch cgroup cpu env_aws env_gce host memory network nomad storage]
    2016/08/21 17:14:10.877085 [INFO] fingerprint.cgroups: cgroups are available
    2016/08/21 17:14:10.877235 [DEBUG] fingerprint.cpu: frequency: 2400 MHz
    2016/08/21 17:14:10.877258 [DEBUG] fingerprint.cpu: core count: 2
    2016/08/21 17:14:10.877353 [DEBUG] client: fingerprinting cgroup every 15s
    2016/08/21 17:14:10.887966 [DEBUG] fingerprint.env_gce: Could not read value for attribute "machine-type"
    2016/08/21 17:14:10.889513 [DEBUG] fingerprint.network: Detected interface eth0 with IP 10.10.13.141 during fingerprinting
    2016/08/21 17:14:10.890253 [WARN] fingerprint.network: Unable to parse Speed in output of '/sbin/ethtool eth0'
    2016/08/21 17:14:10.890360 [DEBUG] fingerprint.network: Unable to read link speed from /sys/class/net/eth0/speed
    2016/08/21 17:14:10.890418 [DEBUG] fingerprint.network: Unable to read link speed; setting to default 100
    2016/08/21 17:14:10.891269 [DEBUG] client: applied fingerprints [arch cgroup cpu env_aws host memory network nomad storage]
    2016/08/21 17:14:10.891387 [DEBUG] driver.docker: using client connection initialized from environment
    2016/08/21 17:14:10.892040 [DEBUG] driver.exec: exec driver is enabled
    2016/08/21 17:14:10.892143 [DEBUG] client: available drivers [docker exec]
    2016/08/21 17:14:10.892282 [DEBUG] client: fingerprinting docker every 15s
    2016/08/21 17:14:10.892339 [DEBUG] client: fingerprinting exec every 15s
    2016/08/21 17:14:10.892388 [WARN] client.rpcproxy: No servers available
    2016/08/21 17:14:10.892500 [WARN] client.rpcproxy: No servers available
    2016/08/21 17:14:10.892749 [ERR] client: failed to query for node allocations: no known servers
    2016/08/21 17:14:10.924250 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["us-east-1"]
    2016/08/21 17:14:10.930018 [DEBUG] client.consul: bootstrap adding following Servers: ["10.10.10.13:4647" "10.10.10.12:4647" "10.10.11.6:4647"]
    2016/08/21 17:14:15.290166 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["us-east-1"]
    2016/08/21 17:14:15.297537 [DEBUG] client.consul: bootstrap adding following Servers: ["10.10.10.13:4647" "10.10.10.12:4647" "10.10.11.6:4647"]
    2016/08/21 17:14:16.501042 [DEBUG] http: Request /v1/agent/servers (557.928µs)
    2016/08/21 17:14:20.228296 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["us-east-1"]
    2016/08/21 17:14:20.233438 [DEBUG] client.consul: bootstrap adding following Servers: ["10.10.10.13:4647" "10.10.10.12:4647" "10.10.11.6:4647"]
    2016/08/21 17:14:24.998500 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["us-east-1"]
    2016/08/21 17:14:25.003561 [DEBUG] client.consul: bootstrap adding following Servers: ["10.10.10.13:4647" "10.10.10.12:4647" "10.10.11.6:4647"]
    2016/08/21 17:14:26.503302 [DEBUG] http: Request /v1/agent/servers (1.293321ms)
    2016/08/21 17:14:29.488031 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["us-east-1"]
    2016/08/21 17:14:29.492944 [DEBUG] client.consul: bootstrap adding following Servers: ["10.10.10.13:4647" "10.10.10.12:4647" "10.10.11.6:4647"]
    2016/08/21 17:14:31.446417 [DEBUG] client: node registration complete
    2016/08/21 17:14:31.446500 [DEBUG] client: periodically checking for node changes at duration 5s
    2016/08/21 17:14:34.233183 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["us-east-1"]
    2016/08/21 17:14:34.237880 [DEBUG] client.consul: bootstrap adding following Servers: ["10.10.10.13:4647" "10.10.10.12:4647" "10.10.11.6:4647"]
    2016/08/21 17:14:36.504826 [DEBUG] http: Request /v1/agent/servers (285.277µs)
    2016/08/21 17:14:37.782610 [DEBUG] client: state updated to ready
    2016/08/21 17:14:46.506000 [DEBUG] http: Request /v1/agent/servers (94.024µs)
    2016/08/21 17:14:47.878397 [WARN] client.rpcproxy: No servers available
    2016/08/21 17:14:47.878540 [ERR] client: heartbeating failed: failed to update status: no known servers
    2016/08/21 17:14:49.652303 [WARN] client.rpcproxy: No servers available
    2016/08/21 17:14:49.652400 [ERR] http: Request /v1/jobs, error: no known servers
    2016/08/21 17:14:49.652665 [DEBUG] http: Request /v1/jobs (366.805µs)
    2016/08/21 17:14:56.507070 [DEBUG] http: Request /v1/agent/servers (276.932µs)
    2016/08/21 17:14:56.785218 [WARN] client.rpcproxy: No servers available
    2016/08/21 17:14:56.785287 [ERR] client: failed to query for node allocations: no known servers
@ketzacoatl
Contributor Author

Two additional points that I find interesting, but which may be red herrings:

  • Before upgrading to 0.4.1, I first built and deployed 0.4.1-dev (before 0.4.1 was released) on top of a cluster that had been running 0.3.2 (an in-place upgrade). I did not make any config changes for consul or auto-discovery, and the cluster ran my jobs OK. I started seeing problems when I deployed a clean/new cluster with 0.4.1 as a fresh deploy. I can rm -rf the cluster and see the same results with 0.4.1 repeatedly (though I'm suspicious of my configuration too).
  • I worked on this issue yesterday, and left the cluster idle on its own for ~14 to 16 hours. When I came back, it looked like all nodes were fine, online, and not flapping. When I restarted the nomad agent on all hosts in the cluster, the flapping came back. I'm leaving the cluster idle again to see if they begin working again.

@ketzacoatl
Contributor Author

ketzacoatl commented Aug 23, 2016

  • Reverting to a clean deploy of a cluster running 0.3.2 works as expected, no issues with client agents flapping.

@dadgar
Contributor

dadgar commented Aug 23, 2016

Yeah 0.3.2 had no auto-clustering.

@ketzacoatl
Contributor Author

I had a "home-brew" auto-clustering based on consul and watches and formula to generate the configs so clients found servers through consul. I disabled my auto-clustering method for deploying 0.4.1, and the client/server find each other, but the client flaps.

@dadgar dadgar added this to the v0.5.0 milestone Aug 23, 2016
@schmichael
Member

@ketzacoatl If you get a chance, could you dump consul's view of the world while clients are having this problem? Is it possible there are stale entries in consul causing clients to flap?

http://127.0.0.1:8500/v1/catalog/service/nomad?pretty on a node with consul should do it.
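For anyone following along, here is a minimal sketch of the same query using the Consul Go API client (github.com/hashicorp/consul/api). It assumes a local agent at the default address and that the service name is "nomad", matching server_service_name in the configs above.

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Connect to the local Consul agent (default: 127.0.0.1:8500).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Same data as GET /v1/catalog/service/nomad?pretty.
	services, _, err := client.Catalog().Service("nomad", "", nil)
	if err != nil {
		log.Fatal(err)
	}

	// Print each registered nomad server; stale entries would show up here.
	for _, s := range services {
		fmt.Printf("node=%s addr=%s service_addr=%s port=%d\n",
			s.Node, s.Address, s.ServiceAddress, s.ServicePort)
	}
}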

@magiconair

I'm also having this problem on a completely clean consul/nomad setup. I just ran terraform to set up a consul/nomad cluster and the nomad clients are flapping. consul 0.6.4, nomad 0.4.1, Ubuntu 14.04.

@magiconair

I've found the problem in my setup. The example docs https://www.nomadproject.io/docs/cluster/bootstrapping.html explain how to use nomad with consul. The server section is

server {
    enabled = true
    bootstrap_expect = 3
}

and the client section is

# Replace with the relevant datacenter.
datacenter = "dc1"

client {
    enabled = true
}

The server section does not contain a datacenter statement. I copied it verbatim, which made the clients flap. Adding datacenter = "dc1" to server.hcl fixed this for me.

base.hcl

region = "ams1"
data_dir = "/var/lib/nomad"
bind_addr = "0.0.0.0"
advertise {
    http = "x.x.x.x:4646"
    rpc = "x.x.x.x:4647"
    serf = "x.x.x.x:4648"
}
enable_debug = true

server.hcl

datacenter = "ams1"
server {
    enabled = true
    bootstrap_expect = 3
}

client.hcl

datacenter = "ams1"
client {
    enabled = true
}

@magiconair

I should note that ams1 is also the datacenter I've configured in consul.

@dadgar
Contributor

dadgar commented Sep 15, 2016

@magiconair Could you share the TF from when it was flapping? We are trying to reproduce this issue.

@schmichael
Member

@magiconair Very interesting; thanks for the info! We treat servers from other datacenters as backups in the client and use them only if there are no servers in the same DC. I'm guessing the flapping you're seeing is a bug in this backup server code. I'm actually working on that now, so that should work better in 0.5.

Fixed docs in #1712

Sadly, it looks like @ketzacoatl is experiencing a different issue, though, as the DCs in his configs match.
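To illustrate the same-DC preference schmichael describes above, here is a small Go sketch (not Nomad's actual code; the names and types are illustrative) of splitting discovered servers into primaries and backups based on the client's datacenter:

package main

import "fmt"

// server is a simplified stand-in for a discovered Nomad server.
type server struct {
	Addr string
	DC   string
}

// splitByDC prefers servers in the client's own datacenter and keeps
// servers from other DCs only as backups, used when no local server works.
func splitByDC(clientDC string, discovered []server) (primary, backup []server) {
	for _, s := range discovered {
		if s.DC == clientDC {
			primary = append(primary, s)
		} else {
			backup = append(backup, s)
		}
	}
	return primary, backup
}

func main() {
	// With the docs' original example, servers default to "dc1" while the
	// client is in "ams1", so every server lands on the backup list.
	servers := []server{
		{Addr: "10.0.0.1:4647", DC: "dc1"},
		{Addr: "10.0.0.2:4647", DC: "dc1"},
	}
	primary, backup := splitByDC("ams1", servers)
	fmt.Println("primary:", primary) // empty -> client relies on flaky backup path
	fmt.Println("backup:", backup)
}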

@magiconair

Maybe the issue is a mismatch between region and datacenter (just wild guessing here). @dadgar I can reproduce this reliably. What do you mean by TF?

@magiconair

@schmichael @ketzacoatl maybe the datacenter value is different from the one configured in consul?

@ketzacoatl
Contributor Author

Interesting find, @magiconair. I will make some time this weekend to retest and see what I can find based on that.

@ketzacoatl
Contributor Author

@schmichael, would you be able to share the configs, TF, or other code you are using in an attempt to reproduce?

@dadgar
Contributor

dadgar commented Sep 15, 2016

@magiconair TF-> Terraform.

@dadgar
Contributor

dadgar commented Sep 15, 2016

@ketzacoatl It would be great if you could keep us in the loop. It would be good to know whether your issue was due to the same thing magiconair was experiencing!

@ketzacoatl
Contributor Author

I shall @dadgar!

@ketzacoatl
Contributor Author

My weekend was a bit too overloaded, I will update when I can run some tests.

@schmichael
Member

@magiconair I just merged #1735 which fixes the issue you were seeing where clients flap when servers are in another datacenter.

@ketzacoatl If you have a chance to test master, I'd be interested to hear whether it happened to fix your issue too. Bootstrapping/discovery got reworked a bit.

I'm going to close this, but please feel free to reopen if any flapping issues continue.

@ketzacoatl
Contributor Author

ketzacoatl commented Sep 28, 2016

OK, I can run some tests today. @schmichael, is there a .zip I can download or should I build from master and host that?

@dadgar
Contributor

dadgar commented Sep 28, 2016

I would build from master

@ketzacoatl
Contributor Author

OK, thanks for confirming, I'll work on that.

@ketzacoatl
Contributor Author

It looks like there is a build failure on master, though I'm not a Go dev, so maybe I did something wrong. When I run make test or make dev, I see:

~/.go/src/github.com/hashicorp/nomad# make dev
--> Running go fmt
--> Running go generate
==> Removing old directory...
==> Building...
Number of parallel builds: 1

-->     linux/amd64: github.com/hashicorp/nomad

1 errors occurred:
--> linux/amd64 error: exit status 1
Stderr: nomad/job_endpoint.go:4:2: cannot find package "context" in any of:
        /home/ubuntu/.go/src/github.com/hashicorp/nomad/vendor/context (vendor tree)
        /usr/local/go/src/context (from $GOROOT)
        /home/ubuntu/.go/src/context (from $GOPATH)

make: *** [dev] Error 1

Though I do see context directories under $GOPATH:

~/.go/src/github.com/hashicorp/nomad# find $GOPATH -type d -name context
/home/ubuntu/.go/src/github.com/kardianos/govendor/context
/home/ubuntu/.go/src/github.com/hashicorp/nomad/vendor/golang.org/x/net/context

@iverberk
Contributor

@ketzacoatl you need to compile with Go 1.7 (see: https://golang.org/doc/go1.7#context).
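For reference, the failing import is the standard-library context package, which only exists as of Go 1.7. A minimal file like this (illustrative, not from the Nomad source) reproduces the same 'cannot find package "context"' error when built with an older toolchain:

package main

// "context" moved into the standard library in Go 1.7. On Go 1.6 and
// earlier this import fails with: cannot find package "context".
import (
	"context"
	"fmt"
)

func main() {
	ctx := context.Background()
	fmt.Println(ctx)
}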

@ketzacoatl
Contributor Author

ketzacoatl commented Sep 28, 2016

Ohhh... well, thanks for that. I didn't even realize 1.7 was out, and I totally ought to have seen that detail in the README. Sorry for the noise.

@ketzacoatl
Contributor Author

ketzacoatl commented Sep 29, 2016

OK, initial tests confirm this is resolved. I am going to scale up these tests to be a little more sure. To satisfy my curiosity, does anyone know what may have been the issue? Was it the bootstrap refactor, or the RPC retry fix?

Also, @dadgar, how close is 0.5.x to an RC or release? I'm wondering if I should hold out for that, or start running the build I have from yesterday's master.

@schmichael
Member

schmichael commented Sep 29, 2016

@ketzacoatl Scaling up the tests would be great. I'm afraid I can't point to a specific line that broke it in 0.4.1 or fixed it in master -- too much changed. The old code allowed consul discovery and heartbeating from the servers to interleave in unintended ways. The new code ensures that server discovery via Consul only happens when there are no known servers to heartbeat (startup bootstrapping is the most common case, but there are outage situations that could fall back to Consul as well).
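A rough Go sketch of the behavior schmichael describes (illustrative only, not Nomad's implementation; heartbeat and discoverViaConsul are hypothetical stand-ins): the client heartbeats against its known servers and only falls back to Consul discovery when that list is empty.

package main

import (
	"errors"
	"fmt"
)

// heartbeat and discoverViaConsul are hypothetical stand-ins for the
// client's RPC heartbeat and its Consul-based server discovery.
func heartbeat(server string) error {
	if server == "" {
		return errors.New("no server")
	}
	return nil // pretend the heartbeat succeeded
}

func discoverViaConsul() []string {
	return []string{"10.10.10.13:4647", "10.10.10.12:4647", "10.10.11.6:4647"}
}

// tick runs one heartbeat cycle: Consul is consulted only when the client
// has no known servers, instead of racing with heartbeats on every cycle.
func tick(known []string) []string {
	if len(known) == 0 {
		known = discoverViaConsul()
	}
	for _, s := range known {
		if err := heartbeat(s); err == nil {
			return known // a server answered; keep the current list
		}
	}
	return nil // all heartbeats failed; rediscover on the next tick
}

func main() {
	servers := tick(nil) // startup: no known servers, so discover first
	fmt.Println("known servers:", servers)
}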

@ketzacoatl
Contributor Author

I'm pretty certain this is resolved in my deployments!

@schmichael
Member

Fantastic! 0.5 should be out in October.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators on Dec 19, 2022