
Orphaned Health Check #1524

Closed
jshaw86 opened this issue Aug 5, 2016 · 35 comments

jshaw86 commented Aug 5, 2016

Nomad version

nomad 0.4.0

Operating system and Environment details

14.04

Issue

This may be a Consul issue that is just surfacing through the Nomad agent, but it seems a health check can get into a state where the endpoint is checked indefinitely even though the check is no longer registered.

Reproduction steps

Not completely sure, but we scheduled many jobs (~100) serially, tore them down serially, and eventually got into this state.

Nomad Client logs (if appropriate)

Aug  5 20:29:48 nomad-agent1 consul[1167]: agent: http request failed 'http://XXX.XXX.XXX.XXX:34222/ping': Get http://XXX.XXX.XXX.XXX:34222/ping: dial tcp XXX.XXX.XXX.XXX:34222: getsockopt: connection refused

I can share the result of curl localhost:8500/v1/health/state/any offline to show there is no health check registered under this IP/port, since there is some sensitive information in there.
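
For reference, a minimal sketch of pulling that same data and filtering it for the port in the log line above (jq here is an assumption, and 34222 is just the example port):

curl -s http://localhost:8500/v1/health/state/any | jq '.[] | select(.Output | contains(":34222"))'   # port taken from the log line above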

UPDATE: I got it out of this state by restarting consul

dadgar commented Aug 5, 2016

Hmm,

Sorry I am having a hard time following. What you are saying is that Consul has a health check registered that does not correspond to any Nomad service running on that machine?

jshaw86 commented Aug 6, 2016

@dadgar Correct, I'm seeing the above health check still firing even after the Nomad job/task group has gone away. Here is another example from another Nomad client in the same region:

Aug  6 00:18:22 nomad-agent3:     2016/08/06 00:18:22 [WARN] agent: http request failed 'http://172.16.103.71:21233/ping': Get http://172.16.103.71:21233/ping: dial tcp 172.16.103.71:21233: getsockopt: connection refused

dadgar commented Aug 6, 2016

@jshaw86 Is the check registered in Consul? And are those logs from nomad or consul?

sean- commented Aug 6, 2016

What type of check is this? Is it a script check?

jshaw86 commented Aug 6, 2016

@dadgar curl localhost:8500/v1/health/state/any doesn't show the check. @sean- It's an HTTP check.

sean- commented Aug 7, 2016

@jshaw86 Is this reproducible? When you say you "tore the jobs down serially" how did you do that? If you let the host run for a while, does the check deregister eventually after 4hrs?

jshaw86 commented Aug 7, 2016

@sean- I am able to reproduce it consistently in our staging env by scheduling and tearing down a Nomad job every 10s, one at a time (serially), while also running a Nomad GC once a minute from cron on a Nomad server. A rough sketch of the loop is below.
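
Sketch of that loop, with example.nomad and the job ID example as placeholders, and the GC driven through the HTTP endpoint used later in this thread:

while true; do
  nomad run example.nomad        # example.nomad is a placeholder job file
  sleep 10
  nomad stop -yes example
  sleep 10
done

# on a Nomad server, roughly once a minute from cron:
* * * * * curl -s -X PUT http://localhost:4646/v1/system/gc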

diptanu commented Aug 11, 2016

@jshaw86 When this happens next time, can you share the result of curl http://localhost:8500/v1/agent/checks? We want to see the currently registered Consul checks when this happens.

dadgar commented Aug 12, 2016

So I just ran while true;do nomad run -detach example.nomad && sleep 3 && nomad stop -yes e;done, which registered/deregistered the job 200 times, and could not reproduce.

@jshaw86 It would be great, the next time you get into this condition, to grab the Consul agent logs, the Nomad client logs, and the checks diptanu mentioned. If they are sensitive you can email them to me.

jshaw86 commented Aug 19, 2016

We are still running into this intermittently; here are the client logs:
https://gist.github.com/jshaw86/d6b5c5285e66d61a604a7b9d881e3f87

jshaw86 commented Aug 20, 2016

We just had another one; these are the most concise logs I can provide. The timeline is:

  1. The service was created at 1:16 server time.
  2. We stopped the service around 1:23 server time.
  3. The service did not go away until 1:28 server time.
Aug 20 01:16:31 nomad-agent6 consul[9023]: agent: Synced service '_nomad-executor-12a71626-20f9-0d97-4ced-892efa683610-event_handler_270_2_sjc-blocks-blocks-subkey-demo-36-blockId-216-urlprefix-demo-36_js-before-publish/run/before/sdfdsfdsfsdfsd'
Aug 20 01:16:31 nomad-agent6 nomad-agent6.gold.aws-sjc-1.ps.pn:     2016/08/20 01:16:31 [INFO] agent: Synced service '_nomad-executor-12a71626-20f9-0d97-4ced-892efa683610-event_handler_270_2_sjc-blocks-blocks-subkey-demo-36-blockId-216-urlprefix-demo-36_js-before-publish/run/before/sdfdsfdsfsdfsd'
Aug 20 01:28:07 nomad-agent6 nomad-agent6.gold.aws-sjc-1.ps.pn:     2016/08/20 01:28:07 [INFO] agent: Deregistered service '_nomad-executor-12a71626-20f9-0d97-4ced-892efa683610-event_handler_270_2_sjc-blocks-blocks-subkey-demo-36-blockId-216-urlprefix-demo-36_js-before-publish/run/before/sdfdsfdsfsdfsd'
Aug 20 01:28:07 nomad-agent6 consul[9023]: agent: Deregistered service '_nomad-executor-12a71626-20f9-0d97-4ced-892efa683610-event_handler_270_2_sjc-blocks-blocks-subkey-demo-36-blockId-216-urlprefix-demo-36_js-before-publish/run/before/sdfdsfdsfsdfsd'
Aug 20 01:28:07 nomad-agent6 nomad-agent6.gold.aws-sjc-1.ps.pn:     2016/08/20 01:28:07 [INFO] agent: Deregistered service '_nomad-executor-12a71626-20f9-0d97-4ced-892efa683610-event_handler_270_2_sjc-blocks-blocks-subkey-demo-36-blockId-216-urlprefix-demo-36_js-before-publish/run/before/sdfdsfdsfsdfsd'
Aug 20 01:28:07 nomad-agent6 consul[9023]: agent: Deregistered service '_nomad-executor-12a71626-20f9-0d97-4ced-892efa683610-event_handler_270_2_sjc-blocks-blocks-subkey-demo-36-blockId-216-urlprefix-demo-36_js-before-publish/run/before/sdfdsfdsfsdfsd' 

@ketzacoatl

I have seen similar behavior from 0.3.x and 0.4.x. I am not sure if it is a consul or nomad issue, but something like consul-cli service deregister ... fails to deregister the service - it persists - at least until I do something drastic to consul, like rm -rf the leaders and such. I do not remember if a reboot was enough, but I can test that. I have a "stuck" service now, one that does not want to be deregistered.

slackpad commented Aug 31, 2016

Getting the output of /v1/agent/checks as @diptanu suggested would be good if you can. The /v1/health/state/any call goes up to the servers, so it would be interesting to see whether the agent is in the right state internally but the changes aren't synced to the servers properly, or whether the agent shows the check as well (a quick side-by-side of the two queries is sketched below).
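
A minimal side-by-side of those two views, for whoever hits this next (jq is optional here and only used for pretty-printing):

curl -s http://localhost:8500/v1/agent/checks | jq .          # the agent's local view of registered checks
curl -s http://localhost:8500/v1/health/state/any | jq .      # the view held by the servers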

dadgar commented Aug 31, 2016

@ketzacoatl Did you by chance change the tags on the service?

@ketzacoatl

@dadgar It might be possible, but it's highly unlikely (I rarely change them once set). Also, if I ask consul to deregister the service, consul should remove the entry, no?

dadgar commented Oct 3, 2016

If anyone is hitting this, could you report /v1/agent/checks from Consul and also see if the executor is still alive (ps aux | grep nomad, looking for the alloc ID that has leaked the check)?
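
A rough way to grab both pieces at once, using the executor ID prefix from the Aug 20 logs above purely as an example:

curl -s http://localhost:8500/v1/agent/checks > agent-checks.json
ps aux | grep [n]omad | grep 12a71626    # is an executor still running for the leaked alloc? (ID is an example)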

dadgar added this to the v0.5.0 milestone Oct 3, 2016
jshaw86 commented Oct 3, 2016

@dadgar I did check this a while back; the process was cleaned up, so it seems like it was just on the Consul end, but I will try to repro. It's possible that I could have raced something, because after 5 minutes it does seem to get cleaned up.

dadgar commented Oct 3, 2016

@jshaw86 Hmm, we run a periodic reaping of any checks/services that shouldn't be there every 30 seconds. There was an issue with the executor crashing, fixed in master, that may have been adding an artificial delay.

dadgar commented Oct 3, 2016

@jshaw86 It honestly may just be Consul's own reconciliation with the servers. It may be gone from the agent, but the remote servers still have it because of a short outage, etc.

jshaw86 commented Oct 3, 2016

@dadgar OK, I have some updated information. Another engineer has seen the executor still stick around when the scheduler says the job has been killed, and the health check still exists (we can route to the job in fabio), so you'll have to take our word for it until I'm able to replicate this again. According to other engineers, we have only seen this in production lately. These are the logs from our latest experience with the issue, from last week (no ps aux or /v1/agent/checks though, sorry).

cody@nomad-sched1:~$ sudo grep --after 10 "block-1377" /var/log/syslog
Sep 30 16:22:52 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-iad-1 (172.591838ms)
Sep 30 16:22:52 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:22:52.574966 [DEBUG] http: Request /v1/job/block-1377?region=aws-iad-1 (172.591838ms)
Sep 30 16:22:52 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:22:52.600945 [DEBUG] http: Request /v1/job/block-1377?region=aws-sjc-1 (10.516279ms)
Sep 30 16:22:52 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-sjc-1 (10.516279ms)
Sep 30 16:22:52 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-dub-1 (296.87989ms)
Sep 30 16:22:52 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:22:52.904859 [DEBUG] http: Request /v1/job/block-1377?region=aws-dub-1 (296.87989ms)
Sep 30 16:22:52 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:22:52 [DEBUG] memberlist: TCP connection from=54.153.25.24:53126
Sep 30 16:22:52 nomad-sched1 nomad[7917]: memberlist: TCP connection from=54.153.25.24:53126
Sep 30 16:22:53 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-fra-1 (314.066651ms)
Sep 30 16:22:53 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:22:53.227528 [DEBUG] http: Request /v1/job/block-1377?region=aws-fra-1 (314.066651ms)
Sep 30 16:22:53 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-hnd-1 (217.344405ms)
Sep 30 16:22:53 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:22:53.451414 [DEBUG] http: Request /v1/job/block-1377?region=aws-hnd-1 (217.344405ms)
Sep 30 16:22:53 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-gru-1 (381.372145ms)
Sep 30 16:22:53 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:22:53.839125 [DEBUG] http: Request /v1/job/block-1377?region=aws-gru-1 (381.372145ms)
Sep 30 16:23:00 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:00.379889 [DEBUG] http: Request /v1/status/peers (640.963µs)
Sep 30 16:23:00 nomad-sched1 nomad[7917]: http: Request /v1/status/peers (640.963µs)
Sep 30 16:23:01 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:01.957803 [DEBUG] http: Request /v1/job/block-1377?region=aws-iad-1 (170.477387ms)
Sep 30 16:23:01 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-iad-1 (170.477387ms)
Sep 30 16:23:01 nomad-sched1 nomad[7917]: nomad.fsm: DeleteJob failed: job not found
Sep 30 16:23:01 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:01 [ERR] nomad.fsm: DeleteJob failed: job not found
Sep 30 16:23:01 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-sjc-1 (28.729859ms)
Sep 30 16:23:01 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:01.993402 [DEBUG] http: Request /v1/job/block-1377?region=aws-sjc-1 (28.729859ms)
Sep 30 16:23:01 nomad-sched1 nomad[7917]: worker: dequeued evaluation 759c5b0d-25ec-6a62-28d7-63809fce92e0
Sep 30 16:23:01 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:01.993642 [DEBUG] worker: dequeued evaluation 759c5b0d-25ec-6a62-28d7-63809fce92e0
Sep 30 16:23:02 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:02.094056 [DEBUG] sched: <Eval '759c5b0d-25ec-6a62-28d7-63809fce92e0' JobID: 'block-1377'>: allocs: (place 0) (update 0) (migrate 0) (stop 0) (ignore 0) (lost 0)
Sep 30 16:23:02 nomad-sched1 nomad[7917]: sched: <Eval '759c5b0d-25ec-6a62-28d7-63809fce92e0' JobID: 'block-1377'>: allocs: (place 0) (update 0) (migrate 0) (stop 0) (ignore 0) (lost 0)
Sep 30 16:23:02 nomad-sched1 nomad[7917]: sched: <Eval '759c5b0d-25ec-6a62-28d7-63809fce92e0' JobID: 'block-1377'>: setting status to complete
Sep 30 16:23:02 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:02.094123 [DEBUG] sched: <Eval '759c5b0d-25ec-6a62-28d7-63809fce92e0' JobID: 'block-1377'>: setting status to complete
Sep 30 16:23:02 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:02.099487 [DEBUG] worker: updated evaluation <Eval '759c5b0d-25ec-6a62-28d7-63809fce92e0' JobID: 'block-1377'>
Sep 30 16:23:02 nomad-sched1 nomad[7917]: worker: updated evaluation <Eval '759c5b0d-25ec-6a62-28d7-63809fce92e0' JobID: 'block-1377'>
Sep 30 16:23:02 nomad-sched1 nomad[7917]: worker: ack for evaluation 759c5b0d-25ec-6a62-28d7-63809fce92e0
Sep 30 16:23:02 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:02.100054 [DEBUG] worker: ack for evaluation 759c5b0d-25ec-6a62-28d7-63809fce92e0
Sep 30 16:23:02 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-dub-1 (152.851468ms)
Sep 30 16:23:02 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:02.152555 [DEBUG] http: Request /v1/job/block-1377?region=aws-dub-1 (152.851468ms)
Sep 30 16:23:02 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-fra-1 (324.76766ms)
Sep 30 16:23:02 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:02.483807 [DEBUG] http: Request /v1/job/block-1377?region=aws-fra-1 (324.76766ms)
Sep 30 16:23:02 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-hnd-1 (218.299436ms)
Sep 30 16:23:02 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:02.708572 [DEBUG] http: Request /v1/job/block-1377?region=aws-hnd-1 (218.299436ms)
Sep 30 16:23:02 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-gru-1 (195.900792ms)
Sep 30 16:23:02 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:02.910850 [DEBUG] http: Request /v1/job/block-1377?region=aws-gru-1 (195.900792ms)
Sep 30 16:23:02 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:02 [DEBUG] memberlist: TCP connection from=54.153.25.24:53135
Sep 30 16:23:02 nomad-sched1 nomad[7917]: memberlist: TCP connection from=54.153.25.24:53135
Sep 30 16:23:10 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:10.381828 [DEBUG] http: Request /v1/status/peers (612.067µs)
Sep 30 16:23:10 nomad-sched1 nomad[7917]: http: Request /v1/status/peers (612.067µs)
Sep 30 16:23:12 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:12 [DEBUG] memberlist: TCP connection from=54.153.25.24:53138
Sep 30 16:23:12 nomad-sched1 nomad[7917]: memberlist: TCP connection from=54.153.25.24:53138
Sep 30 16:23:20 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:20.383669 [DEBUG] http: Request /v1/status/peers (824.531µs)
Sep 30 16:23:20 nomad-sched1 nomad[7917]: http: Request /v1/status/peers (824.531µs)
Sep 30 16:23:22 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:22 [DEBUG] memberlist: TCP connection from=52.210.3.117:33295
Sep 30 16:23:22 nomad-sched1 nomad[7917]: memberlist: TCP connection from=52.210.3.117:33295
Sep 30 16:23:22 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-iad-1 (96.402526ms)
Sep 30 16:23:22 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:22.705804 [DEBUG] http: Request /v1/job/block-1377?region=aws-iad-1 (96.402526ms)
Sep 30 16:23:22 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:22 [ERR] nomad.fsm: DeleteJob failed: job not found
Sep 30 16:23:22 nomad-sched1 nomad[7917]: nomad.fsm: DeleteJob failed: job not found
Sep 30 16:23:22 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-sjc-1 (10.112998ms)
Sep 30 16:23:22 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:22.722866 [DEBUG] http: Request /v1/job/block-1377?region=aws-sjc-1 (10.112998ms)
Sep 30 16:23:22 nomad-sched1 nomad[7917]: worker: dequeued evaluation c78ceb3e-2555-b5ae-545b-ffecc113cd1c
Sep 30 16:23:22 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:22.723103 [DEBUG] worker: dequeued evaluation c78ceb3e-2555-b5ae-545b-ffecc113cd1c
Sep 30 16:23:22 nomad-sched1 nomad[7917]: sched: <Eval 'c78ceb3e-2555-b5ae-545b-ffecc113cd1c' JobID: 'block-1377'>: allocs: (place 0) (update 0) (migrate 0) (stop 0) (ignore 0) (lost 0)
Sep 30 16:23:22 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:22.823614 [DEBUG] sched: <Eval 'c78ceb3e-2555-b5ae-545b-ffecc113cd1c' JobID: 'block-1377'>: allocs: (place 0) (update 0) (migrate 0) (stop 0) (ignore 0) (lost 0)
Sep 30 16:23:22 nomad-sched1 nomad[7917]: sched: <Eval 'c78ceb3e-2555-b5ae-545b-ffecc113cd1c' JobID: 'block-1377'>: setting status to complete
Sep 30 16:23:22 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:22.823774 [DEBUG] sched: <Eval 'c78ceb3e-2555-b5ae-545b-ffecc113cd1c' JobID: 'block-1377'>: setting status to complete
Sep 30 16:23:22 nomad-sched1 nomad[7917]: worker: updated evaluation <Eval 'c78ceb3e-2555-b5ae-545b-ffecc113cd1c' JobID: 'block-1377'>
Sep 30 16:23:22 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:22.828752 [DEBUG] worker: updated evaluation <Eval 'c78ceb3e-2555-b5ae-545b-ffecc113cd1c' JobID: 'block-1377'>
Sep 30 16:23:22 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:22.829258 [DEBUG] worker: ack for evaluation c78ceb3e-2555-b5ae-545b-ffecc113cd1c
Sep 30 16:23:22 nomad-sched1 nomad[7917]: worker: ack for evaluation c78ceb3e-2555-b5ae-545b-ffecc113cd1c
Sep 30 16:23:22 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-dub-1 (153.094773ms)
Sep 30 16:23:22 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:22.883081 [DEBUG] http: Request /v1/job/block-1377?region=aws-dub-1 (153.094773ms)
Sep 30 16:23:22 nomad-sched1 nomad[7917]: memberlist: TCP connection from=54.153.25.24:53142
Sep 30 16:23:22 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:22 [DEBUG] memberlist: TCP connection from=54.153.25.24:53142
Sep 30 16:23:23 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:23.221073 [DEBUG] http: Request /v1/job/block-1377?region=aws-fra-1 (331.586744ms)
Sep 30 16:23:23 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-fra-1 (331.586744ms)
Sep 30 16:23:23 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-hnd-1 (132.801342ms)
Sep 30 16:23:23 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:23.361030 [DEBUG] http: Request /v1/job/block-1377?region=aws-hnd-1 (132.801342ms)
Sep 30 16:23:23 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-gru-1 (382.602937ms)
Sep 30 16:23:23 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:23.749917 [DEBUG] http: Request /v1/job/block-1377?region=aws-gru-1 (382.602937ms)
Sep 30 16:23:30 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:30.385578 [DEBUG] http: Request /v1/status/peers (579.83µs)
Sep 30 16:23:30 nomad-sched1 nomad[7917]: http: Request /v1/status/peers (579.83µs)
Sep 30 16:23:32 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:32 [DEBUG] memberlist: TCP connection from=54.153.25.24:53146
Sep 30 16:23:32 nomad-sched1 nomad[7917]: memberlist: TCP connection from=54.153.25.24:53146
Sep 30 16:23:38 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-iad-1 (167.365244ms)
Sep 30 16:23:38 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:38.963120 [DEBUG] http: Request /v1/job/block-1377?region=aws-iad-1 (167.365244ms)
Sep 30 16:23:38 nomad-sched1 nomad[7917]: nomad.fsm: DeleteJob failed: job not found
Sep 30 16:23:38 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:38 [ERR] nomad.fsm: DeleteJob failed: job not found
Sep 30 16:23:38 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-sjc-1 (9.814613ms)
Sep 30 16:23:38 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:38.979790 [DEBUG] http: Request /v1/job/block-1377?region=aws-sjc-1 (9.814613ms)
Sep 30 16:23:39 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-dub-1 (175.725721ms)
Sep 30 16:23:39 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:39.161073 [DEBUG] http: Request /v1/job/block-1377?region=aws-dub-1 (175.725721ms)
Sep 30 16:23:39 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:39.334407 [DEBUG] http: Request /v1/job/block-1377?region=aws-fra-1 (166.677003ms)
Sep 30 16:23:39 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-fra-1 (166.677003ms)
Sep 30 16:23:39 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-hnd-1 (113.739365ms)
Sep 30 16:23:39 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:39.454551 [DEBUG] http: Request /v1/job/block-1377?region=aws-hnd-1 (113.739365ms)
Sep 30 16:23:39 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:39.657075 [DEBUG] http: Request /v1/job/block-1377?region=aws-gru-1 (196.093239ms)
Sep 30 16:23:39 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-gru-1 (196.093239ms)
Sep 30 16:23:39 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:39 [DEBUG] memberlist: TCP connection from=52.196.105.134:49018
Sep 30 16:23:39 nomad-sched1 nomad[7917]: memberlist: TCP connection from=52.196.105.134:49018
Sep 30 16:23:40 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:40.387349 [DEBUG] http: Request /v1/status/peers (571.116µs)
Sep 30 16:23:40 nomad-sched1 nomad[7917]: http: Request /v1/status/peers (571.116µs)
Sep 30 16:23:42 nomad-sched1 nomad[7917]: memberlist: Initiating push/pull sync with: 52.67.133.140:4648
Sep 30 16:23:42 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:42 [DEBUG] memberlist: Initiating push/pull sync with: 52.67.133.140:4648
Sep 30 16:23:42 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:42 [DEBUG] memberlist: TCP connection from=54.153.25.24:53152
Sep 30 16:23:42 nomad-sched1 nomad[7917]: memberlist: TCP connection from=54.153.25.24:53152
Sep 30 16:23:47 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:23:47 [DEBUG] memberlist: TCP connection from=52.28.134.183:35171
Sep 30 16:23:47 nomad-sched1 nomad[7917]: memberlist: TCP connection from=52.28.134.183:35171
--
Sep 30 16:24:14 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:24:14.212563 [DEBUG] http: Request /v1/job/block-1377?region=aws-iad-1 (89.62714ms)
Sep 30 16:24:14 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-iad-1 (89.62714ms)
Sep 30 16:24:14 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:24:14 [ERR] nomad.fsm: DeleteJob failed: job not found
Sep 30 16:24:14 nomad-sched1 nomad[7917]: nomad.fsm: DeleteJob failed: job not found
Sep 30 16:24:14 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:24:14.229740 [DEBUG] http: Request /v1/job/block-1377?region=aws-sjc-1 (10.367554ms)
Sep 30 16:24:14 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-sjc-1 (10.367554ms)
Sep 30 16:24:14 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-dub-1 (291.521385ms)
Sep 30 16:24:14 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:24:14.526976 [DEBUG] http: Request /v1/job/block-1377?region=aws-dub-1 (291.521385ms)
Sep 30 16:24:14 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:24:14.716981 [DEBUG] http: Request /v1/job/block-1377?region=aws-fra-1 (183.547855ms)
Sep 30 16:24:14 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-fra-1 (183.547855ms)
Sep 30 16:24:14 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-hnd-1 (114.117226ms)
Sep 30 16:24:14 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:24:14.837533 [DEBUG] http: Request /v1/job/block-1377?region=aws-hnd-1 (114.117226ms)
Sep 30 16:24:15 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-gru-1 (196.210604ms)
Sep 30 16:24:15 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:24:15.040109 [DEBUG] http: Request /v1/job/block-1377?region=aws-gru-1 (196.210604ms)
Sep 30 16:24:20 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:24:20.394727 [DEBUG] http: Request /v1/status/peers (594.891µs)
Sep 30 16:24:20 nomad-sched1 nomad[7917]: http: Request /v1/status/peers (594.891µs)
Sep 30 16:24:22 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:24:22 [DEBUG] memberlist: TCP connection from=54.153.25.24:53166
Sep 30 16:24:22 nomad-sched1 nomad[7917]: memberlist: TCP connection from=54.153.25.24:53166
Sep 30 16:24:30 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:24:30.396488 [DEBUG] http: Request /v1/status/peers (589.899µs)
Sep 30 16:24:30 nomad-sched1 nomad[7917]: http: Request /v1/status/peers (589.899µs)
Sep 30 16:24:32 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:24:32 [DEBUG] memberlist: TCP connection from=54.153.25.24:53170
Sep 30 16:24:32 nomad-sched1 nomad[7917]: memberlist: TCP connection from=54.153.25.24:53170
Sep 30 16:24:40 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:24:40.398469 [DEBUG] http: Request /v1/status/peers (614.901µs)
Sep 30 16:24:40 nomad-sched1 nomad[7917]: http: Request /v1/status/peers (614.901µs)
--
Sep 30 16:30:11 nomad-sched1 nomad[7917]: http: Request /v1/jobs?prefix=block-1377 (618.76µs)
Sep 30 16:30:11 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:30:11.802702 [DEBUG] http: Request /v1/jobs?prefix=block-1377 (618.76µs)
Sep 30 16:30:12 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:30:12 [DEBUG] memberlist: TCP connection from=54.153.25.24:53294
Sep 30 16:30:12 nomad-sched1 nomad[7917]: memberlist: TCP connection from=54.153.25.24:53294
Sep 30 16:30:20 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:30:20.464923 [DEBUG] http: Request /v1/status/peers (828.174µs)
Sep 30 16:30:20 nomad-sched1 nomad[7917]: http: Request /v1/status/peers (828.174µs)
Sep 30 16:30:22 nomad-sched1 nomad[7917]: memberlist: TCP connection from=54.153.25.24:53298
Sep 30 16:30:22 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:30:22 [DEBUG] memberlist: TCP connection from=54.153.25.24:53298
Sep 30 16:30:24 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:30:24 [DEBUG] memberlist: TCP connection from=52.210.3.117:33451
Sep 30 16:30:24 nomad-sched1 nomad[7917]: memberlist: TCP connection from=52.210.3.117:33451
Sep 30 16:30:30 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:30:30.466843 [DEBUG] http: Request /v1/status/peers (709.511µs)
Sep 30 16:30:30 nomad-sched1 nomad[7917]: http: Request /v1/status/peers (709.511µs)
--
Sep 30 16:32:23 nomad-sched1 nomad[7917]: http: Request /v1/jobs?prefix=block-1377 (642.208µs)
Sep 30 16:32:23 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:32:23.321322 [DEBUG] http: Request /v1/jobs?prefix=block-1377 (642.208µs)
Sep 30 16:32:30 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:32:30.504188 [DEBUG] http: Request /v1/status/peers (546.261µs)
Sep 30 16:32:30 nomad-sched1 nomad[7917]: http: Request /v1/status/peers (546.261µs)
Sep 30 16:32:32 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:32:32 [DEBUG] memberlist: TCP connection from=54.153.25.24:53354
Sep 30 16:32:32 nomad-sched1 nomad[7917]: memberlist: TCP connection from=54.153.25.24:53354
Sep 30 16:32:40 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:32:40.508130 [DEBUG] http: Request /v1/status/peers (824.243µs)
Sep 30 16:32:40 nomad-sched1 nomad[7917]: http: Request /v1/status/peers (824.243µs)
Sep 30 16:32:42 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:32:42 [DEBUG] memberlist: TCP connection from=54.153.25.24:53358
Sep 30 16:32:42 nomad-sched1 nomad[7917]: memberlist: TCP connection from=54.153.25.24:53358
Sep 30 16:32:43 nomad-sched1 nomad[7917]: memberlist: Initiating push/pull sync with: 52.53.214.137:4648
Sep 30 16:32:43 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:32:43 [DEBUG] memberlist: Initiating push/pull sync with: 52.53.214.137:4648
Sep 30 16:32:48 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:32:48.257714 [DEBUG] http: Request /v1/jobs?prefix=block-1377 (587.943µs)
Sep 30 16:32:48 nomad-sched1 nomad[7917]: http: Request /v1/jobs?prefix=block-1377 (587.943µs)
Sep 30 16:32:50 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:32:50.509921 [DEBUG] http: Request /v1/status/peers (805.596µs)
Sep 30 16:32:50 nomad-sched1 nomad[7917]: http: Request /v1/status/peers (805.596µs)
Sep 30 16:32:52 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:32:52 [DEBUG] memberlist: TCP connection from=54.153.25.24:53363
Sep 30 16:32:52 nomad-sched1 nomad[7917]: memberlist: TCP connection from=54.153.25.24:53363
Sep 30 16:33:00 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:33:00.511687 [DEBUG] http: Request /v1/status/peers (554.613µs)
Sep 30 16:33:00 nomad-sched1 nomad[7917]: http: Request /v1/status/peers (554.613µs)
Sep 30 16:33:02 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:33:02 [DEBUG] memberlist: TCP connection from=54.153.25.24:53366
Sep 30 16:33:02 nomad-sched1 nomad[7917]: memberlist: TCP connection from=54.153.25.24:53366
Sep 30 16:33:10 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:33:10.517381 [DEBUG] http: Request /v1/status/peers (865.756µs)
Sep 30 16:33:10 nomad-sched1 nomad[7917]: http: Request /v1/status/peers (865.756µs)
--
Sep 30 16:35:02 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:35:02.498730 [DEBUG] http: Request /v1/job/block-1377?region=aws-iad-1 (82.419862ms)
Sep 30 16:35:02 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-iad-1 (82.419862ms)
Sep 30 16:35:02 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:35:02.514238 [DEBUG] http: Request /v1/job/block-1377?region=aws-sjc-1 (10.711153ms)
Sep 30 16:35:02 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-sjc-1 (10.711153ms)
Sep 30 16:35:02 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-dub-1 (163.844039ms)
Sep 30 16:35:02 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:35:02.682368 [DEBUG] http: Request /v1/job/block-1377?region=aws-dub-1 (163.844039ms)
Sep 30 16:35:02 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-fra-1 (167.805091ms)
Sep 30 16:35:02 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:35:02.855205 [DEBUG] http: Request /v1/job/block-1377?region=aws-fra-1 (167.805091ms)
Sep 30 16:35:02 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-hnd-1 (124.516265ms)
Sep 30 16:35:02 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:35:02.984734 [DEBUG] http: Request /v1/job/block-1377?region=aws-hnd-1 (124.516265ms)
Sep 30 16:35:02 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:35:02 [DEBUG] memberlist: TCP connection from=54.153.25.24:53433
Sep 30 16:35:02 nomad-sched1 nomad[7917]: memberlist: TCP connection from=54.153.25.24:53433
Sep 30 16:35:03 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-gru-1 (401.086005ms)
Sep 30 16:35:03 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:35:03.390524 [DEBUG] http: Request /v1/job/block-1377?region=aws-gru-1 (401.086005ms)
Sep 30 16:35:10 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:35:10.544413 [DEBUG] http: Request /v1/status/peers (639.484µs)
Sep 30 16:35:10 nomad-sched1 nomad[7917]: http: Request /v1/status/peers (639.484µs)
Sep 30 16:35:11 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-iad-1 (168.885863ms)
Sep 30 16:35:11 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:35:11.255354 [DEBUG] http: Request /v1/job/block-1377?region=aws-iad-1 (168.885863ms)
Sep 30 16:35:11 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:35:11.272016 [DEBUG] worker: dequeued evaluation f601717c-a75e-d1fa-d628-3939aa775fb7
Sep 30 16:35:11 nomad-sched1 nomad[7917]: worker: dequeued evaluation f601717c-a75e-d1fa-d628-3939aa775fb7
Sep 30 16:35:11 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-sjc-1 (10.324493ms)
Sep 30 16:35:11 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:35:11.272175 [DEBUG] http: Request /v1/job/block-1377?region=aws-sjc-1 (10.324493ms)
Sep 30 16:35:11 nomad-sched1 nomad[7917]: sched: <Eval 'f601717c-a75e-d1fa-d628-3939aa775fb7' JobID: 'block-1377'>: allocs: (place 0) (update 0) (migrate 0) (stop 1) (ignore 0) (lost 0)
Sep 30 16:35:11 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:35:11.372482 [DEBUG] sched: <Eval 'f601717c-a75e-d1fa-d628-3939aa775fb7' JobID: 'block-1377'>: allocs: (place 0) (update 0) (migrate 0) (stop 1) (ignore 0) (lost 0)
Sep 30 16:35:11 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:35:11.379207 [DEBUG] worker: submitted plan for evaluation f601717c-a75e-d1fa-d628-3939aa775fb7
Sep 30 16:35:11 nomad-sched1 nomad[7917]: worker: submitted plan for evaluation f601717c-a75e-d1fa-d628-3939aa775fb7
Sep 30 16:35:11 nomad-sched1 nomad[7917]: sched: <Eval 'f601717c-a75e-d1fa-d628-3939aa775fb7' JobID: 'block-1377'>: setting status to complete
Sep 30 16:35:11 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:35:11.379273 [DEBUG] sched: <Eval 'f601717c-a75e-d1fa-d628-3939aa775fb7' JobID: 'block-1377'>: setting status to complete
Sep 30 16:35:11 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:35:11.384522 [DEBUG] worker: updated evaluation <Eval 'f601717c-a75e-d1fa-d628-3939aa775fb7' JobID: 'block-1377'>
Sep 30 16:35:11 nomad-sched1 nomad[7917]: worker: updated evaluation <Eval 'f601717c-a75e-d1fa-d628-3939aa775fb7' JobID: 'block-1377'>
Sep 30 16:35:11 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:35:11.385040 [DEBUG] worker: ack for evaluation f601717c-a75e-d1fa-d628-3939aa775fb7
Sep 30 16:35:11 nomad-sched1 nomad[7917]: worker: ack for evaluation f601717c-a75e-d1fa-d628-3939aa775fb7
Sep 30 16:35:11 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:35:11.434930 [DEBUG] http: Request /v1/job/block-1377?region=aws-dub-1 (156.806916ms)
Sep 30 16:35:11 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-dub-1 (156.806916ms)
Sep 30 16:35:11 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-fra-1 (171.396837ms)
Sep 30 16:35:11 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:35:11.612535 [DEBUG] http: Request /v1/job/block-1377?region=aws-fra-1 (171.396837ms)
Sep 30 16:35:11 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-hnd-1 (123.235384ms)
Sep 30 16:35:11 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:35:11.742420 [DEBUG] http: Request /v1/job/block-1377?region=aws-hnd-1 (123.235384ms)
Sep 30 16:35:11 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1377?region=aws-gru-1 (201.863125ms)
Sep 30 16:35:11 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:35:11.950616 [DEBUG] http: Request /v1/job/block-1377?region=aws-gru-1 (201.863125ms)
Sep 30 16:35:12 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:35:12 [DEBUG] memberlist: TCP connection from=54.153.25.24:53438
Sep 30 16:35:12 nomad-sched1 nomad[7917]: memberlist: TCP connection from=54.153.25.24:53438
Sep 30 16:35:20 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:35:20.546219 [DEBUG] http: Request /v1/status/peers (563.724µs)
Sep 30 16:35:20 nomad-sched1 nomad[7917]: http: Request /v1/status/peers (563.724µs)
Sep 30 16:35:22 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:35:22 [DEBUG] memberlist: TCP connection from=54.153.25.24:53441
Sep 30 16:35:22 nomad-sched1 nomad[7917]: memberlist: TCP connection from=54.153.25.24:53441
Sep 30 16:35:30 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:35:30.548164 [DEBUG] http: Request /v1/status/peers (682.291µs)
Sep 30 16:35:30 nomad-sched1 nomad[7917]: http: Request /v1/status/peers (682.291µs)
Sep 30 16:35:33 nomad-sched1 nomad[7917]: memberlist: TCP connection from=54.153.25.24:53445
Sep 30 16:35:33 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 16:35:33 [DEBUG] memberlist: TCP connection from=54.153.25.24:53445
--
Sep 30 17:18:26 nomad-sched1 nomad[7917]: http: Request /v1/jobs?prefix=block-1377 (636.73µs)
Sep 30 17:18:26 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 17:18:26.751581 [DEBUG] http: Request /v1/jobs?prefix=block-1377 (636.73µs)
Sep 30 17:18:29 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1383?region=aws-iad-1 (93.550792ms)
Sep 30 17:18:29 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 17:18:29.886239 [DEBUG] http: Request /v1/job/block-1383?region=aws-iad-1 (93.550792ms)
Sep 30 17:18:29 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 17:18:29.913957 [DEBUG] http: Request /v1/job/block-1383?region=aws-sjc-1 (20.581057ms)
Sep 30 17:18:29 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1383?region=aws-sjc-1 (20.581057ms)
Sep 30 17:18:30 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1383?region=aws-dub-1 (157.772615ms)
Sep 30 17:18:30 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 17:18:30.077297 [DEBUG] http: Request /v1/job/block-1383?region=aws-dub-1 (157.772615ms)
Sep 30 17:18:30 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1383?region=aws-fra-1 (313.180495ms)
Sep 30 17:18:30 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 17:18:30.397174 [DEBUG] http: Request /v1/job/block-1383?region=aws-fra-1 (313.180495ms)
Sep 30 17:18:30 nomad-sched1 nomad[7917]: http: Request /v1/job/block-1383?region=aws-hnd-1 (113.223584ms)
Sep 30 17:18:30 nomad-sched1 nomad-sched1.gold.aws-sjc-1.ps.pn:     2016/09/30 17:18:30.517723 [DEBUG] http: Request /v1/job/block-1383?region=aws-hnd-1 (113.223584ms)

jshaw86 commented Oct 21, 2016

@dadgar I wasn't able to get to a CLI in time to run a ps or curl the agent checks, but I can confirm the executor and process are still running, since the transformation I was doing in the request was still getting returned; then after 5 minutes it goes away.

Will try to catch it again when I see it; it's just very hard to react in time. :)

jshaw86 commented Nov 6, 2016

Just an update: we are going to upgrade to Consul 0.7, as there seemed to be some health check fixes in there, and monitor to see if that fixes the issue. I'm not 100% sure this will do it, as some of the Nomad jobs were still running after being told to stop.

dadgar added this to the v0.6.0 milestone Nov 9, 2016
dadgar removed this from the v0.5.0 milestone Nov 9, 2016
dadgar commented Nov 9, 2016

I changed the milestone not to indicate that we won't continue to work on this, but because we have larger plans to refactor the Consul code that should make it more robust. That is a larger piece of work, though, so I want the milestone to reflect it.

jshaw86 commented Nov 10, 2016

@dadgar Does this mean y'all have root-caused this down to Nomad's Consul integration, or is that just speculation?

dadgar commented Nov 10, 2016

Root cause hasn't been determined, but I wouldn't chalk it up to speculation either. The current implementation is complex and has low debuggability: Consul syncing occurs in various processes, and if they ever get out of sync they are prone to flap.

The new design will be centralized, simpler and much easier to debug.

jshaw86 commented Nov 10, 2016

@dadgar Oh, OK. So when we upgrade to 0.7 you wouldn't expect much difference in behavior? Is there anything we should look for in the logs after we upgrade, besides the ps and the v1 health check call you recommended?

dadgar commented Nov 10, 2016

@jshaw86 Yeah, I would try to capture the logs, ps aux, the agent checks, and /v1/health/state/any. It is hard to tell whether the bug is in Nomad or Consul; the difference between the agent and server health views should be telling.
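
Something along these lines could capture all of that in one go the next time the orphan shows up (the output file names and the syslog path are assumptions):

ts=$(date +%Y%m%dT%H%M%S)
ps aux | grep [n]omad                             > nomad-ps-$ts.txt
curl -s http://localhost:8500/v1/agent/checks     > consul-agent-checks-$ts.json
curl -s http://localhost:8500/v1/health/state/any > consul-health-state-$ts.json
sudo tail -n 1000 /var/log/syslog                 > syslog-tail-$ts.txt     # path assumes syslog-based logging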

jshaw86 commented Jan 25, 2017

@dadgar Once this is delivered in 0.6.0, is the idea that it will be fixed, or that we will be able to gather meaningful logs to then address the issue?

I noticed some changelog entries on the Consul end that may have helped stability as well, so were you thinking that it could be a Consul issue, or a combination of Consul and Nomad?

dadgar commented Jan 28, 2017

@jshaw86 I am pretty sure it is a Nomad issue, but as mentioned, the lack of debuggability makes it difficult to be certain. It is likely that 0.5.4/0.5.5 will bring the fix, and it will be much easier to provide useful logs since the Nomad client itself will be managing registration/deregistration rather than the executor.

jshaw86 commented Feb 1, 2017

Just an update: I created a generic Nomad service with a count of 1000 and an HTTP health check, on Consul 0.7.2 with Nomad 0.4.2, and am seeing the same connection-refused orphans in syslog:

Feb  1 22:20:28 nomad-agent4 consul[9796]: agent: http request failed 'http://172.16.103.248:26127/ping': Get http://172.16.103.248:26127/ping: dial tcp 172.16.103.248:26127: getsockopt: connection refused
Feb  1 22:20:28 nomad-agent4 consul[9796]: agent: http request failed 'http://172.16.103.248:55514/ping': Get http://172.16.103.248:55514/ping: dial tcp 172.16.103.248:55514: getsockopt: connection refused
Feb  1 22:20:28 nomad-agent4 consul[9796]: agent: http request failed 'http://172.16.103.248:22438/ping': Get http://172.16.103.248:22438/ping: dial tcp 172.16.103.248:22438: getsockopt: connection refused
Feb  1 22:20:28 nomad-agent4 consul[9796]: agent: http request failed 'http://172.16.103.248:37401/ping': Get http://172.16.103.248:37401/ping: dial tcp 172.16.103.248:37401: getsockopt: connection refused
Feb  1 22:20:28 nomad-agent4 pn-consul:     2017/02/01 22:20:28 [WARN] agent: http request failed 'http://172.16.103.248:55514/ping': Get http://172.16.103.248:55514/ping: dial tcp 172.16.103.248:55514: getsockopt: connection refused
Feb  1 22:20:28 nomad-agent4 pn-consul:     2017/02/01 22:20:28 [WARN] agent: http request failed 'http://172.16.103.248:22438/ping': Get http://172.16.103.248:22438/ping: dial tcp 172.16.103.248:22438: getsockopt: connection refused
Feb  1 22:20:28 nomad-agent4 pn-consul:     2017/02/01 22:20:28 [WARN] agent: http request failed 'http://172.16.103.248:37401/ping': Get http://172.16.103.248:37401/ping: dial tcp 172.16.103.248:37401: getsockopt: connection refused
Feb  1 22:20:28 nomad-agent4 consul[9796]: agent: http request failed 'http://172.16.103.248:38639/ping': Get http://172.16.103.248:38639/ping: dial tcp 172.16.103.248:38639: getsockopt: connection refused
Feb  1 22:20:28 nomad-agent4 consul[9796]: agent: http request failed 'http://172.16.103.248:32198/ping': Get http://172.16.103.248:32198/ping: dial tcp 172.16.103.248:32198: getsockopt: connection refused
Feb  1 22:20:28 nomad-agent4 consul[9796]: agent: http request failed 'http://172.16.103.248:40691/ping': Get http://172.16.103.248:40691/ping: dial tcp 172.16.103.248:40691: getsockopt: connection refused
Feb  1 22:20:28 nomad-agent4 consul[9796]: agent: http request failed 'http://172.16.103.248:32467/ping': Get http://172.16.103.248:32467/ping: dial tcp 172.16.103.248:32467: getsockopt: connection refused
Feb  1 22:20:28 nomad-agent4 consul[9796]: agent: http request failed 'http://172.16.103.248:39892/ping': Get http://172.16.103.248:39892/ping: dial tcp 172.16.103.248:39892: getsockopt: connection refused
Feb  1 22:20:28 nomad-agent4 consul[9796]: agent: http request failed 'http://172.16.103.248:50814/ping': Get http://172.16.103.248:50814/ping: dial tcp 172.16.103.248:50814: getsockopt: connection refused
Feb  1 22:20:28 nomad-agent4 consul[9796]: agent: http request failed 'http://172.16.103.248:52702/ping': Get http://172.16.103.248:52702/ping: dial tcp 172.16.103.248:52702: getsockopt: connection refused
Feb  1 22:20:28 nomad-agent4 consul[9796]: agent: http request failed 'http://172.16.103.248:40036/ping': Get http://172.16.103.248:40036/ping: dial tcp 172.16.103.248:40036: getsockopt: connection refused
Feb  1 22:20:28 nomad-agent4 pn-consul:     2017/02/01 22:20:28 [WARN] agent: http request failed 'http://172.16.103.248:38639/ping': Get http://172.16.103.248:38639/ping: dial tcp 172.16.103.248:38639: getsockopt: connection refused
Feb  1 22:20:28 nomad-agent4 pn-consul:     2017/02/01 22:20:28 [WARN] agent: http request failed 'http://172.16.103.248:32198/ping': Get http://172.16.103.248:32198/ping: dial tcp 172.16.103.248:32198: getsockopt: connection refused
Feb  1 22:20:28 nomad-agent4 pn-consul:     2017/02/01 22:20:28 [WARN] agent: http request failed 'http://172.16.103.248:40691/ping': Get http://172.16.103.248:40691/ping: dial tcp 172.16.103.248:40691: getsockopt: connection refused
Feb  1 22:20:28 nomad-agent4 pn-consul:     2017/02/01 22:20:28 [WARN] agent: http request failed 'http://172.16.103.248:32467/ping': Get http://172.16.103.248:32467/ping: dial tcp 172.16.103.248:32467: getsockopt: connection refused
Feb  1 22:20:28 nomad-agent4 pn-consul:     2017/02/01 22:20:28 [WARN] agent: http request failed 'http://172.16.103.248:39892/ping': Get http://172.16.103.248:39892/ping: dial tcp 172.16.103.248:39892: getsockopt: connection refused
Feb  1 22:20:28 nomad-agent4 pn-consul:     2017/02/01 22:20:28 [WARN] agent: http request failed 'http://172.16.103.248:50814/ping': Get http://172.16.103.248:50814/ping: dial tcp 172.16.103.248:50814: getsockopt: connection refused
Feb  1 22:20:28 nomad-agent4 pn-consul:     2017/02/01 22:20:28 [WARN] agent: http request failed 'http://172.16.103.248:52702/ping': Get http://172.16.103.248:52702/ping: dial tcp 172.16.103.248:52702: getsockopt: connection refused
Feb  1 22:20:28 nomad-agent4 pn-consul:     2017/02/01 22:20:28 [WARN] agent: http request failed 'http://172.16.103.248:40036/ping': Get http://172.16.103.248:40036/ping: dial tcp 172.16.103.248:40036: getsockopt: connection refused
Feb  1 22:20:28 nomad-agent4 pn-consul:     2017/02/01 22:20:28 [WARN] agent: http request failed 'http://172.16.103.248:21025/ping': Get http://172.16.103.248:21025/ping: dial tcp 172.16.103.248:21025: getsockopt: connection refused

Will repeat with 0.5.3 and hopefully 0.6.0 fixes this.

dadgar commented Feb 1, 2017

@jshaw86 Can you break down the steps:

One Nomad agent? Bring a single job with a health check on a dynamic port up to count 1000, then stop the job, and then there are orphans?

jshaw86 commented Feb 1, 2017

@dadgar

In production it's hundreds of jobs with Count < 3, three Nomad servers, many Nomad agents, federated across six regions. In this particular case it was one job with Count > 1000, a single Nomad server, and four Nomad clients. Both should reproduce the issue.

Steps (a rough script form is sketched below):
1. Provision a bunch of jobs, or one job with a high count
2. nomad stop
3. curl -X PUT http://localhost:4646/v1/system/gc
4. tail syslog, or wherever your Consul logs dump to
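
Script form of those steps, with the job file name, job ID, and log path as placeholders:

nomad run blocks.nomad                                  # 1. provision a job with a high count (blocks.nomad is a placeholder)
nomad stop -yes blocks                                  # 2. stop it
curl -s -X PUT http://localhost:4646/v1/system/gc       # 3. force a GC on a Nomad server
grep 'http request failed' /var/log/syslog | tail       # 4. look for orphaned check failures in the Consul logs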

@schmichael

This should be fixed by #2467 and released in 0.6.0.

Linking to #2478 for tracking.

@schmichael

(Should have closed with my last comment; sorry for the delay)

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Dec 12, 2022