
Not reaching vault makes nomad re-schedule all allocations #2689

Closed
jorgemarey opened this issue Jun 5, 2017 · 10 comments

Comments

@jorgemarey
Contributor

Hi,

I noticed that sometimes all the allocations placed on my nomad cluster are re-scheduled simultaneously. This happens very rarely, but it does happen. I suspect it is related to the Vault integration.

The setup is the following:

3 instances with nomad server and consul agent
3 instances with consul server and nomad client
a bunch of other instances with consul agent and nomad client.

I'm running consul 0.7.5 and nomad 0.5.6.

On another instance I have vault (just one server), and I have nomad configured to use that vault.

The log output when this happens:

logs-client1.txt (client with some allocations)
logs-sched1.txt
logs-sched2.txt
logs-sched3.txt (leader)

This happens when vault is not reachable, maybe due to some kind of network issue. Then nomad starts rescheduling all allocations (all of them were running fine).

I'm able to reproduce this behaviour just by setting a security group on the vault machine denying access to the port where vault is running.

I don't know if this behaviour is expected, but it doesn't feel right that a brief failure to reach vault should make nomad start moving allocations around.

@jorgemarey jorgemarey changed the title Not reaching vault makes nomad reeschedule all allocations Not reaching vault makes nomad re-schedule all allocations Jun 5, 2017
@dadgar
Contributor

dadgar commented Jun 23, 2017

Hey @jorgemarey,

Looking at the logs it appears that the allocations are being restarted because the nodes are unreachable by the Nomad servers (we use heartbeating to detect if clients are alive). You can see this in the server logs:

Jun  5 07:08:55 scheduler-3 nomad: 2017/06/05 07:08:55.432742 [WARN] nomad.heartbeat: node 'b2a16c9b-984e-4d06-8a7b-299cc811f219' TTL expired
Jun  5 07:08:55 scheduler-3 nomad: 2017/06/05 07:08:55.441002 [DEBUG] nomad.client: revoking 1 accessors on node "b2a16c9b-984e-4d06-8a7b-299cc811f219" due to down state
Jun  5 07:08:55 scheduler-3 nomad: 2017/06/05 07:08:55.694015 [WARN] nomad.heartbeat: node '3864e028-574f-42d2-a45e-6cd5c00aa7b5' TTL expired
Jun  5 07:08:57 scheduler-3 nomad: 2017/06/05 07:08:57.358889 [WARN] nomad.heartbeat: node '1a3f9d9e-8359-477e-82d9-8600f27c939c' TTL expired
Jun  5 07:08:57 scheduler-3 nomad: 2017/06/05 07:08:57.368177 [DEBUG] nomad.client: revoking 1 accessors on node "1a3f9d9e-8359-477e-82d9-8600f27c939c" due to down state
Jun  5 07:08:57 scheduler-3 nomad: 2017/06/05 07:08:57.369361 [DEBUG] http: Request /v1/jobs?prefix=testjob (111.632µs)
Jun  5 07:08:57 scheduler-3 nomad: 2017/06/05 07:08:57 [DEBUG] memberlist: TCP connection from=127.0.0.1:34338
Jun  5 07:08:57 scheduler-3 nomad: 2017/06/05 07:08:57.982567 [WARN] nomad.heartbeat: node 'e6fc28c0-ee16-4114-b7bd-2dfb949b0138' TTL expired
Jun  5 07:08:57 scheduler-3 nomad: 2017/06/05 07:08:57.995397 [DEBUG] worker: dequeued evaluation 1af847f3-8a68-f5db-966c-fc9aeafadfb0
Jun  5 07:08:57 scheduler-3 nomad: 2017/06/05 07:08:57.995499 [DEBUG] sched: <Eval '1af847f3-8a68-f5db-966c-fc9aeafadfb0' JobID: 'testjob'>: allocs: (place 10) (update 0) (migrate 0) (stop 0) (ignore 0) (lost 0)
Jun  5 07:08:57 scheduler-3 nomad: 2017/06/05 07:08:57.995542 [DEBUG] sched: <Eval '1af847f3-8a68-f5db-966c-fc9aeafadfb0' JobID: 'testjob'>: setting status to complete
Jun  5 07:08:58 scheduler-3 nomad: 2017/06/05 07:08:58.000676 [DEBUG] worker: updated evaluation <Eval '1af847f3-8a68-f5db-966c-fc9aeafadfb0' JobID: 'testjob'>
Jun  5 07:08:58 scheduler-3 nomad: 2017/06/05 07:08:58.000706 [DEBUG] worker: ack for evaluation 1af847f3-8a68-f5db-966c-fc9aeafadfb0
Jun  5 07:08:58 scheduler-3 nomad: 2017/06/05 07:08:58.436426 [DEBUG] http: Request /v1/job/testjob/evaluations (380.341µs)
Jun  5 07:08:59 scheduler-3 nomad: 2017/06/05 07:08:59.109917 [WARN] nomad.heartbeat: node '7e9fa8b2-a6aa-4b36-94df-b66e3562f113' TTL expired
Jun  5 07:09:00 scheduler-3 nomad: 2017/06/05 07:09:00.554382 [WARN] nomad.heartbeat: node 'c5cb4d3e-6057-4a76-a259-49d220d71fb4' TTL expired
Jun  5 07:09:00 scheduler-3 nomad: 2017/06/05 07:09:00.617842 [WARN] nomad.heartbeat: node 'c559a6c8-a78c-49d8-a437-2ef07d4d046a' TTL expired
Jun  5 07:09:01 scheduler-3 nomad: 2017/06/05 07:09:01.114910 [DEBUG] http: Request /v1/status/peers (54.547µs)
Jun  5 07:09:01 scheduler-3 nomad: 2017/06/05 07:09:01.526295 [DEBUG] http: Request /v1/job/testjob (228.738µs)
Jun  5 07:09:02 scheduler-3 nomad: 2017/06/05 07:09:02.588187 [DEBUG] http: Request /v1/job/testjob/summary (95.34µs)
Jun  5 07:09:02 scheduler-3 nomad: 2017/06/05 07:09:02.628075 [WARN] nomad.heartbeat: node 'ac9174e2-8241-463d-8002-ce73085997b3' TTL expired
Jun  5 07:09:05 scheduler-3 nomad: 2017/06/05 07:09:05.393218 [WARN] nomad.heartbeat: node '0ede4a3a-c363-40e0-8072-fa66126df861' TTL expired
Jun  5 07:09:05 scheduler-3 nomad: 2017/06/05 07:09:05.403108 [DEBUG] worker: dequeued evaluation daa97545-30be-c8d3-f2c0-d298db1f8f8d
Jun  5 07:09:05 scheduler-3 nomad: 2017/06/05 07:09:05.403181 [DEBUG] sched: <Eval 'daa97545-30be-c8d3-f2c0-d298db1f8f8d' JobID: 'testjob'>: allocs: (place 0) (update 0) (migrate 0) (stop 0) (ignore 0) (lost 0)
Jun  5 07:09:05 scheduler-3 nomad: 2017/06/05 07:09:05.403191 [DEBUG] sched: <Eval 'daa97545-30be-c8d3-f2c0-d298db1f8f8d' JobID: 'testjob'>: setting status to complete
Jun  5 07:09:05 scheduler-3 nomad: 2017/06/05 07:09:05.407050 [DEBUG] worker: updated evaluation <Eval 'daa97545-30be-c8d3-f2c0-d298db1f8f8d' JobID: 'testjob'>
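
For context, the TTL heartbeat mechanism works roughly like the sketch below (an illustrative toy in Go; the type and function names are invented and this is not Nomad's actual code). Each client heartbeat re-arms a per-node timer on the server, and if the timer fires before the next heartbeat the node is marked down and its allocations are rescheduled, which is what produces the "TTL expired" warnings above:

    package sketch

    import (
        "fmt"
        "sync"
        "time"
    )

    // heartbeatTracker is a toy server-side TTL heartbeat tracker.
    type heartbeatTracker struct {
        mu     sync.Mutex
        timers map[string]*time.Timer
    }

    // resetHeartbeat is called whenever a client heartbeats; it pushes that
    // node's deadline out by ttl. If the deadline passes, the node is
    // considered down.
    func (h *heartbeatTracker) resetHeartbeat(nodeID string, ttl time.Duration) {
        h.mu.Lock()
        defer h.mu.Unlock()
        if h.timers == nil {
            h.timers = make(map[string]*time.Timer)
        }
        if t, ok := h.timers[nodeID]; ok {
            t.Reset(ttl) // the client checked in: extend its deadline
            return
        }
        h.timers[nodeID] = time.AfterFunc(ttl, func() {
            // Corresponds to the "nomad.heartbeat: node ... TTL expired"
            // warnings above; at this point the server marks the node down
            // and creates evaluations to replace its allocations.
            fmt.Printf("[WARN] nomad.heartbeat: node '%s' TTL expired\n", nodeID)
        })
    }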

When Vault is down, Nomad will not start rescheduling allocations. What the client will do is keep attempting to renew the token until it expires. After that it will attempt to retrieve a new token, and once it successfully has one it will take action based on the configured vault change_mode.
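
A minimal sketch of that client-side behavior, with invented function names (not Nomad's real implementation): keep renewing on a schedule, treat renewal failures as non-fatal retries, and only once the token has actually expired derive a replacement and apply the configured change_mode:

    package sketch

    import "time"

    // manageVaultToken illustrates the described behavior; everything here is
    // hypothetical and simplified.
    func manageVaultToken(
        renew func() error,            // e.g. POST /v1/auth/token/renew-self
        derive func() (string, error), // ask the Nomad servers for a fresh token
        applyChangeMode func(token string),
        ttl time.Duration,
    ) {
        expiry := time.Now().Add(ttl)
        backoff := 15 * time.Second

        // Renew on a schedule; failures are non-fatal and simply retried,
        // which is what the "Retrying at ..." log lines below show.
        for time.Now().Before(expiry) {
            if err := renew(); err != nil {
                time.Sleep(backoff)
                continue
            }
            expiry = time.Now().Add(ttl) // a successful renewal pushes expiry out
            time.Sleep(ttl / 2)
        }

        // The token expired while Vault was unreachable: fetch a new one and
        // act according to change_mode (restart, signal, or noop).
        for {
            if tok, err := derive(); err == nil {
                applyChangeMode(tok)
                return
            }
            time.Sleep(backoff)
        }
    }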

I just tried this and you get continued renewal attempts as expected. The logs start right after taking Vault down:

    2017/06/23 18:32:43.065439 [DEBUG] client.vault: req.increment: 30, leaseDuration: 30, duration: 17
    2017/06/23 18:32:43.065581 [ERR] client.vault: renewal of lease or token failed due to a non-fatal error. Retrying at 2017-06-23 18:33:00.065431409 +0000 UTC: failed to renew the vault token: Put http://127.0.0.1:8200/v1/auth/token/renew-self: dial tcp 127.0.0.1:8200: getsockopt: connection refused
    2017/06/23 18:32:47.897486 [INFO] fingerprint.vault: Vault is unavailable
    2017/06/23 18:32:48.513301 [DEBUG] client: state changed, updating node.
    2017/06/23 18:32:48.514728 [INFO] client: node registration complete
    2017/06/23 18:32:55.369069 [DEBUG] http: Request /v1/jobs (715.78µs)
    2017/06/23 18:32:56.185014 [DEBUG] http: Request /v1/jobs?prefix=e (353.303µs)
    2017/06/23 18:32:56.186309 [DEBUG] http: Request /v1/job/example (124.032µs)
    2017/06/23 18:32:56.188266 [DEBUG] http: Request /v1/job/example/allocations?all=false (182.794µs)
    2017/06/23 18:32:56.189775 [DEBUG] http: Request /v1/job/example/evaluations (109.272µs)
    2017/06/23 18:32:56.193036 [DEBUG] http: Request /v1/job/example/summary (1.059045ms)
    2017/06/23 18:33:00.068070 [DEBUG] client.vault: req.increment: 30, leaseDuration: 30, duration: 18
    2017/06/23 18:33:00.068146 [ERR] client.vault: renewal of lease or token failed due to a non-fatal error. Retrying at 2017-06-23 18:33:18.068060858 +0000 UTC: failed to renew the vault token: Put http://127.0.0.1:8200/v1/auth/token/renew-self: dial tcp 127.0.0.1:8200: getsockopt: connection refused
    2017/06/23 18:33:18.070571 [DEBUG] client.vault: req.increment: 30, leaseDuration: 30, duration: 13
    2017/06/23 18:33:18.071624 [ERR] client.vault: renewal of lease or token failed due to a non-fatal error. Retrying at 2017-06-23 18:33:31.070563996 +0000 UTC: failed to renew the vault token: Put http://127.0.0.1:8200/v1/auth/token/renew-self: dial tcp 127.0.0.1:8200: getsockopt: connection refused
    2017/06/23 18:33:31.072939 [DEBUG] client.vault: req.increment: 30, leaseDuration: 30, duration: 19
    2017/06/23 18:33:31.073044 [ERR] client.vault: renewal of lease or token failed due to a non-fatal error. Retrying at 2017-06-23 18:33:50.072932144 +0000 UTC: failed to renew the vault token: Put http://127.0.0.1:8200/v1/auth/token/renew-self: dial tcp 127.0.0.1:8200: getsockopt: connection refused
    2017/06/23 18:33:50.074359 [DEBUG] client.vault: req.increment: 30, leaseDuration: 30, duration: 10
    2017/06/23 18:33:50.074402 [ERR] client.vault: renewal of lease or token failed due to a non-fatal error. Retrying at 2017-06-23 18:34:00.074351749 +0000 UTC: failed to renew the vault token: Put http://127.0.0.1:8200/v1/auth/token/renew-self: dial tcp 127.0.0.1:8200: getsockopt: connection refused

So I think something else is going on in your test environment!

@dadgar dadgar closed this as completed Jun 23, 2017
@jorgemarey
Contributor Author

Hi @dadgar, thank you for your answer.

At first I thought that too: that allocations were being restarted due to the nodes being unreachable (as seen in the logs), but it felt weird that all of the nodes had lost contact at the same time. Then I tried to cut the communication between the servers and vault and it happened again.

The behaviour I was expecting is what you explain. I'll try to see if there's something else going on (maybe it's something weird in my environment).

Thanks a lot.

@jorgemarey
Contributor Author

Hi @dadgar,

Sorry to bother you again with this, but I keep seeing this behaviour on the cluster that I have set up.
I'll try to explain my problem a bit better.

As you said, allocations are being restarted because the nodes are unreachable by the Nomad servers. That is what is happening with the allocations. The thing is, I don't know why all nodes are marked unreachable at the same time whenever there's a problem reaching vault.

If I block the connection to vault, all nodes are marked as down by the server. I was able to reproduce this (although not on my laptop with the dev server). Every time I do that, all nodes are marked unreachable within seconds.

Some logs (after blocking the connection with vault).

The nomad server leader:


Jul 17 09:43:39 server nomad: 2017/07/17 09:43:39.924270 [WARN] nomad.heartbeat: node '505abc2c-9f89-40bf-81e8-cd074937ec84' TTL expired
Jul 17 09:43:41 server nomad: 2017/07/17 09:43:41.265589 [WARN] nomad.heartbeat: node '70d5c037-42d1-464d-ab02-fbe141b5e467' TTL expired
Jul 17 09:43:41 server nomad: 2017/07/17 09:43:41.759747 [WARN] nomad.heartbeat: node '92ddafbf-4014-4eed-8abc-ba37771cdbaf' TTL expired
Jul 17 09:43:43 server nomad: 2017/07/17 09:43:43.310631 [WARN] nomad.heartbeat: node '683a0a18-0395-4fc9-9225-5eb30276a247' TTL expired
Jul 17 09:43:45 server nomad: 2017/07/17 09:43:45.752570 [WARN] nomad.heartbeat: node '40625c55-98b6-4e4e-8a45-2b9ac171fdd1' TTL expired
Jul 17 09:43:47 server nomad: 2017/07/17 09:43:47.202172 [WARN] nomad.heartbeat: node '9e3b96c1-733c-4073-9205-f02cbad2658b' TTL expired
Jul 17 09:43:47 server nomad: 2017/07/17 09:43:47.853258 [WARN] nomad.heartbeat: node '283e35b3-d42c-4b38-8864-4cc12c9dfa63' TTL expired
Jul 17 09:43:52 server nomad: 2017/07/17 09:43:52.105220 [WARN] nomad.heartbeat: node 'd0e949f1-b694-427a-a5be-94aa5e4b89f7' TTL expired

//all nodes again after a few seconds

Jul 17 09:44:24 server nomad: 2017/07/17 09:44:24.220079 [WARN] nomad.heartbeat: node 'd0e949f1-b694-427a-a5be-94aa5e4b89f7' TTL expired
Jul 17 09:44:24 server nomad: 2017/07/17 09:44:24.978572 [WARN] nomad.heartbeat: node '92ddafbf-4014-4eed-8abc-ba37771cdbaf' TTL expired
Jul 17 09:44:29 server nomad: 2017/07/17 09:44:29.415200 [WARN] nomad.heartbeat: node '683a0a18-0395-4fc9-9225-5eb30276a247' TTL expired
Jul 17 09:44:30 server nomad: 2017/07/17 09:44:30.516824 [WARN] nomad.heartbeat: node '40625c55-98b6-4e4e-8a45-2b9ac171fdd1' TTL expired
Jul 17 09:44:30 server nomad: 2017/07/17 09:44:30.985788 [WARN] nomad.heartbeat: node '9e3b96c1-733c-4073-9205-f02cbad2658b' TTL expired
Jul 17 09:44:31 server nomad: 2017/07/17 09:44:31.397590 [WARN] nomad.heartbeat: node '70d5c037-42d1-464d-ab02-fbe141b5e467' TTL expired
Jul 17 09:44:35 server nomad: 2017/07/17 09:44:35.569198 [WARN] nomad.heartbeat: node '505abc2c-9f89-40bf-81e8-cd074937ec84' TTL expired
Jul 17 09:44:45 server nomad: 2017/07/17 09:44:45.391661 [WARN] nomad.heartbeat: node '283e35b3-d42c-4b38-8864-4cc12c9dfa63' TTL expired

These messages keep happening until I enable the connection with vault.

Node 283e35b3-d42c-4b38-8864-4cc12c9dfa63:

Jul 17 09:44:01 client-1 nomad: 2017/07/17 09:44:01.764316 [INFO] client: marking allocation 1fa3309f-4cf3-b339-fa56-b37202e662ab for GC
Jul 17 09:44:01 client-1 nomad: 2017/07/17 09:44:01.767196 [INFO] client: marking allocation 1fa3309f-4cf3-b339-fa56-b37202e662ab for GC
Jul 17 09:44:01 client-1 nomad: 2017/07/17 09:44:01.770138 [ERR] client.vault: failed to derive vault tokens: Can't request Vault token for terminal allocation
Jul 17 09:44:01 client-1 nomad: 2017/07/17 09:44:01.770153 [ERR] client.vault: failed to derive token for allocation "1fa3309f-4cf3-b339-fa56-b37202e662ab" and tasks [fabio]: Can't request Vault token for terminal allocation
Jul 17 09:44:01 client-1 nomad: 2017/07/17 09:44:01.770159 [ERR] client: failed to derive Vault token for task fabio on alloc "1fa3309f-4cf3-b339-fa56-b37202e662ab": Can't request Vault token for terminal allocation
Jul 17 09:44:01 client-1 nomad: 2017/07/17 09:44:01.831610 [INFO] client: marking allocation e3792c38-75b3-c4bd-c5c2-33bb50a2cd35 for GC
Jul 17 09:44:01 client-1 nomad: 2017/07/17 09:44:01.832124 [INFO] client: marking allocation e3792c38-75b3-c4bd-c5c2-33bb50a2cd35 for GC
Jul 17 09:44:01 client-1 nomad: 2017/07/17 09:44:01.892061 [INFO] client: node registration complete
Jul 17 09:44:45 client-1 nomad: 2017/07/17 09:44:45.970412 [INFO] client: marking allocation 781b01ce-f173-5ff8-98d6-647ad03c7351 for GC
Jul 17 09:44:46 client-1 nomad: 2017/07/17 09:44:46.747740 [INFO] client: marking allocation 781b01ce-f173-5ff8-98d6-647ad03c7351 for GC

Node 40625c55-98b6-4e4e-8a45-2b9ac171fdd1:

Jul 17 09:44:01 client-2 nomad: 2017/07/17 09:44:01.699343 [INFO] client: marking allocation 75928cd5-8e98-312b-0f45-c4cc78c0dd8f for GC
Jul 17 09:44:01 client-2 nomad: 2017/07/17 09:44:01.701730 [INFO] client: marking allocation 75928cd5-8e98-312b-0f45-c4cc78c0dd8f for GC
Jul 17 09:44:01 client-2 nomad: 2017/07/17 09:44:01.702837 [ERR] client.vault: failed to derive vault tokens: Can't request Vault token for terminal allocation
Jul 17 09:44:01 client-2 nomad: 2017/07/17 09:44:01.702847 [ERR] client.vault: failed to derive token for allocation "75928cd5-8e98-312b-0f45-c4cc78c0dd8f" and tasks [fabio]: Can't request Vault token for terminal allocation
Jul 17 09:44:01 client-2 nomad: 2017/07/17 09:44:01.702867 [ERR] client: failed to derive Vault token for task fabio on alloc "75928cd5-8e98-312b-0f45-c4cc78c0dd8f": Can't request Vault token for terminal allocation
Jul 17 09:44:01 client-2 nomad: 2017/07/17 09:44:01.760801 [INFO] client: marking allocation 0fcadd17-69a6-e9cb-035b-135484ea7f23 for GC
Jul 17 09:44:01 client-2 nomad: 2017/07/17 09:44:01.761265 [INFO] client: marking allocation 0fcadd17-69a6-e9cb-035b-135484ea7f23 for GC
Jul 17 09:44:01 client-2 nomad: 2017/07/17 09:44:01.824326 [INFO] client: node registration complete
Jul 17 09:44:30 client-2 nomad: 2017/07/17 09:44:30.132137 [INFO] client: marking allocation 23ba1c67-5d1e-1fd5-0358-1f8b40be227c for GC
Jul 17 09:44:45 client-2 nomad: 2017/07/17 09:44:45.683035 [INFO] client: marking allocation 23ba1c67-5d1e-1fd5-0358-1f8b40be227c for GC

The rest of the nodes have the same kind of logs.

I found the following odd: [INFO] client: node registration complete on the clients. I don't know if that's normal.

Maybe I have some problem with my configuration, the setup, or something else, but I keep seeing this (nodes unreachable) every time there's a network problem reaching vault.

Can you think of something that could trigger this kind of behaviour?

Thanks!

@dadgar
Contributor

dadgar commented Jul 17, 2017

@jorgemarey I will reopen and try to reproduce post 0.6.0 and if it is happening, I will get a fix for 0.6.1!

@jorgemarey
Contributor Author

Thanks a lot! I'll upgrade when it's released and test this.

@jorgemarey
Contributor Author

Hi @dadgar

I just noticed this again on a cluster with nomad v0.6.2, consul 0.9.2 and vault 0.8.2. We just had a problem with consul (used as storage for vault) and at that time we started seeing all allocations being re-scheduled.

Some logs:

Sep 11 09:29:55 scheduler-2 nomad: 2017/09/11 09:29:55.015945 [WARN] nomad.heartbeat: node 'e7fd0c16-54df-aef9-191a-b5335d5ca3ce' TTL expired
Sep 11 09:29:55 scheduler-2 nomad: 2017/09/11 09:29:55.526508 [WARN] nomad.heartbeat: node '32e86f68-ba90-1a1f-3399-22320cf83b94' TTL expired
Sep 11 09:29:55 scheduler-2 nomad: 2017/09/11 09:29:55.978191 [WARN] nomad.heartbeat: node 'ed4fdd36-9be3-6882-8368-3737c53248c4' TTL expired
Sep 11 09:29:57 scheduler-2 nomad: 2017/09/11 09:29:57.878707 [WARN] nomad.heartbeat: node '211a872b-4e58-dc65-e8b1-aa193321e114' TTL expired
Sep 11 09:29:59 scheduler-2 nomad: 2017/09/11 09:29:59.928270 [WARN] nomad.heartbeat: node 'a4198cb4-82d9-070e-c00e-1ada8057fd19' TTL expired
Sep 11 09:30:00 scheduler-2 nomad: 2017/09/11 09:30:00.550745 [WARN] nomad.heartbeat: node '8b87760b-f517-152d-af57-8a8d66014fe8' TTL expired
Sep 11 09:30:01 scheduler-2 nomad: 2017/09/11 09:30:01.144295 [WARN] nomad.heartbeat: node 'df089a5b-be37-4b1d-d637-ac58e0391dda' TTL expired
Sep 11 09:30:01 scheduler-2 nomad: 2017/09/11 09:30:01.702767 [WARN] nomad.heartbeat: node '0be504bd-e798-9659-9913-f5e7475036ed' TTL expired
Sep 11 09:30:01 scheduler-2 nomad: 2017/09/11 09:30:01.841496 [WARN] nomad.heartbeat: node '6933fc67-e85f-0cf0-1bf9-ccff71acaaaf' TTL expired
Sep 11 09:30:01 scheduler-2 nomad: 2017/09/11 09:30:01.985344 [WARN] nomad.heartbeat: node '78ee629a-723e-f2ec-043b-32342a7657f8' TTL expired
Sep 11 09:30:03 scheduler-2 nomad: 2017/09/11 09:30:03.180718 [WARN] nomad.heartbeat: node '7b511b69-e647-758c-91f0-b9c607180cb3' TTL expired
Sep 11 09:30:04 scheduler-2 nomad: 2017/09/11 09:30:04.340450 [WARN] nomad.heartbeat: node '989d8ec9-74cb-16ba-7f8c-685493f850ca' TTL expired
Sep 11 09:30:04 scheduler-2 nomad: 2017/09/11 09:30:04.493615 [WARN] nomad.heartbeat: node '93b6ab35-1d59-829e-1bc9-e3c783fc3667' TTL expired
Sep 11 09:30:06 scheduler-2 nomad: 2017/09/11 09:30:06.633612 [WARN] nomad.heartbeat: node 'bd5852de-ff47-8367-1fc6-67a59961102a' TTL expired
Sep 11 09:30:08 scheduler-2 nomad: 2017/09/11 09:30:08.263329 [WARN] nomad.heartbeat: node '386603d1-f6bf-3292-6b15-870bf6edcfb0' TTL expired
Sep 11 09:30:11 scheduler-2 nomad: 2017/09/11 09:30:11.090010 [WARN] nomad.heartbeat: node '2b3b9f9a-33a9-fd30-1190-eeec47c00b30' TTL expired
Sep 11 09:30:12 scheduler-2 nomad: 2017/09/11 09:30:12.499259 [WARN] nomad.heartbeat: node 'aef94ab7-ca43-af01-2bde-bde54e28cd6c' TTL expired
Sep 11 09:30:15 scheduler-2 nomad: 2017/09/11 09:30:15.676352 [WARN] vault: failed to revoke tokens. Will reattempt til TTL: failed to revoke token (alloc: "f151f448-da1e-78ea-575a-862a3caa09b3", node: "93b6ab35-1d59-829e-1bc9-e3c783fc3667", task: "fabio"): Error making API request.
Sep 11 09:30:15 scheduler-2 nomad: URL: POST https://vault.service.consul:8200/v1/auth/token/revoke-accessor
Sep 11 09:30:15 scheduler-2 nomad: Code: 400. Errors:
Sep 11 09:30:15 scheduler-2 nomad: * 1 error occurred:
Sep 11 09:30:15 scheduler-2 nomad: * invalid accessor
Sep 11 09:30:18 scheduler-2 nomad: 2017/09/11 09:30:18 [ERR] yamux: keepalive failed: i/o deadline reached
Sep 11 09:30:18 scheduler-2 nomad: 2017/09/11 09:30:18.415639 [ERR] nomad.rpc: multiplex conn accept failed: keepalive timeout

It seems to be the same problem. When our nomad cluster can't reach vault correctly, it marks all nodes as down. I don't know why this happens; maybe it's a problem with our set-up...

@dadgar
Contributor

dadgar commented Sep 11, 2017

@jorgemarey Sorry, I haven't had time to reproduce this yet. Would you mind trying to distill it down to the simplest reproduction steps possible?

@dadgar
Contributor

dadgar commented Sep 12, 2017

@jorgemarey I got this to reproduce! Thanks!

dadgar added a commit that referenced this issue Sep 12, 2017
This PR removes locking around commonly accessed node attributes that do
not need to be locked. The locking could cause nodes to TTL as the
heartbeat code path was acquiring a lock that could be held for an
excessively long time. An example of this is when Vault is inaccessible,
since the fingerprint is run with a lock held but the Vault
fingerprinter makes the API calls with a large timeout.

Fixes #2689
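
To make the failure mode concrete, the pattern described in the commit message looks roughly like the sketch below (invented names, heavily simplified, not the actual Nomad source): the Vault fingerprinter ran with the client's node lock held, and with Vault unreachable its API call could block for a large timeout, so the heartbeat path waiting on the same lock missed its TTL and the servers marked the node down:

    package sketch

    import (
        "sync"
        "time"
    )

    // Node stands in for the client's node record.
    type Node struct {
        Attributes map[string]string
    }

    type client struct {
        nodeLock sync.Mutex
        node     *Node
    }

    // fingerprintVault holds nodeLock for the entire duration of a probe that,
    // with Vault unreachable, blocks until its (large) timeout expires.
    func (c *client) fingerprintVault() {
        c.nodeLock.Lock()
        defer c.nodeLock.Unlock()
        time.Sleep(30 * time.Second) // stand-in for a hanging Vault API call
        c.node.Attributes["vault.accessible"] = "false"
    }

    // heartbeat needs the same lock, so it stalls behind fingerprintVault and
    // the server-side TTL for this node expires, marking it down. The fix was
    // to stop holding this lock around slow fingerprinting work.
    func (c *client) heartbeat() {
        c.nodeLock.Lock()
        defer c.nodeLock.Unlock()
        // ...send the heartbeat RPC to the servers...
    }
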
@jorgemarey
Contributor Author

Awesome! Thanks a lot @dadgar !

dadgar added a commit that referenced this issue Sep 14, 2017
dadgar added a commit that referenced this issue Sep 14, 2017
@github-actions

github-actions bot commented Dec 8, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 8, 2022