Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

client.vault: failed to derive vault tokens #2050

Closed
blalor opened this issue Dec 1, 2016 · 15 comments
Closed

client.vault: failed to derive vault tokens #2050

blalor opened this issue Dec 1, 2016 · 15 comments

Comments

@blalor
Copy link
Contributor

blalor commented Dec 1, 2016

Nomad version

Nomad v0.5.0

Operating system and Environment details

Fully operational Vault cluster running 0.6.2.

Issue

A task allocation that uses the Vault integration occasionally fails to start on a client, with the errors client.vault: failed to derive vault tokens: Allocation "…" does not exist and client: failed to derive Vault token for task agent on alloc "…": Allocation "…" does not exist. The task does not get reschedule elsewhere in the cluster.

Reproduction steps

Unsure.

Nomad Server logs (if appropriate)

2016/12/01 15:55:13.688149 [DEBUG] worker: dequeued evaluation 78564c7a-7b75-698b-8625-73c3b6659fa1
2016/12/01 15:55:13.688246 [DEBUG] sched: <Eval '78564c7a-7b75-698b-8625-73c3b6659fa1' JobID: 'gocd-agents'>: allocs: (place 0) (update 0) (migrate 2) (stop 0) (ignore 1) (lost 0)
2016/12/01 15:55:13.707710 [DEBUG] sched: <Eval '78564c7a-7b75-698b-8625-73c3b6659fa1' JobID: 'gocd-agents'>: setting status to complete
2016/12/01 15:55:13.718315 [DEBUG] worker: updated evaluation <Eval '78564c7a-7b75-698b-8625-73c3b6659fa1' JobID: 'gocd-agents'>
2016/12/01 15:55:13.718358 [DEBUG] worker: ack for evaluation 78564c7a-7b75-698b-8625-73c3b6659fa1

Nomad Client logs (if appropriate)

# nomad alloc-status b11047c8
ID = b11047c8
Eval ID = 5f420faf
Name = gocd-agents.rkt-agents[1]
Node ID = bae50522
Job ID = gocd-agents
Client Status = failed
Client Description = <none>
Created At = 12/01/16 15:55:13 UTC
Task "agent" is "dead"
Task Resources
CPU Memory Disk IOPS Addresses
2400 MHz 1.0 GiB 0 B 0
Recent Events:
Time Type Description
12/01/16 15:55:13 UTC Killing Killing task: vault: failed to derive token: Allocation "b11047c8-ad1c-240a-5958-3d046fe69cdb" does not exist
12/01/16 15:55:13 UTC Received Task received by client
2016/12/01 15:55:13.661514 [DEBUG] client: updated allocations at index 1434 (pulled 1) (filtered 0)
2016/12/01 15:55:13.662682 [DEBUG] client: allocs: (added 1) (removed 0) (updated 0) (ignore 0)
2016/12/01 15:55:13.663480 [DEBUG] client: starting task runners for alloc 'b11047c8-ad1c-240a-5958-3d046fe69cdb'
2016/12/01 15:55:13.665753 [DEBUG] client: starting task context for 'agent' (alloc 'b11047c8-ad1c-240a-5958-3d046fe69cdb')
2016/12/01 15:55:13.665928 [DEBUG] client: waiting for Vault token for task agent in alloc "b11047c8-ad1c-240a-5958-3d046fe69cdb"
2016/12/01 15:55:13.666572 [ERR] client.vault: failed to derive vault tokens: Allocation "b11047c8-ad1c-240a-5958-3d046fe69cdb" does not exist
2016/12/01 15:55:13.666586 [ERR] client: failed to derive Vault token for task agent on alloc "b11047c8-ad1c-240a-5958-3d046fe69cdb": Allocation "b11047c8-ad1c-240a-5958-3d046fe69cdb" does not exist
2016/12/01 15:55:13.666596 [DEBUG] client: killing task agent for alloc "b11047c8-ad1c-240a-5958-3d046fe69cdb": vault: failed to derive token: Allocation "b11047c8-ad1c-240a-5958-3d046fe69cdb" does not exist

@jippi @diptanu

@jippi
Copy link
Contributor

jippi commented Dec 1, 2016

I got a a gist with same issue here

https://gist.github.com/jippi/8b8a10ae2943f58063926611d920651e#file-gistfile1-txt-L26-L30

Alloc ID b739058d-f5fa-a668-a3a1-965c22f7d45c is the failed
Alloc ID 5e82920d-c8f1-b58c-e658-e8c3374c33a4 is the resubmitted alloc for the exact same job

@dadgar
Copy link
Contributor

dadgar commented Dec 1, 2016

Amazing what one boolean can do :) Sorry you all ran into this!

@jippi
Copy link
Contributor

jippi commented Dec 1, 2016

@dadgar i saw this in my server logs

nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:25.256117 [DEBUG] http: Request /v1/allocation/85a65983-5365-979f-a753-24954ad6c6d9?index=190&wait=10000ms (10.00699977s)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:26 [DEBUG] memberlist: TCP connection from=127.0.0.1:29826
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:27.553697 [DEBUG] http: Request /v1/agent/members?wait=60000ms (380.564µs)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:29.582595 [DEBUG] http: Request /v1/status/peers (855.497µs)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:30.576559 [DEBUG] http: Request /v1/status/leader?wait=60000ms (835.412µs)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:31.957122 [DEBUG] http: Request /v1/jobs?index=177&wait=60000ms (10.485035003s)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:31.959884 [DEBUG] http: Request /v1/evaluations?index=178&wait=60000ms (13.585154802s)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:32.070665 [DEBUG] http: Request /v1/allocations?index=191&wait=60000ms (215.785663ms)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:32.084001 [DEBUG] http: Request /v1/evaluations?index=193&wait=60000ms (122.137931ms)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:32.132994 [DEBUG] http: Request /v1/allocations?index=194&wait=60000ms (59.04641ms)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:32.211701 [DEBUG] http: Request /v1/allocation/85a65983-5365-979f-a753-24954ad6c6d9?index=1&wait=60000ms (1.294446ms)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:32.308436 [DEBUG] worker: dequeued evaluation 16d38c9f-afcf-65ef-fc48-5deb1673f18e
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:32.308691 [DEBUG] http: Request /v1/evaluations?index=195&wait=60000ms (222.080217ms)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:32.408943 [DEBUG] sched: <Eval '16d38c9f-afcf-65ef-fc48-5deb1673f18e' JobID: 'insights-worker'>: allocs: (place 9) (update 0) (migrate 0) (stop 0) (ignore 0) (lost 0)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:32.420846 [DEBUG] http: Request /v1/jobs?index=198&wait=60000ms (111.403128ms)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:32.420921 [DEBUG] worker: submitted plan for evaluation 16d38c9f-afcf-65ef-fc48-5deb1673f18e
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:32.421443 [DEBUG] sched: <Eval '16d38c9f-afcf-65ef-fc48-5deb1673f18e' JobID: 'insights-worker'>: setting status to complete
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:32.422136 [DEBUG] http: Request /v1/allocations?index=197&wait=60000ms (124.062728ms)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:32.425240 [DEBUG] worker: updated evaluation <Eval '16d38c9f-afcf-65ef-fc48-5deb1673f18e' JobID: 'insights-worker'>
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:32.426181 [DEBUG] worker: ack for evaluation 16d38c9f-afcf-65ef-fc48-5deb1673f18e
nomad-stdout---supervisor-wbwFty.log:    2016/12/01 15:32:32.426779 [ERR] nomad.client: DeriveVaultToken failed (recoverable false): Allocation "b739058d-f5fa-a668-a3a1-965c22f7d45c" does not exist
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:32.480758 [ERR] nomad.node: Vault token creation failed: failed to create token for task "server": Vault client not active
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:32.480787 [ERR] nomad.node: Vault token revocation failed: Vault client not active
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:32.533341 [DEBUG] http: Request /v1/allocations?index=200&wait=60000ms (107.325842ms)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:32.733223 [DEBUG] http: Request /v1/allocations?index=208&wait=60000ms (44.997123ms)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:32.842071 [DEBUG] http: Request /v1/allocations?index=209&wait=60000ms (106.692464ms)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:36 [DEBUG] memberlist: TCP connection from=127.0.0.1:29860
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:37.481785 [ERR] nomad.node: Vault token creation failed: failed to create token for task "server": Vault client not active
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:37.482272 [ERR] nomad.node: Vault token revocation failed: Vault client not active
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:37.567981 [DEBUG] http: Request /v1/status/leader?wait=60000ms (720.629µs)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:39.587235 [DEBUG] http: Request /v1/status/peers (3.062137ms)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:40.248259 [DEBUG] http: Request /v1/nodes?index=97&wait=60000ms (1m1.399233777s)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:40.579541 [DEBUG] http: Request /v1/agent/members?wait=60000ms (182.329µs)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:45.369737 [DEBUG] http: Request /v1/nodes?index=97&wait=60000ms (1m1.729887962s)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:46 [DEBUG] memberlist: TCP connection from=127.0.0.1:29948
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:49.590262 [DEBUG] http: Request /v1/status/peers (767.4µs)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:51.617006 [DEBUG] http: Request /v1/evaluations?index=201&wait=60000ms (19.185764303s)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:51.760506 [DEBUG] http: Request /v1/evaluations?index=212&wait=60000ms (140.72864ms)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:51.760506 [DEBUG] http: Request /v1/evaluations?index=212&wait=60000ms (141.04109ms)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:51 [DEBUG] memberlist: Initiating push/pull sync with: 195.154.35.151:4648
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:55.477508 [DEBUG] http: Request /v1/allocations?index=219&wait=60000ms (181.799344ms)

@jippi
Copy link
Contributor

jippi commented Dec 1, 2016

@dadgar emphasis on Vault client not active

@dadgar
Copy link
Contributor

dadgar commented Dec 1, 2016

Yeah the PR I just made will address that

@dadgar
Copy link
Contributor

dadgar commented Dec 1, 2016

Subtle bug!

@jippi
Copy link
Contributor

jippi commented Dec 1, 2016

@dadgar good job!

@blalor
Copy link
Contributor Author

blalor commented Dec 1, 2016

To boolean, or ! to boolean, that is the question.

@burdandrei
Copy link
Contributor

@dadgar just received this on 0.7.0
logs from master or client can help?

@dadgar
Copy link
Contributor

dadgar commented Jan 23, 2018

@burdandrei I would grab both

@burdandrei
Copy link
Contributor

Jan 23 17:54:53 ip-10-1-1-197 nomad[6241]: 2018-01-23T17:54:53.052Z [DEBUG] plugin.nomad: plugin address: timestamp=2018-01-23T17:54:53.052Z address=/tmp/plugin053837145 network=unix
Jan 23 17:54:53 ip-10-1-1-197 nomad[6241]:     2018/01/23 17:54:53.243216 [INFO] driver.docker: created container 765b1b7fefa45b801f62f749a383cffba1a814152f39b7f2271343939e3d9af2
Jan 23 17:54:53 ip-10-1-1-197 nomad[6241]:     2018/01/23 17:54:53.349977 [INFO] driver.docker: started container 765b1b7fefa45b801f62f749a383cffba1a814152f39b7f2271343939e3d9af2
Jan 23 17:54:53 ip-10-1-1-197 nomad[6241]:     2018/01/23 17:54:53.361809 [WARN] client: error fetching stats of task telegraf: stats collection hasn't started yet
Jan 23 17:55:51 ip-10-1-1-197 nomad[6241]:     2018/01/23 17:55:51.571444 [ERR] client.vault: failed to derive vault tokens: failed to create token for task "unicorn" on alloc "df8f5fdc-e431-3e85-5e5f-fce74491ac7a": Post https://vault.yotpo.com:8200/v1/auth/token/create/nomad-cluster: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Jan 23 17:55:51 ip-10-1-1-197 nomad[6241]:     2018/01/23 17:55:51.571469 [ERR] client.vault: failed to derive token for allocation "df8f5fdc-e431-3e85-5e5f-fce74491ac7a" and tasks [unicorn]: failed to create token for task "unicorn" on alloc "df8f5fdc-e431-3e85-5e5f-fce74491ac7a": Post https://vault.yotpo.com:8200/v1/auth/token/create/nomad-cluster: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Jan 23 17:55:51 ip-10-1-1-197 nomad[6241]:     2018/01/23 17:55:51.571477 [ERR] client: failed to derive Vault token for task unicorn on alloc "df8f5fdc-e431-3e85-5e5f-fce74491ac7a": failed to create token for task "unicorn" on alloc "df8f5fdc-e431-3e85-5e5f-fce74491ac7a": Post https://vault.yotpo.com:8200/v1/auth/token/create/nomad-cluster: net/http: request canceled (Client.Timeout exceeded while awaiting headers); retrying in 5s
Jan 23 17:56:56 ip-10-1-1-197 nomad[6241]:     2018/01/23 17:56:56.573117 [ERR] client.vault: failed to derive vault tokens: failed to create token for task "unicorn" on alloc "df8f5fdc-e431-3e85-5e5f-fce74491ac7a": Post https://vault.yotpo.com:8200/v1/auth/token/create/nomad-cluster: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Jan 23 17:56:56 ip-10-1-1-197 nomad[6241]:     2018/01/23 17:56:56.573145 [ERR] client.vault: failed to derive token for allocation "df8f5fdc-e431-3e85-5e5f-fce74491ac7a" and tasks [unicorn]: failed to create token for task "unicorn" on alloc "df8f5fdc-e431-3e85-5e5f-fce74491ac7a": Post https://vault.yotpo.com:8200/v1/auth/token/create/nomad-cluster: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Jan 23 17:56:56 ip-10-1-1-197 nomad[6241]:     2018/01/23 17:56:56.573154 [ERR] client: failed to derive Vault token for task unicorn on alloc "df8f5fdc-e431-3e85-5e5f-fce74491ac7a": failed to create token for task "unicorn" on alloc "df8f5fdc-e431-3e85-5e5f-fce74491ac7a": Post https://vault.yotpo.com:8200/v1/auth/token/create/nomad-cluster: net/http: request canceled (Client.Timeout exceeded while awaiting headers); retrying in 20s

this is from client, same on server.

minute before it changed to

Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]:     2018/01/23 18:11:34 [INFO] (runner) initiating run
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]:     2018/01/23 18:11:34.938211 [ERR] client.vault: failed to derive vault tokens: failed to create token for task "unicorn" on alloc "403b5afa-3304-8673-1044-5a37d1723f5a": Connection to Vault failed: 2 error(s) occurred:
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: * Error making API request.
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: URL: POST https://vault.yotpo.com:8200/v1/sys/capabilities-self
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: Code: 503. Errors:
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: * Vault is sealed
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: * failed to lookup role "nomad-cluster": Error making API request.
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: URL: GET https://vault.yotpo.com:8200/v1/auth/token/roles/nomad-cluster
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: Code: 503. Errors:
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: * Vault is sealed
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]:     2018/01/23 18:11:34.938231 [ERR] client.vault: failed to derive token for allocation "403b5afa-3304-8673-1044-5a37d1723f5a" and tasks [unicorn]: failed to create token for task "unicorn" on alloc "403b5afa-3304-8673-1044-5a37d1723f5a": Connection to Vault failed: 2 error(s) occurred:
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: * Error making API request.
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: URL: POST https://vault.yotpo.com:8200/v1/sys/capabilities-self
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: Code: 503. Errors:
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: * Vault is sealed
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: * failed to lookup role "nomad-cluster": Error making API request.
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: URL: GET https://vault.yotpo.com:8200/v1/auth/token/roles/nomad-cluster
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: Code: 503. Errors:
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: * Vault is sealed
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]:     2018/01/23 18:11:34.938238 [ERR] client: failed to derive Vault token for task unicorn on alloc "403b5afa-3304-8673-1044-5a37d1723f5a": failed to create token for task "unicorn" on alloc "403b5afa-3304-8673-1044-5a37d1723f5a": Connection to Vault failed: 2 error(s) occurred:
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: * Error making API request.
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: URL: POST https://vault.yotpo.com:8200/v1/sys/capabilities-self
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: Code: 503. Errors:
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: * Vault is sealed
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: * failed to lookup role "nomad-cluster": Error making API request.
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: URL: GET https://vault.yotpo.com:8200/v1/auth/token/roles/nomad-cluster
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: Code: 503. Errors:
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: * Vault is sealed
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]:     2018/01/23 18:11:34 [INFO] (runner) stopping
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]:     2018/01/23 18:11:34.945649 [INFO] client: marking allocation 403b5afa-3304-8673-1044-5a37d1723f5a for GC
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]:     2018/01/23 18:11:34 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/403b5afa-3304-8673-1044-5a37d1723f5a/telegraf/local/telegraf.conf"
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]:     2018/01/23 18:11:34 [INFO] (runner) received finish

in consul all the cluster is grean and is showing that it's unsealed!
Vault 0.8.3

@burdandrei
Copy link
Contributor

@dadgar you can catch me on Vault gitter now, before i started to restart vaults =)

@dadgar
Copy link
Contributor

dadgar commented Jan 23, 2018

@burdandrei Either send your servers a SIGHUP or restart them. The Vault server was sealed and the Nomad server has given up contacting the Vault. I filled #3786 to improve this behavior.

@burdandrei
Copy link
Contributor

@dadgar it's working now.
I'll check the logs, but during all the accident all 5 vaults were unsealed.
i restarted the leader just in case, new leader was re-elected and all worked great.
Nomad on the other side i think was trying to contact vault with not renewed token.
during the accident I talked to @jippi and he says that they are still seeing the same problem from time to time. i'll try to gather vault audit logs tomorrow and see what we got there.
thanks for super fast responses!

@tgross tgross modified the milestones: near-term, unscheduled Jan 9, 2020
@github-actions
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 14, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants