
Vault token renewal fails after some time #4372

Closed
Tethik opened this issue Jun 5, 2018 · 16 comments

Comments

@Tethik

Tethik commented Jun 5, 2018

Nomad version

Nomad v0.8.3 (c85483d)

Operating system and Environment details

AWS Linux

Issue

After running our nomad cluster for a while, the vault token that we give to the nomad server seems to have expired somehow.

The initial token given to Nomad looks like something like this:

Key                 Value
---                 -----
accessor            <snip>
creation_time       1528191049
creation_ttl        604800
display_name        token
entity_id           n/a
expire_time         2018-06-12T09:30:49.488155366Z
explicit_max_ttl    0
id                  <snip>
issue_time          2018-06-05T09:30:49.488154813Z
meta                <nil>
num_uses            0
orphan              true
path                auth/token/create-orphan
policies            [default nomad-server services_production_read services_staging_read]
renewable           true
ttl                 604798

So tokens last for a week.

My last new deployment of nomad servers was on the 30th of May. This should have given them tokens that wouldn't have expired until the 6th of June. However, since yesterday (4th of June) renewal already started failing.

I'm at a bit of a loss as to what's happening and how I should proceed with debugging this.
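For reference, a sketch of how a token matching the lookup output above could be provisioned (policy names are taken from that output; the CLI syntax shown is for Vault 0.10+, while older releases such as our 0.9.3 use hyphenated subcommands like `vault token-create`):

```shell
# Mint a renewable, week-long (168h = 604800s) orphan token for the
# Nomad servers. Policy names come from the lookup output above.
vault token create \
  -orphan \
  -policy="nomad-server" \
  -policy="services_production_read" \
  -policy="services_staging_read" \
  -ttl=168h
```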

Nomad Server logs (if appropriate)

With grep vault

==> Loaded configuration from /opt/nomad/config/autopilot.hcl, /opt/nomad/config/default.hcl, /opt/nomad/config/vault.hcl
    2018/04/30 14:45:14.379337 [INFO] vault: renewing token in 83h59m59.999994714s
    2018/05/04 02:45:14.393687 [INFO] vault: renewing token in 83h59m59.999995108s
    2018/05/07 14:45:14.408189 [INFO] vault: renewing token in 83h59m59.999994716s
    2018/05/11 02:45:14.419571 [INFO] vault: renewing token in 83h59m59.999995657s
    2018/05/14 14:45:14.431115 [INFO] vault: renewing token in 83h59m59.999995184s
    2018/05/18 02:45:14.442407 [INFO] vault: renewing token in 83h59m59.9999957s
    2018/05/21 14:45:14.458012 [INFO] vault: renewing token in 83h59m59.99999604s
    2018/05/25 02:45:14.470003 [INFO] vault: renewing token in 83h59m59.999995782s
    2018/05/28 14:45:14.481601 [INFO] vault: renewing token in 83h59m59.999996189s
    2018/06/01 02:45:14.505720 [INFO] vault: renewing token in 83h59m59.999995401s
    2018/06/04 13:55:30.666020 [WARN] vault: failed to revoke tokens. Will reattempt until TTL: failed to revoke token (alloc: "19821824-5e81-b9ae-4dcb-7c529644d3c5", node: "b10b91be-e22e-3afd-021a-a65ef7e31253", task: ""): Error making API request.
    2018/06/04 14:45:14.520487 [WARN] vault: got error or bad auth, so backing off: Error making API request.
    2018/06/04 14:45:14.520915 [INFO] vault: backing off for 5s
    2018/06/04 14:45:19.536103 [WARN] vault: got error or bad auth, so backing off: Error making API request.
    2018/06/04 14:45:19.536433 [INFO] vault: backing off for 12s
    2018/06/04 14:45:31.550533 [WARN] vault: got error or bad auth, so backing off: Error making API request.
    2018/06/04 14:45:31.550862 [INFO] vault: backing off for 24s
    2018/06/04 14:45:55.565131 [WARN] vault: got error or bad auth, so backing off: Error making API request.

More specifically I see this for every renewal attempt that fails:

URL: PUT https://52.28.132.68:8200/v1/auth/token/renew-self
Code: 403. Errors:

* permission denied
    2018/06/04 14:45:14.520915 [INFO] vault: backing off for 5s
    2018/06/04 14:45:19.536103 [WARN] vault: got error or bad auth, so backing off: Error making API request.

In the Vault logs I also see lines like this:

2018/06/01 14:45:05.390149 [INFO ] expiration: revoked lease: lease_id=auth/token/create-orphan/
@chelseakomlo
Contributor

chelseakomlo commented Jun 5, 2018

See Vault's documentation for renewing tokens: https://www.vaultproject.io/docs/commands/token/renew.html. The Nomad process must be sent a SIGHUP signal to reload its Vault configuration; see https://www.nomadproject.io/docs/agent/configuration/vault.html#vault-configuration-reloads for more information.

Let us know if this solves this issue.
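The reload step, assuming the agent runs as a process named `nomad`, looks like:

```shell
# After replacing the token in vault.hcl, signal the running agent to
# reload its Vault configuration (process name "nomad" assumed).
kill -SIGHUP "$(pidof nomad)"
```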

@jippi
Contributor

jippi commented Jun 5, 2018

@chelseakomlo so it's expected that the nomad operator SIGHUPs the nomad servers at a specific interval to ensure Nomad keeps a fresh set of Vault tokens? Why won't nomad maintain them automatically?

@Tethik
Author

Tethik commented Jun 5, 2018

Thanks for the replies.

I would assume the nomad server is already renewing the tokens, given the earlier log statements, i.e.

2018/04/30 14:45:14.379337 [INFO] vault: renewing token in 83h59m59.999994714s

Having only glanced at the code, this seems to be doing a renewal. The Vault integration guide also says that the "Nomad servers will renew the token automatically". https://www.nomadproject.io/docs/vault-integration/index.html

Given the log statements, it looks like for a few days it did successfully renew the token, but then yesterday it failed for some reason.

@Tethik
Author

Tethik commented Jun 5, 2018

I think I tried manually creating a new token the same way I provision it and sending the SIGHUP to reload, but it did not seem to work. I'll try it next time though (which if this happens again should be next week 🎉). Hopefully it can be a workaround until this is resolved.

@preetapan
Contributor

preetapan commented Jun 5, 2018

@jippi and @Tethik - Nomad servers will renew the token you provided when roughly half of the remaining TTL has elapsed. However, it's possible that the token was revoked entirely in Vault, or that the operator wants to replace it with another one, which is why we pointed to the SIGHUP docs above in case you want to change or update the token given to Nomad. Sorry for any confusion.

@Tethik - From your logs above, it's not clear to me what changed upstream in Vault; unfortunately, the error message on Nomad's side, "Error making API request", is not that useful without additional context. Is there anything else in the Vault logs other than what you already shared? One thing I've seen before is that Vault error messages sometimes span multiple lines, so if you grep for vault with --after 10 you may be able to see more after the "Error making API request" part.
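The half-TTL renewal cadence and failure backoff visible in the logs above can be sketched as follows (illustrative only, not Nomad's actual source; the backoff factor and cap are assumptions, since the logged 5 s / 12 s / 24 s delays suggest roughly doubling with jitter):

```python
# Sketch of the renewal schedule implied by the logs (not Nomad's
# actual source): renew once half of the token's TTL has elapsed,
# and back off exponentially after a failed renewal.

def renewal_delay(ttl_seconds):
    """Seconds to wait before renewing: half of the lease TTL."""
    return ttl_seconds / 2

def backoff(initial=5.0, factor=2.0, cap=30.0):
    """Yield capped, exponentially growing retry delays after failures."""
    delay = initial
    while True:
        yield min(delay, cap)
        delay *= factor

# A 7-day token (604800 s) is renewed every 84 h, matching the
# "renewing token in 83h59m59..." log lines above.
print(renewal_delay(604800) / 3600)  # → 84.0
```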

@chelseakomlo
Contributor

Apologies for the confusion. As Preetha mentioned above, Nomad servers will maintain tokens on the fly, but if a token has been revoked in Vault, the token can be updated for the Nomad agent via SIGHUP.

More information would be helpful for us to diagnose this issue. Can you provide the following:

@Tethik
Author

Tethik commented Jun 6, 2018

Here are some more verbose logs w/ grep vault --after 10 --before 10 from one of the failing servers.
https://gist.github.com/Tethik/13eb0c3cba2e53f642b8d70651aeabaa

The first failure shows at 2018/06/04 13:55:30.666020, when nomad tries to revoke an old token. I'm thinking that, like you said @preetapan, it could also be an issue in Vault where the token might have been revoked. I'm pretty certain that nobody manually revoked the token.

The vault token lookup output I already included above (from a fresh token created the same way):

Key                 Value
---                 -----
accessor            <snip>
creation_time       1528191049
creation_ttl        604800
display_name        token
entity_id           n/a
expire_time         2018-06-12T09:30:49.488155366Z
explicit_max_ttl    0
id                  <snip>
issue_time          2018-06-05T09:30:49.488154813Z
meta                <nil>
num_uses            0
orphan              true
path                auth/token/create-orphan
policies            [default nomad-server services_production_read services_staging_read]
renewable           true
ttl                 604798

Our vault version: v0.9.3 ('5acd6a21d5a69ab49d0f7c0bf540123a9b2c696d')

@chelseakomlo
Contributor

Thanks for including this information. Can you also include the Nomad agent's Vault policy? https://www.nomadproject.io/docs/vault-integration/index.html#required-vault-policies

@Tethik
Author

Tethik commented Jun 6, 2018

Here's the nomad-server policy.

# Allow creating tokens under "nomad-cluster" token role. The token role name
# should be updated if "nomad-cluster" is not used.
path "auth/token/create/nomad-cluster" {
  capabilities = ["update"]
}

# Allow looking up "nomad-cluster" token role. The token role name should be
# updated if "nomad-cluster" is not used.
path "auth/token/roles/nomad-cluster" {
  capabilities = ["read"]
}

# Allow creating orphan tokens
path "auth/token/create-orphan" {
  capabilities = ["create", "update"]
}

# Allow looking up the token passed to Nomad to validate # the token has the
# proper capabilities. This is provided by the "default" policy.
path "auth/token/lookup-self" {
  capabilities = ["read"]
}

# Allow looking up incoming tokens to validate they have permissions to access
# the tokens they are requesting. This is only required if
# `allow_unauthenticated` is set to false.
path "auth/token/lookup" {
  capabilities = ["update"]
}

# Allow revoking tokens that should no longer exist. This allows revoking
# tokens for dead tasks.
path "auth/token/revoke-accessor" {
  capabilities = ["update"]
}

# Allow checking the capabilities of our own token. This is used to validate the
# token upon startup.
path "sys/capabilities-self" {
  capabilities = ["update"]
}

# Allow our own token to be renewed.
path "auth/token/renew-self" {
  capabilities = ["update"]
}
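Assuming the policy is saved as `nomad-server-policy.hcl` (filename hypothetical), it was loaded along these lines (Vault 0.10+ syntax; 0.9.x uses `vault policy-write`):

```shell
# Register the policy under the name the server token is created with.
vault policy write nomad-server nomad-server-policy.hcl
```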

I noticed this difference from the documentation; I'm not sure why I did this. Although this change in v0.8.1(?) might be related: #3992

# Allow creating orphan tokens
path "auth/token/create-orphan" {
  capabilities = ["create", "update"]
}

The auth/token/roles/nomad-cluster role looks as follows.

Key                    Value
---                    -----
allowed_policies       [services_production_read services_staging_read]
disallowed_policies    [nomad-server]
explicit_max_ttl       0
name                   nomad-cluster
orphan                 true
path_suffix            n/a
period                 259200
renewable              true

@c4milo
Contributor

c4milo commented Jun 6, 2018

@chelseakomlo @preetapan FWIW, I noticed Nomad 0.8.3 does not update its Vault token from HCL configuration upon being sent a SIGHUP signal when reloading from systemd. I had to restart the service to get it to pick up a newly set Vault token. I can create a new issue if you think that behavior is unexpected and different from what is being reported here.

@chelseakomlo
Contributor

@c4milo thanks for notifying us about this issue. If you could open a new ticket with a description of the steps to reproduce and the Nomad agent configuration/relevant logs, that would be helpful.

@chelseakomlo
Contributor

@Tethik I tried reproducing this issue with a token with a period set to 1 minute, but was unable to. There isn't a code path for Nomad agents to revoke their own tokens, so this token must have been revoked out of band.

If you turn on Vault audit logs, this should give a better idea of the token's lifecycle. I'm going to close this issue for now, but feel free to reopen with further Vault logs/audit logs that seem abnormal.
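For reference, file-based audit logging can be enabled along these lines (the log path is an example; Vault 0.9.x uses `vault audit-enable` instead):

```shell
# Every request/response, including token renewals and revocations,
# gets appended (with HMAC'd secrets) to this file.
vault audit enable file file_path=/var/log/vault_audit.log
```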

@Tethik
Author

Tethik commented Jun 6, 2018

Thanks @chelseakomlo for taking the time to debug this. I appreciate it. I'll try out the suggestion of using audit logs.

@c4milo
Contributor

c4milo commented Jun 6, 2018

@chelseakomlo, my specific issue may be difficult to hit in a real environment, since Nomad servers are going to renew the token just fine if needed, as you already know.

The way I'm able to reproduce my particular issue is with a local Vagrant environment: closing my laptop and opening it up X time later 😬. I then see Nomad complaining about being unable to access Vault to renew the token. Next, I run an Ansible playbook to get a new token from Vault and place it in the Nomad servers' config directory, within an HCL file. At the end of the playbook, a systemd reload is issued, which sends a SIGHUP signal, but Nomad does not pick up the new token. It seems to be an edge case, unlikely to happen in real environments; do you still want me to report it?
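The systemd reload path relies on the unit file's `ExecReload` directive; a typical fragment (binary and config paths hypothetical) would be:

```ini
# `systemctl reload nomad` runs ExecReload, which sends SIGHUP to the agent.
[Service]
ExecStart=/usr/local/bin/nomad agent -config=/etc/nomad.d
ExecReload=/bin/kill -HUP $MAINPID
```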

@chelseakomlo
Contributor

@c4milo FYI, we've reproduced the issue with not being able to reload Nomad's Vault configuration via SIGHUP, a fix for this will be included in the next Nomad release (0.8.4).

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 29, 2022