Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nomad retries indefinitely to revoke non-existant SI token(s) #17833

Closed
seanamos opened this issue Jul 7, 2023 · 2 comments · Fixed by #17847
Closed

Nomad retries indefinitely to revoke non-existant SI token(s) #17833

seanamos opened this issue Jul 7, 2023 · 2 comments · Fixed by #17847
Assignees
Labels
stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/consul type/bug
Milestone

Comments

@seanamos
Copy link

seanamos commented Jul 7, 2023

Nomad version

v1.5.6

Issue

We were doing DR testing in one of our testing environments and did a snapshot restore to a 1 hour old consul snapshot. Since then, nomad has been trying to revoke SI tokens from Consul that don't exist:

# warnings in Nomad

failed to revoke SI token accessor (alloc "8e6140ce-d940-1028-83c8-ba2e306d0678", node "4cc0e923-7d21-1f02-b100-f9baa12dc654", task "xxx"): Unexpected response code: 404 (Cannot find token to delete)

Consul is generating lots of Cannot find token to delete errors as well.

These are the same tokens that are failing to delete (total attempts for each URL over the last 3 days):

Screenshot 2023-07-07 at 09 58 09

Questions:

  1. Nomad should possibly handle the 404 gracefully (token does not exist)?
  2. Is there a way to force nomad to forget about these SI tokens?
@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Jul 7, 2023
@tgross
Copy link
Member

tgross commented Jul 7, 2023

Hi @seanamos! I took a quick look and Nomad is assuming that this endpoint will not return an error (ref consul.go#L419-L422):

// Consul will no-op the deletion of a non-existent token (no error)
_, err := c.aclClient.TokenDelete(accessor.AccessorID, &api.WriteOptions{Namespace: accessor.ConsulNamespace})
return err

Unfortunately it looks like this assumption was broken in a breaking change in Consul 1.15.0:

  • acl errors: Delete and get requests now return descriptive errors when the specified resource cannot be found. Other ACL request errors provide more information about when a resource is missing. Add error for when the ACL system has not been bootstrapped.
    • Delete Token/Policy/AuthMethod/Role/BindingRule endpoints now return 404 when the resource cannot be found.
      • New error formats: "Requested * does not exist: ACL not found", "* not found in namespace $NAMESPACE: ACL not found"

This isn't otherwise documented in Consul's Delete Token API docs. So it looks like an obvious cross-compatibility bug and we'll need to fix that.

Is there a way to force nomad to forget about these SI tokens?

Unfortunately it looks like the accessors are stored in Raft and then whenever a new leader comes on it goes thru those accessors to determine which ones can be revoked. Let me check in with some of my colleagues to see if we can figure out as workaround for you.

@tgross tgross added theme/consul stage/accepted Confirmed, and intend to work on. No timeline committment though. labels Jul 7, 2023
@tgross tgross moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Jul 7, 2023
@tgross tgross self-assigned this Jul 7, 2023
@tgross tgross moved this from Triaging to Needs Roadmapping in Nomad - Community Issues Triage Jul 7, 2023
@tgross tgross removed their assignment Jul 7, 2023
tgross added a commit that referenced this issue Jul 7, 2023
In Consul 1.15.0, the Delete Token API was changed so as to return an error when
deleting a non-existent ACL token. This means that if Nomad successfully deletes
the token but fails to persist that fact, it will get stuck trying to delete a
non-existent token forever.

Update the token deletion function to ignore "not found" errors and treat them
as successful deletions.

Fixes: #17833
@tgross tgross self-assigned this Jul 7, 2023
@tgross tgross moved this from Needs Roadmapping to In Progress in Nomad - Community Issues Triage Jul 7, 2023
@tgross tgross added this to the 1.6.0 milestone Jul 7, 2023
@tgross
Copy link
Member

tgross commented Jul 7, 2023

@seanamos I've got a patch up in #17847 for that. That'll get shipped in the upcoming 1.6.0 but also backported once 1.6.0 is GA (in the next couple weeks). But once that lands in main it'll get backported to the branch release/1.5.x and if you want you can build yourself a Nomad binary that will fix the problem for you locally.

tgross added a commit that referenced this issue Jul 7, 2023
In Consul 1.15.0, the Delete Token API was changed so as to return an error when
deleting a non-existent ACL token. This means that if Nomad successfully deletes
the token but fails to persist that fact, it will get stuck trying to delete a
non-existent token forever.

Update the token deletion function to ignore "not found" errors and treat them
as successful deletions.

Fixes: #17833
Nomad - Community Issues Triage automation moved this from In Progress to Done Jul 7, 2023
tgross added a commit that referenced this issue Jul 7, 2023
…17847)

In Consul 1.15.0, the Delete Token API was changed so as to return an error when
deleting a non-existent ACL token. This means that if Nomad successfully deletes
the token but fails to persist that fact, it will get stuck trying to delete a
non-existent token forever.

Update the token deletion function to ignore "not found" errors and treat them
as successful deletions.

Fixes: #17833
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/consul type/bug
Projects
Development

Successfully merging a pull request may close this issue.

2 participants