Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

identity: add support for multiple identities + audiences #18123

Merged
merged 2 commits into from
Aug 15, 2023
Merged

Conversation

schmichael
Copy link
Member

@schmichael schmichael commented Aug 2, 2023

The Good

Allow for multiple identity{} blocks on tasks with distinct names and identities. The existing identity is referred to internally as the default Nomad identity since it is intended for the workload to identify itself to Nomad's API. Other identities are referred to as alternate identities as they're optional and for use with 3rd party services (Consul, Vault, Traefik, user apps, etc).

Validation does not prevent creating "hidden" identities (identities with both env=false and file=false) because followup PRs are likely to make use of that (just like we already do for the default identity).

The WIDMgr (Workload ID Manager) is currently a light wrapper around the Alloc.GetIdentities RPC, but could someday implement optimizations such as batching requests. Since identity minting is purely CPU bound I'm not sure if that will ever be necessary, but encapsulation for encapsulation's sake is nice sometimes too.

I did choose to duplicate the token's expiration time within the Alloc.GetIdentities RPC's response instead of the approach in the PoC branch #17434 which actually validated the JWTs and plucked out the expirations. Making identity fetching depend on public key fetching seemed needlessly complex, so I chose to just duplicate/de-normalize the expiry.

The way Alloc.GetIdentities uses Stale+WaitIndex is a bit unique: since the JWTs themselves are stateless, the WaitIndex is only used to ensure the server serving the request knows the Alloc exists (otherwise races between Client's reading allocs and servers processing Raft logs could cause spurious failures; not to mention the problem of restoring servers being arbitrarily stale). As long as the RPC can tell the alloc existed, any server can sign a JWT. So despite using Stale+WaitIndex, we don't use a blocking query to wait for some state to change and impact identities. Even when I add expiration support the server can't use a blocking query to hold off clients until they really need a new identity because servers won't record the expiration of tokens.

The Bad

This doesn't implement the Alloc.GetAllocs-includes-JWTs optimization of the proof of concept branch. I plan on moving that over, but the plumbing is ugly and felt like a distraction from these core bits. It is only an optimization and does not change functionality.

The current implementation of Alloc.GetIdentities/WIDMgr will block on contacting servers on agent restart or node reboot. This prevents using alternate identities on disconnected nodes. Using PrivateDir to store the JWTs across restarts seems necessary, but I don't have a tidy solution across reboots.

The Ugly

The HCL/JSON story is a bit awkward:

  1. jobspec2/parse.go has to pluck the default Identity out of the Identities slice to put it back on Identity in case a 1.7 CLI is posting to a 1.6 API.
  2. WorkloadIdentity.Canonicalize has to do the same dance in case 1.6 JSON is sent to a 1.7 API.

But it seems to work! The default identity stays in place, and all of the new identities are in the slice.

Intentionally Delaying

  • Docs
  • Changelog
    Without expiration there's still no way to use this securely, so I'm going to "soft launch" until then.

@schmichael schmichael requested a review from tgross August 2, 2023 00:59
Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking pretty good so far!

@@ -90,12 +90,8 @@ func (tr *TaskRunner) setNomadToken(token string) {
defer tr.nomadTokenLock.Unlock()
tr.nomadToken = token

if id := tr.task.Identity; id != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we miss a bug here by not having the && id.Env conditional previously?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No we just left it up to taskenv.Builder whether or not to include the token in the env by also storing the injectWorkloadToken flag. It's what we do for the Vault token too, and honestly I don't know why.

client/taskenv/env.go Outdated Show resolved Hide resolved
Comment on lines +583 to 585
for name, token := range b.workloadTokens {
envMap[WorkloadToken+"_"+name] = token
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it's not in the RFC except as a future work item, but I'm thinking about how this might support group-scoped identities in the future. If we don't change the name for the scope, I think we end up with a nice way to have task-scope identities override group-scope identities. For example:

group {
  identity {
    name = "foo"
    aud = ["job.example.com"]
  }

  task {
    identity {
      name = "foo"
      aud = ["task.example.com"]
    }
  }

}

But wanted to note that here just in case you're thinking that might not be a great idea. The service-scoped identity blocks I've been writing about in the Consul Workload Identity RFC don't care either way because those won't get exposed directly to tasks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, that would work, but do users need it? My kneejerk reaction is that we should enforce identity names to be unique within a task group. That neatly solves env var name conflicts as well.

However if we do other logic by identity name (eg Consul and Vault) I can see why defaulting-at-group and overriding-in-task might be beneficial if not outright required for supporting all existing Consul and Vault behaviors.

Placement in group vs task defines the identity's scope which is nice and tidy. Overrides don't change those semantics but risk similar usability issues as variable shadowing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However if we do other logic by identity name (eg Consul and Vault)

The vault block can be defined at the group level but gets consumed at the task level, so I think we don't need to worry about that. For Consul we'll likely be defining them on service blocks, which I think are unique per task group anyways? So I think we're ok with leaving this unchanged.

client/widmgr/widmgr.go Outdated Show resolved Hide resolved
client/allocrunner/taskrunner/identity_hook.go Outdated Show resolved Hide resolved
nomad/alloc_endpoint.go Show resolved Hide resolved
nomad/alloc_endpoint.go Outdated Show resolved Hide resolved
nomad/alloc_endpoint.go Outdated Show resolved Hide resolved
now := time.Now().UTC()
maxIndex := uint64(0)
for _, idReq := range args.Identities {
out, err := state.AllocByID(ws, idReq.AllocID)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each blocking run function call operates on a snapshot (if I'm understanding the code in rpc.go correctly), so the allocations we got above should be unchanged by the time we get here. Can we reuse them? Should we add to the list of rejections when we first try to get the allocation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! You made me realize I was looking up the alloc for each identity, despite the current implementation only sending identities for a single alloc in a single request... so we'd just load the same alloc over and over. 🤦

I fixed that and then reused the allocs we look up.

The only potential downside I see is that this code now diverges wildly from other blocking queries. I didn't do an exhaustive search but most if not all existing blocking queries seem to have return blockingRPC(...) at the end and stuff all of their post-auth/post-validation logic into that blockingRPC callback.

I don't see a reason this is a problem, but after so many times copying and pasting (:grimacing:) the same RPC code around it does feel funny to go my own way. The way the callback mutates some captured variables for use after the blocking callback (allocs and thresholdMet) feels icky, but I think that's my only complaint.

In fact this split should make it far easier to reuse the bottom identity creation bits in Alloc.GetAllocs when we implement the "create alternate identities when allocs are first fetched" optimization!

(Sorry for the long ramble. Even though the tests over this are fairly comprehensive I'm really having to talk myself into straying from the well trod return blockingRPC(...) path.)

//only be called by 1.6 clients

widFound := false
for _, wid := range task.Identities {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we canonicalize the task.Identities so there's a task.Identity set with that name? In which case, shouldn't we sign that identity as well here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Task.Identity will still be handled in the plan applier and committed to raft... both in this PR and in my PoC branch.

Once I ran into the "oh no how can we ever upgrade 1.5/1.6 tokens in env vars" issue mentioned above, I started avoiding touching the default Identity. This is not to say we should necessarily ship 1.7 that way, just saying that (a) I haven't figured out an upgrade path yet and (b) if/when we do figure it out I think it will be nice to have in one nice tidy PR.

You did make me realize there's a bug here: I don't short circuit if there's 0 alternate identities, so the task crashes. Fixing (and adding tests obviously!)

Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there's a fresh pile of merge conflicts to resolve, but once those are fixed this LGTM!

Co-authored-by: Tim Gross <tgross@hashicorp.com>
@schmichael
Copy link
Member Author

Rebased, merged, and giving CI a chance to make sure I didn't break something.

@schmichael schmichael merged commit 0e22fc1 into main Aug 15, 2023
21 checks passed
@schmichael schmichael deleted the f-alt-wid branch August 15, 2023 16:11
tgross added a commit that referenced this pull request Sep 26, 2023
Encoded JWTs include an `alg` header key that tells the verifier which signature
algorithm to use. Bafflingly, the JWT standard allows a value of `"none"` here
which bypasses signature verification.

In all shipped versions of Nomad, we explicitly configure verification to a
specific algorithm and ignore the header value entirely to avoid this protocol
flaw. But in #18123 we updated our JWT library to `go-jose`, which rightfully
doesn't support `"none"` but this detail isn't encoded anywhere in our code
base. Add a test that ensures we catch any regressions in the library.
tgross added a commit that referenced this pull request Sep 26, 2023
Encoded JWTs include an `alg` header key that tells the verifier which signature
algorithm to use. Bafflingly, the JWT standard allows a value of `"none"` here
which bypasses signature verification.

In all shipped versions of Nomad, we explicitly configure verification to a
specific algorithm and ignore the header value entirely to avoid this protocol
flaw. But in #18123 we updated our JWT library to `go-jose`, which rightfully
doesn't support `"none"` but this detail isn't encoded anywhere in our code
base. Add a test that ensures we catch any regressions in the library.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants