identity: add support for multiple identities + audiences #18123

schmichael · 2023-08-02T00:59:15Z

The Good

Allow for multiple identity{} blocks on tasks with distinct names and identities. The existing identity is referred to internally as the default Nomad identity since it is intended for the workload to identify itself to Nomad's API. Other identities are referred to as alternate identities as they're optional and for use with 3rd party services (Consul, Vault, Traefik, user apps, etc).

Validation does not prevent creating "hidden" identities (identities with both env=false and file=false) because followup PRs are likely to make use of that (just like we already do for the default identity).
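A minimal sketch of the validation rule described above, assuming a simplified `Identity` type and a hypothetical `validateIdentities` helper (neither is the PR's actual code). It rejects duplicate names but deliberately lets "hidden" identities through; rejecting empty names is an extra assumption made for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// Identity is a hypothetical, simplified stand-in for a task's identity block.
type Identity struct {
	Name     string
	Audience []string
	Env      bool // expose the token as an environment variable
	File     bool // write the token to a file
}

// validateIdentities rejects empty and duplicate names. It does NOT reject
// "hidden" identities (Env and File both false), mirroring the choice above.
func validateIdentities(ids []Identity) error {
	seen := make(map[string]struct{}, len(ids))
	for _, id := range ids {
		if id.Name == "" {
			return errors.New("identity name must not be empty")
		}
		if _, dup := seen[id.Name]; dup {
			return fmt.Errorf("duplicate identity name %q", id.Name)
		}
		seen[id.Name] = struct{}{}
	}
	return nil
}

func main() {
	ids := []Identity{
		{Name: "consul", Audience: []string{"consul.io"}, Env: true},
		{Name: "hidden", Audience: []string{"example.com"}}, // allowed: no env, no file
	}
	fmt.Println(validateIdentities(ids)) // prints <nil>
}
```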

The WIDMgr (Workload ID Manager) is currently a light wrapper around the Alloc.GetIdentities RPC, but could someday implement optimizations such as batching requests. Since identity minting is purely CPU bound I'm not sure if that will ever be necessary, but encapsulation for encapsulation's sake is nice sometimes too.

I did choose to duplicate the token's expiration time within the Alloc.GetIdentities RPC's response instead of the approach in the PoC branch #17434 which actually validated the JWTs and plucked out the expirations. Making identity fetching depend on public key fetching seemed needlessly complex, so I chose to just duplicate/de-normalize the expiry.

The way Alloc.GetIdentities uses Stale+WaitIndex is a bit unusual: since the JWTs themselves are stateless, the WaitIndex is only used to ensure the server serving the request knows the Alloc exists (otherwise races between clients reading allocs and servers processing Raft logs could cause spurious failures, not to mention the problem of restoring servers being arbitrarily stale). As long as the RPC can tell the alloc existed, any server can sign a JWT. So despite using Stale+WaitIndex, we don't use a blocking query to wait for some state to change and impact identities. Even when I add expiration support, the server can't use a blocking query to hold off clients until they really need a new identity, because servers won't record the expiration of tokens.
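The decision described above can be sketched roughly as follows. The `decide` function and its three outcomes are hypothetical and only model the reasoning; they are not the endpoint's real control flow.

```go
package main

import "fmt"

// decision is a hypothetical outcome for one identity request.
type decision string

const (
	signNow     decision = "sign"   // alloc is known: any server may sign
	keepWaiting decision = "wait"   // server is behind the client's view of Raft
	rejectReq   decision = "reject" // server is caught up and the alloc is unknown
)

// decide models the use of WaitIndex: it only matters while the alloc is
// still unknown to this server. No later state change affects the result.
func decide(allocKnown bool, serverIndex, waitIndex uint64) decision {
	switch {
	case allocKnown:
		return signNow
	case serverIndex < waitIndex:
		return keepWaiting
	default:
		return rejectReq
	}
}

func main() {
	fmt.Println(decide(false, 90, 100))  // wait: stale server may simply not have the alloc yet
	fmt.Println(decide(true, 100, 100))  // sign
	fmt.Println(decide(false, 120, 100)) // reject
}
```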

The Bad

This doesn't implement the Alloc.GetAllocs-includes-JWTs optimization of the proof of concept branch. I plan on moving that over, but the plumbing is ugly and felt like a distraction from these core bits. It is only an optimization and does not change functionality.

The current implementation of Alloc.GetIdentities/WIDMgr will block on contacting servers on agent restart or node reboot. This prevents using alternate identities on disconnected nodes. Using PrivateDir to store the JWTs across restarts seems necessary, but I don't have a tidy solution across reboots.

The Ugly

The HCL/JSON story is a bit awkward:

jobspec2/parse.go has to pluck the default Identity out of the Identities slice to put it back on Identity in case a 1.7 CLI is posting to a 1.6 API.
WorkloadIdentity.Canonicalize has to do the same dance in case 1.6 JSON is sent to a 1.7 API.

But it seems to work! The default identity stays in place, and all of the new identities are in the slice.
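The "pluck the default out of the slice" dance can be sketched like this. The types are simplified, `splitDefault` is a hypothetical helper, and the default identity being named "default" is an assumption for the example.

```go
package main

import "fmt"

// defaultIdentityName is the assumed name of the default Nomad identity.
const defaultIdentityName = "default"

// Identity and Task are simplified, hypothetical stand-ins for the jobspec types.
type Identity struct {
	Name string
}

type Task struct {
	Identity   *Identity   // the default identity (the only field 1.6 knows about)
	Identities []*Identity // alternate identities (new in 1.7)
}

// splitDefault moves the default identity from Identities onto Identity and
// leaves only alternate identities in the slice.
func splitDefault(t *Task) {
	alternates := make([]*Identity, 0, len(t.Identities))
	for _, id := range t.Identities {
		if id.Name == defaultIdentityName && t.Identity == nil {
			t.Identity = id
			continue
		}
		alternates = append(alternates, id)
	}
	t.Identities = alternates
}

func main() {
	t := &Task{Identities: []*Identity{{Name: "default"}, {Name: "vault"}}}
	splitDefault(t)
	fmt.Println(t.Identity.Name, len(t.Identities)) // default 1
}
```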

Intentionally Delaying

Docs
Changelog
Without expiration there's still no way to use this securely, so I'm going to "soft launch" until then.

tgross

Looking pretty good so far!

tgross · 2023-08-02T14:04:01Z

client/allocrunner/taskrunner/task_runner_getters.go

@@ -90,12 +90,8 @@ func (tr *TaskRunner) setNomadToken(token string) {
 	defer tr.nomadTokenLock.Unlock()
 	tr.nomadToken = token

-	if id := tr.task.Identity; id != nil {


Did we miss a bug here by not having the && id.Env conditional previously?

No, we just left it up to taskenv.Builder whether or not to include the token in the env by also storing the injectWorkloadToken flag. It's what we do for the Vault token too, and honestly I don't know why.

tgross · 2023-08-02T14:14:47Z

client/taskenv/env.go

+	for name, token := range b.workloadTokens {
+		envMap[WorkloadToken+"_"+name] = token
 	}


I know it's not in the RFC except as a future work item, but I'm thinking about how this might support group-scoped identities in the future. If we don't change the name for the scope, I think we end up with a nice way to have task-scope identities override group-scope identities. For example:

group {
  identity {
    name = "foo"
    aud  = ["job.example.com"]
  }
  task {
    identity {
      name = "foo"
      aud  = ["task.example.com"]
    }
  }
}

But wanted to note that here just in case you're thinking that might not be a great idea. The service-scoped identity blocks I've been writing about in the Consul Workload Identity RFC don't care either way because those won't get exposed directly to tasks.

Hm, that would work, but do users need it? My kneejerk reaction is that we should enforce identity names to be unique within a task group. That neatly solves env var name conflicts as well.

However if we do other logic by identity name (eg Consul and Vault) I can see why defaulting-at-group and overriding-in-task might be beneficial if not outright required for supporting all existing Consul and Vault behaviors.

Placement in group vs task defines the identity's scope which is nice and tidy. Overrides don't change those semantics but risk similar usability issues as variable shadowing.

> However if we do other logic by identity name (eg Consul and Vault)

The vault block can be defined at the group level but gets consumed at the task level, so I think we don't need to worry about that. For Consul we'll likely be defining them on service blocks, which I think are unique per task group anyways? So I think we're ok with leaving this unchanged.
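For reference, the env var naming from the diff in this thread can be sketched as below. The prefix value `NOMAD_TOKEN` is an assumption about what `WorkloadToken` holds, and `tokenEnv` is a hypothetical helper. Because the variable name is derived only from the identity name, two identities with the same name visible to one task would collide, which is why name uniqueness (or a defined override order) matters.

```go
package main

import (
	"fmt"
	"sort"
)

// workloadTokenPrefix is the assumed value of the WorkloadToken constant.
const workloadTokenPrefix = "NOMAD_TOKEN"

// tokenEnv maps identity names to environment variable names the same way
// the loop in the diff does: prefix + "_" + identity name.
func tokenEnv(tokens map[string]string) map[string]string {
	env := make(map[string]string, len(tokens))
	for name, token := range tokens {
		env[workloadTokenPrefix+"_"+name] = token
	}
	return env
}

func main() {
	env := tokenEnv(map[string]string{
		"foo":    "<jwt for foo>",
		"consul": "<jwt for consul>",
	})

	keys := make([]string, 0, len(env))
	for k := range env {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	fmt.Println(keys) // [NOMAD_TOKEN_consul NOMAD_TOKEN_foo]
}
```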

tgross · 2023-08-02T15:03:02Z

nomad/alloc_endpoint.go

+			now := time.Now().UTC()
+			maxIndex := uint64(0)
+			for _, idReq := range args.Identities {
+				out, err := state.AllocByID(ws, idReq.AllocID)


Each blocking run function call operates on a snapshot (if I'm understanding the code in rpc.go correctly), so the allocations we got above should be unchanged by the time we get here. Can we reuse them? Should we add to the list of rejections when we first try to get the allocation?

Yes! You made me realize I was looking up the alloc for each identity, despite the current implementation only sending identities for a single alloc in a single request... so we'd just load the same alloc over and over. 🤦

I fixed that and then reused the allocs we look up.

The only potential downside I see is that this code now diverges wildly from other blocking queries. I didn't do an exhaustive search but most if not all existing blocking queries seem to have return blockingRPC(...) at the end and stuff all of their post-auth/post-validation logic into that blockingRPC callback.

I don't see a reason this is a problem, but after so many times copying and pasting (:grimacing:) the same RPC code around it does feel funny to go my own way. The way the callback mutates some captured variables for use after the blocking callback (allocs and thresholdMet) feels icky, but I think that's my only complaint.

In fact this split should make it far easier to reuse the bottom identity creation bits in Alloc.GetAllocs when we implement the "create alternate identities when allocs are first fetched" optimization!

(Sorry for the long ramble. Even though the tests over this are fairly comprehensive I'm really having to talk myself into straying from the well trod return blockingRPC(...) path.)
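A rough sketch of the "look each alloc up once, then reuse it" shape discussed in this thread. The request and alloc types, the `lookup` callback, and `resolveAllocs` are all hypothetical; the real endpoint reads from a state snapshot rather than a function argument.

```go
package main

import (
	"errors"
	"fmt"
)

// IdentityRequest and Alloc are simplified, hypothetical types.
type IdentityRequest struct {
	AllocID      string
	IdentityName string
}

type Alloc struct {
	ID string
}

var errNotFound = errors.New("alloc not found")

// resolveAllocs looks up every distinct alloc ID exactly once. Requests for
// allocs that cannot be found are collected as rejections at lookup time, so
// the signing step afterwards can reuse the returned map without new reads.
// For brevity this sketch treats any lookup error as "not found".
func resolveAllocs(reqs []IdentityRequest, lookup func(id string) (*Alloc, error)) (map[string]*Alloc, []IdentityRequest) {
	allocs := make(map[string]*Alloc)
	missing := make(map[string]bool)
	var rejected []IdentityRequest

	for _, req := range reqs {
		if _, ok := allocs[req.AllocID]; ok {
			continue // already loaded
		}
		if missing[req.AllocID] {
			rejected = append(rejected, req)
			continue
		}
		alloc, err := lookup(req.AllocID)
		if err != nil || alloc == nil {
			missing[req.AllocID] = true
			rejected = append(rejected, req)
			continue
		}
		allocs[req.AllocID] = alloc
	}
	return allocs, rejected
}

func main() {
	lookup := func(id string) (*Alloc, error) {
		if id == "a1" {
			return &Alloc{ID: id}, nil
		}
		return nil, errNotFound
	}
	reqs := []IdentityRequest{
		{AllocID: "a1", IdentityName: "consul"},
		{AllocID: "a1", IdentityName: "vault"},
		{AllocID: "gone", IdentityName: "consul"},
	}
	allocs, rejected := resolveAllocs(reqs, lookup)
	fmt.Println(len(allocs), len(rejected)) // 1 1
}
```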

tgross · 2023-08-02T15:06:02Z

nomad/alloc_endpoint.go

+				//only be called by 1.6 clients
+
+				widFound := false
+				for _, wid := range task.Identities {


Don't we canonicalize the task.Identities so there's a task.Identity set with that name? In which case, shouldn't we sign that identity as well here?

Task.Identity will still be handled in the plan applier and committed to raft... both in this PR and in my PoC branch.

Once I ran into the "oh no how can we ever upgrade 1.5/1.6 tokens in env vars" issue mentioned above, I started avoiding touching the default Identity. This is not to say we should necessarily ship 1.7 that way, just saying that (a) I haven't figured out an upgrade path yet and (b) if/when we do figure it out I think it will be nice to have in one nice tidy PR.

You did make me realize there's a bug here: I don't short circuit if there's 0 alternate identities, so the task crashes. Fixing (and adding tests obviously!)

tgross

Looks like there's a fresh pile of merge conflicts to resolve, but once those are fixed this LGTM!

Co-authored-by: Tim Gross <tgross@hashicorp.com>

schmichael · 2023-08-15T00:00:38Z

Rebased, merged, and giving CI a chance to make sure I didn't break something.

Encoded JWTs include an `alg` header key that tells the verifier which signature algorithm to use. Bafflingly, the JWT standard allows a value of `"none"` here which bypasses signature verification. In all shipped versions of Nomad, we explicitly configure verification to a specific algorithm and ignore the header value entirely to avoid this protocol flaw. But in #18123 we updated our JWT library to `go-jose`, which rightfully doesn't support `"none"` but this detail isn't encoded anywhere in our code base. Add a test that ensures we catch any regressions in the library.
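To make the `"none"` problem concrete, here is a standard-library-only sketch of what such a token looks like and a guard that refuses it. This is not go-jose's API and performs no signature verification; `makeToken` and `checkAlg` are hypothetical helpers, and `RS256` is just an example of a pinned algorithm.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// makeToken builds a JWT-shaped string for demonstration only. The signature
// segment is passed through untouched and is never cryptographically valid.
func makeToken(alg, signature string) string {
	header := base64.RawURLEncoding.EncodeToString([]byte(`{"alg":"` + alg + `"}`))
	payload := base64.RawURLEncoding.EncodeToString([]byte(`{"sub":"example"}`))
	return header + "." + payload + "." + signature
}

// checkAlg rejects any token whose header does not name exactly the pinned
// algorithm, or whose signature segment is empty. A real verifier must still
// check the signature itself; this only shows the guard against "none".
func checkAlg(token, want string) error {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return errors.New("token must have three segments")
	}
	raw, err := base64.RawURLEncoding.DecodeString(parts[0])
	if err != nil {
		return fmt.Errorf("decoding header: %w", err)
	}
	var header struct {
		Alg string `json:"alg"`
	}
	if err := json.Unmarshal(raw, &header); err != nil {
		return fmt.Errorf("parsing header: %w", err)
	}
	if header.Alg != want {
		return fmt.Errorf("unexpected alg %q, want %q", header.Alg, want)
	}
	if parts[2] == "" {
		return errors.New("empty signature")
	}
	return nil
}

func main() {
	unsigned := makeToken("none", "") // header says "none", signature is empty
	fmt.Println(checkAlg(unsigned, "RS256") != nil) // true: rejected

	shaped := makeToken("RS256", "c2ln") // passes the guard; signature still unverified
	fmt.Println(checkAlg(shaped, "RS256") == nil) // true
}
```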

schmichael requested a review from tgross August 2, 2023 00:59

tgross reviewed Aug 2, 2023

schmichael mentioned this pull request Aug 2, 2023

Add support for transparent authentication to the Task API #18125

Open

schmichael marked this pull request as ready for review August 9, 2023 03:33

schmichael requested a review from tgross August 9, 2023 03:34

tgross approved these changes Aug 14, 2023

identity: support for multiple identities + aud

70a32fc

Co-authored-by: Tim Gross <tgross@hashicorp.com>

schmichael force-pushed the f-alt-wid branch from 04aa490 to 70a32fc Compare August 14, 2023 23:58

busld

4c34f19

schmichael merged commit 0e22fc1 into main Aug 15, 2023
21 checks passed

schmichael deleted the f-alt-wid branch August 15, 2023 16:11

tgross mentioned this pull request Sep 26, 2023

WI: add test to verify we don't allow empty signatures for JWT #18586

Merged

schmichael mentioned this pull request Oct 31, 2023

docs: changelog & basic docs for 1.7 WI changes #18936

Merged

tgross mentioned this pull request Jan 2, 2024

After upgrading from 1.6 to 1.7 receiving errors on nomad variable jobs #19555

Closed

tgross mentioned this pull request Jun 28, 2024

Multiple Workload Identities through multiple identity blocks in jobspec #16194

Closed
