-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
identity: add support for multiple identities + audiences #18123
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking pretty good so far!
@@ -90,12 +90,8 @@ func (tr *TaskRunner) setNomadToken(token string) { | |||
defer tr.nomadTokenLock.Unlock() | |||
tr.nomadToken = token | |||
|
|||
if id := tr.task.Identity; id != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did we miss a bug here by not having the && id.Env
conditional previously?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No we just left it up to taskenv.Builder whether or not to include the token in the env by also storing the injectWorkloadToken flag. It's what we do for the Vault token too, and honestly I don't know why.
for name, token := range b.workloadTokens { | ||
envMap[WorkloadToken+"_"+name] = token | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it's not in the RFC except as a future work item, but I'm thinking about how this might support group
-scoped identities in the future. If we don't change the name for the scope, I think we end up with a nice way to have task-scope identities override group-scope identities. For example:
group {
identity {
name = "foo"
aud = ["job.example.com"]
}
task {
identity {
name = "foo"
aud = ["task.example.com"]
}
}
}
But wanted to note that here just in case you're thinking that might not be a great idea. The service-scoped identity blocks I've been writing about in the Consul Workload Identity RFC don't care either way because those won't get exposed directly to tasks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, that would work, but do users need it? My kneejerk reaction is that we should enforce identity names to be unique within a task group. That neatly solves env var name conflicts as well.
However if we do other logic by identity name (eg Consul and Vault) I can see why defaulting-at-group and overriding-in-task might be beneficial if not outright required for supporting all existing Consul and Vault behaviors.
Placement in group
vs task
defines the identity's scope which is nice and tidy. Overrides don't change those semantics but risk similar usability issues as variable shadowing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However if we do other logic by identity name (eg Consul and Vault)
The vault
block can be defined at the group level but gets consumed at the task level, so I think we don't need to worry about that. For Consul we'll likely be defining them on service
blocks, which I think are unique per task group anyways? So I think we're ok with leaving this unchanged.
nomad/alloc_endpoint.go
Outdated
now := time.Now().UTC() | ||
maxIndex := uint64(0) | ||
for _, idReq := range args.Identities { | ||
out, err := state.AllocByID(ws, idReq.AllocID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each blocking run
function call operates on a snapshot (if I'm understanding the code in rpc.go
correctly), so the allocations we got above should be unchanged by the time we get here. Can we reuse them? Should we add to the list of rejections when we first try to get the allocation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! You made me realize I was looking up the alloc for each identity, despite the current implementation only sending identities for a single alloc in a single request... so we'd just load the same alloc over and over. 🤦
I fixed that and then reused the allocs we look up.
The only potential downside I see is that this code now diverges wildly from other blocking queries. I didn't do an exhaustive search but most if not all existing blocking queries seem to have return blockingRPC(...)
at the end and stuff all of their post-auth/post-validation logic into that blockingRPC callback.
I don't see a reason this is a problem, but after so many times copying and pasting (:grimacing:) the same RPC code around it does feel funny to go my own way. The way the callback mutates some captured variables for use after the blocking callback (allocs
and thresholdMet
) feels icky, but I think that's my only complaint.
In fact this split should make it far easier to reuse the bottom identity creation bits in Alloc.GetAllocs
when we implement the "create alternate identities when allocs are first fetched" optimization!
(Sorry for the long ramble. Even though the tests over this are fairly comprehensive I'm really having to talk myself into straying from the well trod return blockingRPC(...)
path.)
nomad/alloc_endpoint.go
Outdated
//only be called by 1.6 clients | ||
|
||
widFound := false | ||
for _, wid := range task.Identities { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we canonicalize the task.Identities
so there's a task.Identity
set with that name? In which case, shouldn't we sign that identity as well here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Task.Identity
will still be handled in the plan applier and committed to raft... both in this PR and in my PoC branch.
Once I ran into the "oh no how can we ever upgrade 1.5/1.6 tokens in env vars" issue mentioned above, I started avoiding touching the default Identity. This is not to say we should necessarily ship 1.7 that way, just saying that (a) I haven't figured out an upgrade path yet and (b) if/when we do figure it out I think it will be nice to have in one nice tidy PR.
You did make me realize there's a bug here: I don't short circuit if there's 0 alternate identities, so the task crashes. Fixing (and adding tests obviously!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there's a fresh pile of merge conflicts to resolve, but once those are fixed this LGTM!
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Rebased, merged, and giving CI a chance to make sure I didn't break something. |
Encoded JWTs include an `alg` header key that tells the verifier which signature algorithm to use. Bafflingly, the JWT standard allows a value of `"none"` here which bypasses signature verification. In all shipped versions of Nomad, we explicitly configure verification to a specific algorithm and ignore the header value entirely to avoid this protocol flaw. But in #18123 we updated our JWT library to `go-jose`, which rightfully doesn't support `"none"` but this detail isn't encoded anywhere in our code base. Add a test that ensures we catch any regressions in the library.
Encoded JWTs include an `alg` header key that tells the verifier which signature algorithm to use. Bafflingly, the JWT standard allows a value of `"none"` here which bypasses signature verification. In all shipped versions of Nomad, we explicitly configure verification to a specific algorithm and ignore the header value entirely to avoid this protocol flaw. But in #18123 we updated our JWT library to `go-jose`, which rightfully doesn't support `"none"` but this detail isn't encoded anywhere in our code base. Add a test that ensures we catch any regressions in the library.
The Good
Allow for multiple
identity{}
blocks on tasks with distinct names and identities. The existing identity is referred to internally as thedefault Nomad identity
since it is intended for the workload to identify itself to Nomad's API. Other identities are referred to asalternate identities
as they're optional and for use with 3rd party services (Consul, Vault, Traefik, user apps, etc).Validation does not prevent creating "hidden" identities (identities with both env=false and file=false) because followup PRs are likely to make use of that (just like we already do for the default identity).
The
WIDMgr
(Workload ID Manager) is currently a light wrapper around theAlloc.GetIdentities
RPC, but could someday implement optimizations such as batching requests. Since identity minting is purely CPU bound I'm not sure if that will ever be necessary, but encapsulation for encapsulation's sake is nice sometimes too.I did choose to duplicate the token's expiration time within the
Alloc.GetIdentities
RPC's response instead of the approach in the PoC branch #17434 which actually validated the JWTs and plucked out the expirations. Making identity fetching depend on public key fetching seemed needlessly complex, so I chose to just duplicate/de-normalize the expiry.The way
Alloc.GetIdentities
uses Stale+WaitIndex is a bit unique: since the JWTs themselves are stateless, the WaitIndex is only used to ensure the server serving the request knows the Alloc exists (otherwise races between Client's reading allocs and servers processing Raft logs could cause spurious failures; not to mention the problem of restoring servers being arbitrarily stale). As long as the RPC can tell the alloc existed, any server can sign a JWT. So despite using Stale+WaitIndex, we don't use a blocking query to wait for some state to change and impact identities. Even when I add expiration support the server can't use a blocking query to hold off clients until they really need a new identity because servers won't record the expiration of tokens.The Bad
This doesn't implement the
Alloc.GetAllocs
-includes-JWTs optimization of the proof of concept branch. I plan on moving that over, but the plumbing is ugly and felt like a distraction from these core bits. It is only an optimization and does not change functionality.The current implementation of
Alloc.GetIdentities
/WIDMgr
will block on contacting servers on agent restart or node reboot. This prevents using alternate identities on disconnected nodes. UsingPrivateDir
to store the JWTs across restarts seems necessary, but I don't have a tidy solution across reboots.The Ugly
The HCL/JSON story is a bit awkward:
But it seems to work! The default identity stays in place, and all of the new identities are in the slice.
Intentionally Delaying
Without expiration there's still no way to use this securely, so I'm going to "soft launch" until then.